Recovery & Data Loss

15.7.1 How do I recover a failed master?

The recovery method depends on the scenario:

Scenario 1: Master recoverable with GTID consistency

The old master's GTID position is a subset of the new master's
replication-manager automatically rejoins the old master as a slave
No data loss, no manual intervention
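
The subset condition above can be illustrated with a small check. This is a minimal sketch, not replication-manager's own logic: it assumes MariaDB-style GTID positions (`domain-server-seqno`), compares only per-domain sequence numbers, and assumes the second argument is the position the elected master had reached at election time (for example the FailoverIOGtid value from the crash record). The function names are hypothetical.

```python
def parse_gtid_pos(gtid_pos: str) -> dict[int, int]:
    """Parse a MariaDB GTID position such as '0-3310-12,1-3311-7'
    into {domain_id: highest_seq_no}."""
    positions: dict[int, int] = {}
    for item in gtid_pos.split(","):
        item = item.strip()
        if not item:
            continue
        domain, _server, seq = (int(part) for part in item.split("-"))
        positions[domain] = max(seq, positions.get(domain, 0))
    return positions


def can_rejoin_without_divergence(old_master_pos: str, elected_pos_at_failover: str) -> bool:
    """True when every domain on the old master is at or behind the position
    the elected master had reached at election time (simplified comparison)."""
    old = parse_gtid_pos(old_master_pos)
    elected = parse_gtid_pos(elected_pos_at_failover)
    return all(
        domain in elected and seq <= elected[domain]
        for domain, seq in old.items()
    )


print(can_rejoin_without_divergence("0-3310-10", "0-3310-12"))  # True: Scenario 1
print(can_rejoin_without_divergence("0-3310-15", "0-3310-12"))  # False: Scenario 2
```

A False result corresponds to Scenario 2, where the old master holds transactions the new master never received.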

Scenario 2: Master has extra transactions (diverged)

Method A: Flashback (MariaDB with flashback enabled)

autorejoin-flashback = true

Rolls back diverged transactions
Re-syncs with new master
Diverged data is saved

Method B: mysqldump

autorejoin-mysqldump = true

Dumps database from new master
Restores to old master
Slower but reliable

Method C: Physical backup

autorejoin-backup-binlog = true
autorejoin-semisync = false

Uses physical backups (ZFS, LVM snapshots)
Fastest for large databases

Method D: Manual re-provision

Provision old master from new master
Most reliable for complex scenarios

Crash information is saved: check /var/lib/replication-manager/crash*Unixtime*/ for binary logs and election details.
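
To find the most recent crash directory quickly, something like the following could be used. It is a sketch under one assumption: that the directory name is the literal prefix `crash` followed directly by the Unix timestamp in digits. Adjust the pattern if your installation names the directories differently.

```python
import re
from datetime import datetime, timezone
from pathlib import Path

# Assumed naming scheme: crash<Unixtime>, e.g. crash1700000000
CRASH_DIR_PATTERN = re.compile(r"^crash(\d+)$")


def list_crash_dirs(base_dir: str) -> list[tuple[datetime, Path]]:
    """Return (crash time in UTC, path) pairs, newest first."""
    found: list[tuple[datetime, Path]] = []
    base = Path(base_dir)
    if not base.is_dir():
        return found
    for entry in base.iterdir():
        match = CRASH_DIR_PATTERN.match(entry.name)
        if entry.is_dir() and match:
            when = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
            found.append((when, entry))
    return sorted(found, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    for when, path in list_crash_dirs("/var/lib/replication-manager"):
        print(when.isoformat(), path)
```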

Reference: /pages/05.configuration/03.failover/02.crash-recovery/docs.md

15.7.2 When should I use flashback vs mysqldump recovery?

Flashback recovery:

Requirements:

MariaDB (flashback not available in MySQL)
Binary logs in ROW format (binlog_format = ROW)
Flashback enabled: autorejoin-flashback = true
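
The server-side requirements can be verified before relying on flashback. The sketch below is illustrative only: it takes the output of `SELECT VERSION()` and a dictionary of server variables (for example from `SHOW GLOBAL VARIABLES`) as plain inputs, so it makes no database connection itself. The function name is hypothetical, and it does not check the replication-manager setting autorejoin-flashback.

```python
def flashback_blockers(version: str, variables: dict[str, str]) -> list[str]:
    """Return the unmet server-side flashback requirements (empty list = OK)."""
    blockers = []
    # MariaDB version strings contain "MariaDB", e.g. "10.11.6-MariaDB-log"
    if "mariadb" not in version.lower():
        blockers.append("server is not MariaDB")
    if variables.get("binlog_format", "").upper() != "ROW":
        blockers.append("binlog_format is not ROW")
    return blockers


print(flashback_blockers("10.11.6-MariaDB-log", {"binlog_format": "ROW"}))
print(flashback_blockers("8.0.36", {"binlog_format": "MIXED"}))
```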

Advantages:

Very fast (seconds to minutes)
No full data copy needed
Reverses diverged transactions

Disadvantages:

Only practical for small divergence
Requires MariaDB-specific features
May fail when the diverged transactions include DDL changes

Use when: the divergence is small, the environment is MariaDB, and fast recovery is the priority

mysqldump recovery:

Requirements:

Any MySQL/MariaDB/Percona version
autorejoin-mysqldump = true

Advantages:

Works on any database version
Reliable and predictable
No special prerequisites

Disadvantages:

Slow for large databases (hours for TB+ databases)
Requires full data dump/restore
May lock tables during the dump, depending on storage engine and dump options

Use when: the divergence is large, the server is MySQL (not MariaDB), or consistency must be guaranteed

Physical backup recovery:

Requirements:

ZFS, LVM, or storage snapshots
autorejoin-backup-binlog = true

Advantages:

Fastest for large databases
Minimal overhead
Block-level copy

Disadvantages:

Requires storage infrastructure
More complex setup

Use when: the database is very large, the storage supports snapshots, and the fastest possible recovery is critical
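
The "Use when" guidance for the three methods can be condensed into a small decision helper. This is only one reading of the criteria listed above, not behavior built into replication-manager; the function and parameter names are hypothetical.

```python
def choose_rejoin_method(
    is_mariadb: bool,
    small_divergence: bool,
    divergence_has_ddl: bool,
    very_large_dataset: bool,
    has_storage_snapshots: bool,
) -> str:
    """Pick a rejoin method following the criteria in this section."""
    # Flashback: MariaDB only, small divergence, no DDL in the diverged part
    if is_mariadb and small_divergence and not divergence_has_ddl:
        return "flashback"
    # Physical backup: pays off for very large data sets with snapshot support
    if very_large_dataset and has_storage_snapshots:
        return "physical backup"
    # mysqldump: works everywhere, slower but predictable
    return "mysqldump"


print(choose_rejoin_method(True, True, False, False, False))   # flashback
print(choose_rejoin_method(False, False, False, True, True))   # physical backup
print(choose_rejoin_method(False, False, False, False, False)) # mysqldump
```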

Reference: /pages/05.configuration/03.failover/02.crash-recovery/docs.md

15.7.3 Where can I find crash information after failover?

replication-manager records crash details in multiple locations:

Location 1: Crash directory

/var/lib/replication-manager/crash*Unixtime*/

Contains:

Binary logs from elected master at time of election
Replication state when node was still master
Useful for manual recovery and auditing

Location 2: Cluster state file

/var/lib/replication-manager/cluster_name.json

Contains:

{
  "crashes": [
    {
      "URL": "127.0.0.1:3310",
      "FailoverMasterLogFile": "bin.000001",
      "FailoverMasterLogPos": "459",
      "FailoverSemiSyncSlaveStatus": true,
      "FailoverIOGtid": [{"DomainID": 0, "ServerID": 3310, "SeqNo": 1}],
      "ElectedMasterURL": "127.0.0.1:3311"
    }
  ]
}
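
The crash entries can be summarized with a few lines of Python. This sketch assumes the state file has the shape shown above; the embedded sample is that same example, and the helper name is hypothetical. To read a real file, load it with `json.load` instead of using the sample string.

```python
import json

SAMPLE_STATE = """
{
  "crashes": [
    {
      "URL": "127.0.0.1:3310",
      "FailoverMasterLogFile": "bin.000001",
      "FailoverMasterLogPos": "459",
      "FailoverSemiSyncSlaveStatus": true,
      "FailoverIOGtid": [{"DomainID": 0, "ServerID": 3310, "SeqNo": 1}],
      "ElectedMasterURL": "127.0.0.1:3311"
    }
  ]
}
"""


def summarize_crashes(state: dict) -> list[str]:
    """Build one human-readable line per crash entry."""
    lines = []
    for crash in state.get("crashes", []):
        gtids = ",".join(
            f"{g['DomainID']}-{g['ServerID']}-{g['SeqNo']}"
            for g in crash.get("FailoverIOGtid", [])
        )
        lines.append(
            f"{crash['URL']} -> {crash['ElectedMasterURL']} "
            f"at {crash['FailoverMasterLogFile']}:{crash['FailoverMasterLogPos']} "
            f"gtid={gtids}"
        )
    return lines


for line in summarize_crashes(json.loads(SAMPLE_STATE)):
    print(line)
# prints: 127.0.0.1:3310 -> 127.0.0.1:3311 at bin.000001:459 gtid=0-3310-1
```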

Location 3: API endpoint

GET /api/clusters/cluster_name/crashes

When crashes are cleared: crash records are removed once the cluster topology returns to a state with no errors.

Reference: /pages/07.howto/03.toubleshoot-crashes/docs.md:26