Multi-layer protection:
1. Multiple checks - Master failure is verified N times before acting
2. False positive detection:
3. Time-based protection:
   failover-time-limit: Prevents repeated failovers within specified timeframe
4. Default manual mode:
   failover-mode = "manual" (default)
5. Replication checks:
   check-replication-state = false
Best practice configuration:
failover-mode = "automatic"
failover-limit = 3
failover-time-limit = 10
failover-at-sync = false
failover-max-slave-delay = 30
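The first layer (verifying a master failure several times before acting) can be sketched as follows. This is a minimal illustration, not replication-manager's implementation: `confirm_master_failure`, `probe`, and `required_failures` are hypothetical names, and the actual number of checks and the interval between them are not stated here.

```python
from typing import Callable


def confirm_master_failure(probe: Callable[[], bool], required_failures: int) -> bool:
    """Return True only if `probe` reports the master down on every attempt.

    `probe` is a hypothetical health check that returns True when the
    master answers. A single successful answer aborts the sequence, so a
    transient glitch does not lead to a failover.
    """
    if required_failures < 1:
        raise ValueError("required_failures must be at least 1")
    for _ in range(required_failures):
        if probe():
            # The master answered: treat earlier failures as a false positive.
            return False
    return True


# Example: a master that fails twice, then answers again.
answers = iter([False, False, True])
print(confirm_master_failure(lambda: next(answers), required_failures=3))  # False
```

Requiring every probe to fail trades a slightly slower reaction for fewer false positives, which is the point of the multi-check layer.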
Reference: /pages/05.configuration/03.failover/01.false-positive-detetection/docs.md
Default behavior: replication-manager waits for the old master to recover.
Reason: First node to restart after total cluster failure could be a delayed slave, leading to significant data loss if promoted.
Configuration parameter: failover-restart-unsafe
When failover-restart-unsafe = false (default):
When failover-restart-unsafe = true:
Recommendation:
Keep the default (false) unless availability is critical
If set to true, ensure automated slave startup procedures
Reference: /pages/04.architecture/02.failover-workflow/docs.md:57
Failover can be rejected for multiple reasons:
No valid slave available:
  failover-max-slave-delay (default: 30 seconds)
  db-servers-prefered-master requirements
  db-servers-ignored-hosts list
Failover limits reached:
  failover-limit counter exceeded (default: 5)
Time limit not met:
  failover-time-limit window
Sync status requirement:
  failover-at-sync = true but no slaves in SYNC status
Cluster state invalid:
Check status: Review logs and cluster state for specific rejection reason.
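The rejection reasons listed here can be modeled as a set of independent checks. The sketch below is only an illustration of that idea: the `Slave` and `FailoverSettings` types and the `rejection_reasons` function are hypothetical, and treating the limit as reached when the counter equals failover-limit is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slave:
    name: str
    delay_seconds: int
    in_sync: bool = False   # semi-sync SYNC status
    ignored: bool = False   # listed in db-servers-ignored-hosts


@dataclass
class FailoverSettings:
    max_slave_delay: int = 30   # failover-max-slave-delay
    limit: int = 5              # failover-limit
    time_limit: int = 10        # failover-time-limit (seconds), 0 disables the window
    at_sync: bool = False       # failover-at-sync


def rejection_reasons(
    slaves: List[Slave],
    settings: FailoverSettings,
    failover_count: int,
    seconds_since_last: Optional[float],
) -> List[str]:
    """Collect every reason a failover would be rejected (empty list = allowed)."""
    reasons = []
    candidates = [
        s for s in slaves
        if not s.ignored and s.delay_seconds <= settings.max_slave_delay
    ]
    if not candidates:
        reasons.append("no valid slave available")
    if failover_count >= settings.limit:
        reasons.append("failover-limit reached")
    if (
        settings.time_limit > 0
        and seconds_since_last is not None
        and seconds_since_last < settings.time_limit
    ):
        reasons.append("failover-time-limit window not elapsed")
    if settings.at_sync and not any(s.in_sync for s in candidates):
        reasons.append("no slave in SYNC status")
    return reasons


slaves = [Slave("db2", delay_seconds=45), Slave("db3", delay_seconds=5, ignored=True)]
print(rejection_reasons(slaves, FailoverSettings(), failover_count=0, seconds_since_last=None))
# ['no valid slave available']
```

Returning all reasons rather than stopping at the first mirrors how an operator would read the logs: several conditions can block a failover at the same time.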
Reference: /pages/04.architecture/02.failover-workflow/docs.md:20
Parameter: failover-time-limit
Purpose: Prevents repeated failovers from the same root cause (flip-flop protection).
Recommended value: 10 seconds
failover-time-limit = 10
Behavior: If previous failover occurred within this time window, new failover is canceled.
Why this matters:
Set to 0: Unlimited failovers (not recommended for production)
Reset counter: Use console/API to manually reset if legitimate failover needed within time window.
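The time-window behavior can be illustrated with a small model. This is a sketch under assumptions, not the tool's code: `FailoverWindow` and its methods are hypothetical, timestamps are passed in explicitly to keep the example deterministic, and `reset` stands in for the manual reset done through the console or API.

```python
from typing import Optional


class FailoverWindow:
    """Cancel a failover that follows the previous one too closely."""

    def __init__(self, time_limit_seconds: int) -> None:
        self.time_limit = time_limit_seconds  # failover-time-limit
        self.last_failover: Optional[float] = None

    def allowed(self, now: float) -> bool:
        # A limit of 0 disables the window; so does having no prior failover.
        if self.time_limit == 0 or self.last_failover is None:
            return True
        return now - self.last_failover >= self.time_limit

    def record(self, now: float) -> None:
        self.last_failover = now

    def reset(self) -> None:
        # Models a manual reset when a legitimate failover is needed.
        self.last_failover = None


window = FailoverWindow(time_limit_seconds=10)
window.record(now=100.0)
print(window.allowed(now=105.0))  # False: only 5 seconds have passed
print(window.allowed(now=115.0))  # True: the 10-second window has elapsed
window.reset()
print(window.allowed(now=105.0))  # True: the window was cleared manually
```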
Reference: /pages/05.configuration/03.failover/docs.md:33
Parameter: failover-at-sync
Purpose: Controls whether failover requires semi-sync SYNC status.
When failover-at-sync = false (default):
When failover-at-sync = true:
Recommendation for automatic failover:
  failover-at-sync = false with failover-max-slave-delay = 30
  failover-at-sync = true with failover-max-slave-delay = 0
Trade-off: Strict sync requirements reduce availability during network issues or high load.
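The trade-off between the two settings can be shown with a small candidate filter. This is a hedged sketch: `Slave` and `pick_candidate` are hypothetical names, and choosing the least-delayed eligible slave is an assumption made for the example, not a description of the actual election logic.

```python
from typing import List, NamedTuple, Optional


class Slave(NamedTuple):
    name: str
    delay_seconds: int
    in_sync: bool  # semi-sync SYNC status


def pick_candidate(
    slaves: List[Slave], at_sync: bool, max_slave_delay: int
) -> Optional[Slave]:
    """Return the least-delayed slave that satisfies both settings, or None."""
    eligible = [
        s for s in slaves
        if s.delay_seconds <= max_slave_delay and (s.in_sync or not at_sync)
    ]
    return min(eligible, key=lambda s: s.delay_seconds, default=None)


slaves = [Slave("db2", delay_seconds=12, in_sync=False)]

# failover-at-sync = false, failover-max-slave-delay = 30: db2 qualifies.
print(pick_candidate(slaves, at_sync=False, max_slave_delay=30))

# failover-at-sync = true, failover-max-slave-delay = 0: no candidate remains.
print(pick_candidate(slaves, at_sync=True, max_slave_delay=0))  # None
```

With the strict settings, the same cluster state yields no candidate, which is how the stricter configuration reduces availability under lag or network trouble.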
Reference: /pages/04.architecture/02.failover-workflow/docs.md:39
Problem: Switchover hangs or takes longer than expected.
Common causes:
1. FLUSH TABLES WITH READ LOCK timeout:
   switchover-wait-kill (default timeout)
   Solution:
   switchover-wait-kill timeout
   SELECT * FROM information_schema.processlist
   WHERE command != 'Sleep' AND time > 10;
2. Slave lag:
   SHOW SLAVE STATUS lag on all slaves
3. Semi-sync timeout:
4. Proxy reconfiguration delays:
Monitoring: Enable higher log verbosity to debug:
log-level = 4
Reference: /pages/05.configuration/04.switchover/docs.md
Failover is prevented when:
Cluster state issues:
  failover-max-slave-delay
Configuration limits:
  failover-mode = "manual" (default)
  failover-limit reached (default: 5 failovers)
  failover-time-limit window
  failover-at-sync = true but no slave in SYNC status
Replication constraints:
  db-servers-ignored-hosts
  db-servers-prefered-master at highest position
  check-replication-state = true and replication is broken
All cluster down:
  failover-restart-unsafe = false (default) and old master not recovered yet
Manual override available: User can force failover via console/API by temporarily disabling checks.
Reference: /pages/04.architecture/02.failover-workflow/docs.md
Problem: When read_only is set on slaves, MySQL/MariaDB automatically disables the Event Scheduler. After a failover or switchover, the new master needs the Event Scheduler re-enabled, but this doesn't happen automatically without configuration.
Solution: Set the following in your replication-manager configuration:
failover-event-scheduler = true
What this does:
When enabled, replication-manager automatically manages the Event Scheduler across topology changes:
  SET GLOBAL event_scheduler=1 on the newly promoted master
  SET GLOBAL event_scheduler=0 on the old master when it becomes a slave and is set to read-only
No post-failover script is needed. The toggle is fully automated on both promotion and demotion.
GUI: The setting can also be toggled from the dashboard under Settings → Failover → Event Scheduler.
API: PUT /api/clusters/{name}/settings/actions/switch/failover-event-scheduler
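As an illustration of the API route, the request could be built as shown below. Only the HTTP method and path come from the line above; the base URL, the port, the cluster name, and the bearer-token authentication are assumptions made for the example, and `build_switch_request` is a hypothetical helper. The sketch builds the request without sending it, since the endpoint name suggests each call switches the current value.

```python
import urllib.request
from urllib.parse import quote


def build_switch_request(base_url: str, cluster: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the PUT request for failover-event-scheduler."""
    path = (
        f"/api/clusters/{quote(cluster, safe='')}"
        "/settings/actions/switch/failover-event-scheduler"
    )
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        method="PUT",
        # Assumed authentication scheme; adapt to your deployment.
        headers={"Authorization": f"Bearer {token}"},
    )


req = build_switch_request("https://repman.example.com:10005", "cluster1", "TOKEN")
print(req.get_method(), req.full_url)
# PUT https://repman.example.com:10005/api/clusters/cluster1/settings/actions/switch/failover-event-scheduler
```

To actually send it, pass the request to `urllib.request.urlopen(req)` once the URL and credentials match your environment.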