replication-manager provides continuous, automated high availability for MariaDB and MySQL clusters. It monitors replication topology on every tick, enforces a target topology, and takes corrective action — switchover or failover — only when it is safe to do so. Every decision passes through the same monitoring loop that detected the problem, so no action is ever taken on incomplete or ambiguous information.
On every monitoring loop replication-manager connects to all configured servers and reconstructs the full replication graph: which server is the current primary, which are replicas, what GTID positions they hold, whether replication threads are running, and how far each replica lags behind.
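As an illustration only, the information gathered in one monitoring pass could be modelled like this. The `ServerState` class, its field names, and both helper functions are hypothetical and are not replication-manager's internal types; the sketch just shows how a primary and its healthy replicas can be derived from per-server observations.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ServerState:
    """One server as observed in a single monitoring pass (hypothetical model)."""
    name: str
    reachable: bool
    source: Optional[str] = None       # server it replicates from; None for a primary
    gtid_position: str = ""            # e.g. a MariaDB GTID such as "0-1-1042"
    io_thread_running: bool = False
    sql_thread_running: bool = False
    lag_seconds: Optional[int] = None  # None when lag could not be measured


def find_primaries(servers: List[ServerState]) -> List[str]:
    """Names of reachable servers that do not replicate from anyone."""
    return [s.name for s in servers if s.reachable and s.source is None]


def healthy_replicas(servers: List[ServerState], primary: str, max_lag: int) -> List[str]:
    """Replicas of `primary` with both threads running and lag within `max_lag`."""
    return [
        s.name
        for s in servers
        if s.reachable
        and s.source == primary
        and s.io_thread_running
        and s.sql_thread_running
        and s.lag_seconds is not None
        and s.lag_seconds <= max_lag
    ]
```

A replica whose SQL thread has stopped, or whose lag is unknown, is excluded from the healthy set in this sketch even though it still points at the right primary.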
The default target topology is primary/replica (also called master/slave). Other topologies — multi-primary, multi-tier replicas, binlog server, Galera, and more — can be declared as the target, and replication-manager will discover and validate the current state against that target on every loop.
replication-manager will not perform any automated action until the observed topology matches the declared target topology. This guarantee means that a partially converged cluster, a cluster mid-rejoin, or a cluster where replica lag has not yet resolved is left alone until the situation is fully understood. The same loop that detects a problem also validates that the cluster is in a known, stable state before acting.
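The gate described above can be sketched for the default primary/replica target. This is a simplified assumption about what "matches the target" means (exactly one primary, every other server replicating from it with running threads and bounded lag); the function name, dictionary keys, and the 30-second default are hypothetical.

```python
def observed_matches_target(servers, max_lag_seconds=30):
    """Return True when the observed state looks like a stable primary/replica cluster.

    `servers` maps a server name to a dict with the hypothetical keys
    'source' (name or None), 'threads_running' (bool), 'lag_seconds' (int or None).
    """
    primaries = [name for name, s in servers.items() if s["source"] is None]
    if len(primaries) != 1:
        return False  # zero or several primaries: not the declared target
    primary = primaries[0]
    for name, s in servers.items():
        if name == primary:
            continue
        if s["source"] != primary:
            return False  # replica attached to the wrong server
        if not s["threads_running"]:
            return False  # replication stopped, e.g. mid-rejoin
        if s["lag_seconds"] is None or s["lag_seconds"] > max_lag_seconds:
            return False  # lag unknown or not yet resolved
    return True
```

Any automated step would then be guarded by a check of this kind, so that a cluster still converging is simply observed again on the next loop.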
Alongside topology discovery, every monitoring loop runs a series of false-positive checks before concluding that a primary is truly unreachable.
Only when all applicable checks agree that the primary is genuinely lost will replication-manager consider an automated failover. This prevents split-brain and unnecessary failovers caused by transient network events, monitoring host isolation, or slow primaries under load.
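The "all applicable checks agree" rule can be expressed as a small aggregation. The check names used in the example are placeholders chosen for illustration, not the names of replication-manager's actual checks.

```python
def primary_confirmed_lost(check_results):
    """Combine false-positive checks into a single verdict.

    `check_results` maps a check name to:
      True  - the check agrees the primary is lost
      False - the check still sees the primary alive
      None  - the check is not applicable (for example, not configured)
    """
    applicable = [result for result in check_results.values() if result is not None]
    # With no applicable check there is no evidence, so never confirm the loss.
    return bool(applicable) and all(applicable)
```

For example, `primary_confirmed_lost({"replicas_see_primary_down": True, "external_probe": False})` returns `False`: a single dissenting check is enough to hold back a failover.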
By default replication-manager operates in manual mode: when a failure is detected and all false-positive checks confirm the primary is genuinely lost, an alert is sent to all configured channels and the cluster is held in a degraded-but-safe state. No automatic failover is triggered. This gives the supervision team time to assess the situation, check for external causes (network partition, storage failure, planned maintenance), and decide whether to promote a replica or wait for the primary to recover.
failover-mode = "manual" # default — alert and wait for operator action
To enable fully automated failover without operator intervention:
failover-mode = "automatic" # promote the best replica automatically
In automatic mode replication-manager still applies all false-positive and topology checks before acting — the difference is that the promotion step executes without waiting for a human decision.
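The difference between the two modes comes down to the final step, as the following sketch shows. The function and its return values are hypothetical; only the `"manual"` and `"automatic"` mode names come from the configuration shown above.

```python
def next_action(failover_mode, primary_lost, checks_passed):
    """Decide what happens at the end of a monitoring loop.

    Returns "none", "alert" (hold the cluster and notify operators),
    or "promote" (promote the best replica without waiting).
    """
    if failover_mode not in ("manual", "automatic"):
        raise ValueError(f"unknown failover mode: {failover_mode!r}")
    # Both modes apply the same safety checks before anything else.
    if not (primary_lost and checks_passed):
        return "none"
    if failover_mode == "automatic":
        return "promote"
    return "alert"
```

Note that the safety checks sit above the mode test, so switching to automatic mode never bypasses them.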
Recommendation: run `manual` mode in production until the team is confident in the false-positive tuning, replication health scores, and alerting pipeline. Switch to `automatic` once those are validated.
A switchover is a controlled primary promotion performed while the current primary is still reachable and healthy. It is the correct operation for planned maintenance, host migrations, software upgrades, and load rebalancing.
During a switchover replication-manager:
Because the old primary is still alive, there is no data loss and the operation is fully reversible — another switchover brings the old primary back as the new leader.
A failover is an emergency promotion triggered when the primary can no longer be reached and all false-positive checks have been exhausted. It is the operation that keeps the cluster writable when a primary host has crashed, lost power, or become permanently partitioned.
During a failover replication-manager:
Because the old primary was not available to flush transactions, a small amount of data loss (transactions committed but not yet replicated) is possible depending on the replication mode configured (semi-sync, binlog-commit-wait, etc.). The number of failovers can be capped to prevent cascading failures.
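One way to picture the failover cap is a simple counter that refuses further promotions once a limit is reached. The `FailoverBudget` class below is a hypothetical illustration, not replication-manager's implementation, and it ignores details such as time windows or counter resets.

```python
class FailoverBudget:
    """Tracks how many failovers have run and refuses more past a limit."""

    def __init__(self, limit):
        if limit < 0:
            raise ValueError("limit must be non-negative")
        self.limit = limit
        self.count = 0

    def allow(self):
        """True while another failover would stay within the limit."""
        return self.count < self.limit

    def record(self):
        """Register a completed failover; raises if the cap is already reached."""
        if not self.allow():
            raise RuntimeError("failover limit reached; operator action required")
        self.count += 1
```

With a limit of 1, a second crash shortly after the first promotion leaves the cluster waiting for an operator instead of promoting yet another replica.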
After a failover the old primary eventually comes back online. Rejoin is the process of safely reintroducing that server as a replica of the new primary.
Rejoin must handle the case where the old primary may have accepted writes that never replicated — those transactions must be rolled back or skipped before the server can safely join. replication-manager automates this using:
The rejoin method is configurable per cluster based on acceptable data risk and recovery time objectives.
Reseeding is the process of provisioning a brand-new server and bringing it into the cluster as a replica. Unlike rejoin (which handles a server that was previously a member), reseeding starts from scratch:
replication-manager automates reseeding using the configured backup tools (mydumper, mariabackup, xtrabackup, or restic snapshots) and can target any server in the cluster as the donor to avoid load on the primary.
| Operation | Trigger | Primary state | Data loss risk |
|---|---|---|---|
| Switchover | Planned / manual | Alive and reachable | None |
| Failover | Automatic / manual | Unreachable | Possible (depends on sync mode) |
| Rejoin | After failover | Returning online | Handled by rollback or restore |
| Reseeding | Adding new node | Running normally | None (backup-based) |