Replication Manager provides two complementary monitoring systems for database integrity: schema monitoring tracks structural changes across your cluster, and data checksumming detects and repairs row-level divergence between the primary and its replicas.
Schema monitoring operates in two phases. First, Replication Manager builds a list of all schema.table pairs. Then, in a background thread, it collects full metadata for each table — column definitions, index structures, collations, and data types — computing a CRC64 fingerprint over the result.
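The collection phase boils down to "serialise the table's metadata deterministically, then fingerprint it". A minimal sketch in Python, assuming a CRC-64/XZ style polynomial and an illustrative metadata field set (the real serialisation format and polynomial are not documented here):

```python
# Sketch of the per-table fingerprinting phase. The CRC-64/XZ parameters
# and the metadata fields chosen below are assumptions for illustration;
# Replication Manager's actual implementation may differ.

CRC64_POLY = 0xC96C5795D7870F42  # reflected ECMA-182 polynomial

def _crc64_table():
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ CRC64_POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table

_TABLE = _crc64_table()

def crc64(data: bytes) -> int:
    """CRC-64/XZ: reflected, init and xorout all-ones."""
    crc = 0xFFFFFFFFFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFFFFFFFFFF

def table_fingerprint(columns, indexes, collation) -> int:
    """Serialise table metadata deterministically, then fingerprint it."""
    payload = "|".join([
        ";".join(f"{name}:{dtype}" for name, dtype in sorted(columns)),
        ";".join(sorted(indexes)),
        collation,
    ]).encode()
    return crc64(payload)
```

Because the serialisation sorts its inputs, the fingerprint is stable across metadata-query ordering, while any change to a column type, index, or collation produces a different value.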
Several mechanisms prevent concurrent collections from running and enforce a minimum interval between runs, controlled by:
| Parameter | Default | Description |
|---|---|---|
| monitoring-schema-scan-timeout | 30 | Maximum seconds for a single schema metadata scan |
All maintenance schedulers are disabled by default and must be opted in explicitly.
| Parameter | Default | Description |
|---|---|---|
| monitoring-scheduler | false | Master switch — enables the internal scheduler engine |
| monitoring-schema-scheduler | true | Enables the schema monitoring schedule |
| monitoring-schema-scheduler-cron | 0 0 1 1 * * | Cron schedule for schema collection (6-field format) |
The default cron runs once a month, at 1 AM on the first day of the month. This conservative default is intentional — clusters with large table dictionaries can have expensive metadata scans, and infrequent scheduled collection is sufficient for most environments.
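The 6-field format reads second, minute, hour, day-of-month, month, day-of-week. A minimal matcher, assuming that field order and supporting only "*" and plain integers (real cron libraries also handle ranges, lists, and steps):

```python
from datetime import datetime

# Minimal matcher for a 6-field cron expression
# (second minute hour day-of-month month day-of-week, 0 = Sunday).
# Only "*" and plain integers are handled in this sketch.

def cron_matches(expr: str, ts: datetime) -> bool:
    fields = expr.split()
    assert len(fields) == 6, "expected a 6-field cron expression"
    actual = [ts.second, ts.minute, ts.hour, ts.day, ts.month,
              ts.isoweekday() % 7]  # map Monday=1..Sunday=7 to Sunday=0..Saturday=6
    return all(f == "*" or int(f) == v for f, v in zip(fields, actual))
```

For example, the checksum scheduler default shown later, 0 0 2 * * 5, matches 02:00:00 on any Friday.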
Fine-grained control over what gets collected:
| Parameter | Default | Description |
|---|---|---|
| monitoring-schema-columns | true | Collect column definitions |
| monitoring-schema-indexes | true | Collect index definitions |
| monitoring-schema-on-replicas | true | Mirror schema collection on all replicas |
| monitoring-schema-ignore-tables | "" | Comma-separated schema.table patterns to exclude |
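How such an ignore list might be applied can be sketched as follows. Wildcard matching via fnmatch is an assumption here; the option may accept only exact schema.table names:

```python
from fnmatch import fnmatch

# Sketch of applying a comma-separated ignore list to a schema.table pair.
# Wildcard support is assumed for illustration only.

def is_ignored(schema: str, table: str, ignore_tables: str) -> bool:
    target = f"{schema}.{table}"
    patterns = [p.strip() for p in ignore_tables.split(",") if p.strip()]
    return any(fnmatch(target, p) for p in patterns)
```

An exact name in the list (no wildcards) simply matches itself, which is how the default checksum exclusions described later behave.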
Once metadata has been collected on the primary and all replicas, Replication Manager continuously compares CRC64 fingerprints across all cluster nodes to detect structural drift — schema divergence between replicas and their primary, missing or added tables, column type changes, collation changes.
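The comparison step reduces to diffing per-node fingerprint maps. A sketch, with illustrative names (the real cluster state structures differ):

```python
# Sketch of structural drift detection between a primary and one replica,
# given {table_name: fingerprint} maps for each node.

def schema_drift(primary: dict, replica: dict) -> dict:
    missing = sorted(set(primary) - set(replica))   # on primary, absent on replica
    added = sorted(set(replica) - set(primary))     # on replica, absent on primary
    changed = sorted(t for t in set(primary) & set(replica)
                     if primary[t] != replica[t])   # fingerprint mismatch
    return {"missing": missing, "added": added, "changed": changed}
```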
| Parameter | Default | Description |
|---|---|---|
| monitoring-schema-change | true | Enable schema change detection and alerting |
| monitoring-schema-change-script | "" | Optional external script invoked on schema change event |
When a MariaDB Spider shard proxy is in use (shardproxy = true), schema changes must be detected and pushed to the Spider proxy node much faster than the default daily schedule allows. In this configuration, schema monitoring runs continuously in the main cluster loop, throttled only by monitoring-schema-scan-timeout. If you use sharding, verify this parameter is tuned appropriately for your table count.
To avoid expensive cold-start scans after a process restart, the full per-server table dictionary — including schema metadata and checksum state — is persisted to disk after every collection. It is automatically reloaded on startup before the first monitoring cycle runs.
The file is stored per server inside the working directory:
{working-dir}/{cluster-name}/{host}_{port}/serverstate.json
For example:
/var/lib/replication-manager/belair/db1.belair.svc.cloud18_3306/serverstate.json
Schema monitoring tells you structure is consistent. Data checksumming tells you rows are consistent.
A checksum run can be initiated in three ways:
- manually from the GUI
- on demand through the API
- automatically via the scheduler

The scheduler switch can be toggled over the API:

POST /api/clusters/{cluster}/settings/actions/switch/monitoring-checksum-scheduler

| Parameter | Default | Description |
|---|---|---|
| monitoring-checksum-scheduler | false | Enable scheduled automatic checksumming |
| monitoring-checksum-scheduler-cron | 0 0 2 * * 5 | Cron schedule — default is every Friday at 2 AM |
Tables are processed serially. For each table, the algorithm splits the rows into primary-key-bounded chunks and writes a per-chunk checksum into the replication_manager_schema.table_chunk working table, using REPEATABLE READ isolation and binlog_format = STATEMENT to ensure the checksum write itself replicates to all replicas identically; because the statement re-executes on each replica against its local rows, comparing the recorded values exposes divergent chunks. Tables without a PRIMARY KEY or UNIQUE KEY cannot be chunked and are marked N/A.
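The chunking step can be sketched as follows. This illustrative version takes the key values in memory; a real implementation would page through the index (for example with LIMIT and an ORDER BY on the key) instead:

```python
# Sketch of primary-key chunking: given the ordered key values of a table,
# emit inclusive (low, high) ranges of at most chunk_size rows, plus the
# WHERE predicate each range translates to.

def chunk_ranges(pk_values, chunk_size):
    keys = sorted(pk_values)
    ranges = []
    for i in range(0, len(keys), chunk_size):
        lo = keys[i]
        hi = keys[min(i + chunk_size, len(keys)) - 1]
        ranges.append((lo, hi))
    return ranges

def range_predicate(col, lo, hi):
    """Render one chunk range as a SQL predicate (no identifier quoting)."""
    return f"{col} >= {lo!r} AND {col} <= {hi!r}"
```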
Each table carries one of five sync states visible in the GUI:
| State | Meaning |
|---|---|
| (blank) | Not yet checksummed |
| WA | Waiting — queued for a checksum run that is already in progress |
| PR | In Progress — currently being checksummed, shows a progress bar |
| OK | All chunks match across all replicas |
| ER | One or more chunks diverge on at least one replica |
| NA | Cannot checksum — no primary/unique key, or a process error occurred |
Some tables diverge by design. Replication Manager uses the replication_manager_schema.jobs table to schedule maintenance tasks on individual nodes, intentionally writing those rows without binary logging so they stay node-local. The replication_manager_schema.table_checksum working table is also ephemeral. Both are excluded by default:
| Parameter | Default |
|---|---|
| monitoring-checksum-ignore-tables | replication_manager_schema.jobs,replication_manager_schema.table_checksum |
Add any additional schema.table pairs to this comma-separated list to suppress known-good divergences.
The chunking engine supports the full range of primary key data types and compositions. The regression test suite at share/sql/checksum-repair-test.sql covers:
- Single-column keys: TINYINT, SMALLINT, INT, BIGINT, UNSIGNED BIGINT, VARCHAR, CHAR, UUID CHAR(36), UUID BINARY(16), DATE, DATETIME, TIMESTAMP, DECIMAL, ENUM
- Composite keys: INT + VARCHAR, BIGINT + DATE, UUID CHAR(36) + TINYINT, ENUM + DATE, ENUM + DATETIME(6), ENUM + TIMESTAMP(3), ENUM + YEAR
When a table is in ER state, Replication Manager stores the exact range predicate for every divergent chunk on every affected replica. Repair can be triggered from the GUI or API at the table, schema, or cluster level.
The repair process per chunk:
```sql
START TRANSACTION;
CREATE OR REPLACE TEMPORARY TABLE tmp_repair
  AS SELECT * FROM schema.table WHERE {chunk_range} FOR UPDATE;
DELETE FROM schema.table WHERE {chunk_range};
INSERT INTO schema.table SELECT * FROM tmp_repair;
COMMIT;
```
This transaction runs on the primary with binlog_format = ROW, so replicas receive an exact row-level copy of the repaired chunk via standard replication — no direct writes to replicas are required.
After all chunks are repaired, the table is automatically re-checksummed to confirm convergence.
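The repair sequence can be assembled mechanically from a stored range predicate. A sketch (identifier quoting and parameter binding are omitted; real code must sanitise schema and table names):

```python
# Sketch of building the per-chunk repair statement sequence from a
# stored chunk range predicate. Inputs are assumed pre-validated.

def repair_statements(schema, table, chunk_range):
    target = f"{schema}.{table}"
    return [
        "START TRANSACTION",
        f"CREATE OR REPLACE TEMPORARY TABLE tmp_repair "
        f"AS SELECT * FROM {target} WHERE {chunk_range} FOR UPDATE",
        f"DELETE FROM {target} WHERE {chunk_range}",
        f"INSERT INTO {target} SELECT * FROM tmp_repair",
        "COMMIT",
    ]
```

Running this list in order on the primary, one chunk at a time, yields the delete-and-reinsert cycle that row-based replication then copies verbatim to every replica.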
The Shards tab provides a table-level view of the entire schema dictionary, including each table's sync state (OK, ER, NA, PR with live progress bar).

The Graph view within the Shards tab visualises relationships between tables using three edge types:
| Edge type | Style | Description |
|---|---|---|
| Foreign key | Solid | Declared FOREIGN KEY constraints |
| Name match | Dashed | Columns sharing the same name across tables (implicit join candidates) |
| Workload query | Dotted | Tables co-appearing in captured query plans |
Name-match edges are off by default as they can produce a large number of implicit edges in schemas with common column naming conventions such as id, created_at, or status. Toggle each edge type independently using the filter controls in the graph toolbar.
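Why name-match edges explode in number is easy to see in a sketch: any shared column name between two tables produces a candidate edge, so ubiquitous names like id connect nearly everything. Illustrative structures only:

```python
from itertools import combinations

# Sketch of deriving "name match" edges: every pair of tables sharing at
# least one column name gets a dashed candidate edge.

def name_match_edges(tables: dict):
    """tables: {table_name: set of column names} -> sorted list of edges."""
    edges = set()
    for (t1, c1), (t2, c2) in combinations(tables.items(), 2):
        if c1 & c2:  # any shared column name
            edges.add(tuple(sorted((t1, t2))))
    return sorted(edges)
```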
When a checksum run completes, each replica that has one or more divergent chunks is marked with isDataDiverge = true in the cluster state. This flag is evaluated every time Replication Manager selects a candidate for failover or switchover.
During candidate election, any replica flagged as divergent is excluded and logged with:
ERR00103: Skip slave in election {server} data diverge in checksum
This exclusion is tracked in the election audit trail under IgnoredDataDivergence so operators can see exactly why a replica was bypassed.
If divergence has been detected on all replicas, no candidate can be found and the operation is blocked with:
ERR00032: No candidates found in slaves list
This escalation from a per-server warning to a cluster-level error is intentional: a single divergent replica is a warning that should be investigated and repaired, but a fully divergent replica set is a signal that automatic promotion would risk serving stale or inconsistent data.
Replication Manager's core design principle is availability first. Blocking all failovers because of data divergence would be the wrong default for most production environments where a brief inconsistency is preferable to an extended outage.
| Parameter | Default | Description |
|---|---|---|
| failover-divergent-data | true | Allow failover and switchover to replicas with divergent checksum data |
With the default of true, divergent replicas participate in candidate election normally. The ERR00103 warning is still raised and recorded — giving operators full visibility — but it does not block promotion.
Setting this to false enforces a strict consistency policy: no replica with a known data divergence will ever be promoted. This is appropriate for environments where data correctness takes absolute precedence over availability, such as financial ledgers, compliance-critical datasets, or systems where serving stale data has regulatory consequences.
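Putting the two error codes and the toggle together, the election-time filtering can be sketched as follows (field and flag names follow the text; the actual Go implementation differs):

```python
# Sketch of divergence-aware candidate election. With the default
# failover_divergent_data=True, divergent replicas still participate but
# ERR00103 is recorded; with False they are excluded, and if nothing is
# left the run fails with ERR00032.

def filter_candidates(replicas, failover_divergent_data=True):
    """replicas: list of dicts like {"name": str, "isDataDiverge": bool}."""
    messages, eligible = [], []
    for r in replicas:
        if r["isDataDiverge"]:
            messages.append(f"ERR00103: Skip slave in election {r['name']} "
                            "data diverge in checksum")
            if not failover_divergent_data:
                continue  # strict policy: never promote a divergent replica
        eligible.append(r["name"])
    if not eligible:
        messages.append("ERR00032: No candidates found in slaves list")
    return eligible, messages
```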
If you run with failover-divergent-data = false, make sure your operational process accounts for the fact that an unrepaired divergence will block all automated failover. The recommended cycle is:

1. Run checksums on a schedule (monitoring-checksum-scheduler = true)
2. Act on ERR00103 alerts — each one identifies a specific replica and set of divergent chunks
3. Repair the affected tables until isDataDiverge is cleared

The failover-divergent-data toggle is available in Settings → Replication → Failover under the label "Failover enable on divergent data", allowing it to be changed at runtime without a configuration file edit or restart.