Problem: Connection failures or socket errors when using localhost in db-servers-hosts.
Cause: On Unix systems, MySQL treats "localhost" specially and attempts a Unix socket connection rather than a TCP connection. This differs from network-based connections.
Solution: Use the IP address 127.0.0.1 instead of localhost in your configuration:
db-servers-hosts = "127.0.0.1:3306,127.0.0.1:3307"
db-servers-credential = "root:password"
Ensure database user privileges are granted for 127.0.0.1 rather than localhost:
GRANT ALL PRIVILEGES ON *.* TO 'root'@'127.0.0.1' IDENTIFIED BY 'password';
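The db-servers-hosts value is a plain comma-separated host:port list. As a minimal sketch of how such a value splits into address pairs (the helper name is illustrative, not part of replication-manager):

```python
# Sketch: split a replication-manager style "db-servers-hosts" string
# into (host, port) tuples. Illustrative helper, not part of the tool.
def parse_db_hosts(hosts: str) -> list[tuple[str, int]]:
    pairs = []
    for entry in hosts.split(","):
        # rpartition keeps IPv4/hostname intact and isolates the port
        host, _, port = entry.strip().rpartition(":")
        pairs.append((host, int(port)))
    return pairs

print(parse_db_hosts("127.0.0.1:3306,127.0.0.1:3307"))
# → [('127.0.0.1', 3306), ('127.0.0.1', 3307)]
```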
Reference: /pages/05.configuration/02.databases/docs.md:25
Requirement: replication-manager requires GTID-enabled replication.
Supported versions:
Configuration: Ensure GTID is enabled on all database nodes:
gtid_mode = ON # MySQL
gtid_domain_id = 1 # MariaDB
enforce_gtid_consistency = ON # MySQL
Reference: Installation documentation
Problem: HAProxy statistics not appearing in replication-manager dashboard or incorrect connection counts.
Cause: HAProxy versions earlier than 1.7 do not provide complete statistics information.
Solution: Upgrade to HAProxy 1.7 or later. Verify HAProxy statistics socket is accessible:
haproxy-servers = "127.0.0.1:3310"
haproxy-write-port = 3306
haproxy-read-port = 3307
Reference: /pages/05.configuration/06.routing/01.haproxy/docs.md
Problem: Changes made via the web UI or API are lost when replication-manager restarts.
Cause: Dynamic configuration saving is disabled by default.
Solution: Enable configuration persistence in the [Default] section of your config file:
[Default]
monitoring-save-config = true
With this enabled, changes are saved to /var/lib/replication-manager/cluster_name/config.toml (2.x) or /root/.config/replication-manager/cluster_name/config.toml (3.x).
Reference: /pages/02.installation/02.configuration/docs.md:21
Major changes in 3.x:
Configuration location changes:
/etc/replication-manager/config.toml
/root/.config/replication-manager/config.toml (active config)
Docker volume requirements:
# Add third volume mount for 3.x
docker run -v /home/repman/etc:/etc/replication-manager:rw \
-v /home/repman/data:/var/lib/replication-manager:rw \
-v /home/repman/config:/root/.config/replication-manager:rw
Parameter renames (partial list):
monitoring-config-rewrite → monitoring-save-config
hosts → db-servers-hosts
rpluser → replication-credential
prefmaster → db-servers-prefered-master
Migration steps:
Reference: /pages/02.installation/04.migration/docs.md
Version 3.x uses a two-part structure: [Default] section for global settings and separate cluster sections.
Minimal configuration for single cluster:
[Default]
monitoring-save-config = true
include = "/etc/replication-manager/cluster.d"
[cluster1]
title = "cluster1"
prov-orchestrator = "onpremise"
db-servers-hosts = "127.0.0.1:3306,127.0.0.1:3307"
db-servers-prefered-master = "127.0.0.1:3306"
db-servers-credential = "root:password"
replication-credential = "repl_user:repl_password"
Required cluster parameters:
title: Cluster name
prov-orchestrator: Provisioning orchestrator (typically "onpremise")
db-servers-hosts: Comma-separated database server list
db-servers-prefered-master: Preferred master for elections
db-servers-credential: Database admin credentials (user:password)
replication-credential: Replication user credentials (user:password)
Alternative (legacy style): For simple deployments, cluster parameters can live directly in the [Default] section:
[Default]
title = "ClusterTest"
db-servers-hosts = "127.0.0.1:3306,127.0.0.1:3307"
db-servers-credential = "root:password"
replication-credential = "repl_user:repl_password"
failover-mode = "manual"
Reference: /pages/02.installation/02.configuration/docs.md:27
Short answer: No.
Detailed explanation: Semi-sync SYNC status does not guarantee the old master is replication-consistent with the cluster after a crash or shutdown.
Known issues:
What semi-sync guarantees: No client applications have seen transactions that didn't reach a replica, but the master's binary log may contain additional events not yet replicated.
Impact: In heavy write scenarios, crashed masters often require re-provisioning from another node rather than rejoining the cluster.
Recommendation: Use rpl_semi_sync_master_wait_point = AFTER_COMMIT (default) to ensure client-visible transactions are safer, even though it may leave more transactions in the binary log after a crash.
Reference: /pages/07.howto/01.replication-best-practice/docs.md:44
Problem: Rejoining slaves during switchover fails when using expire_logs_days after extended periods without writes.
Cause: Binary logs are automatically purged based on expire_logs_days, which may remove logs needed for slave rejoin after the cluster has been idle.
Related bug: MDEV-10869
Solution:
Increase the expire_logs_days value to retain logs longer
Use binlog_expire_logs_seconds (MariaDB 10.6+) for finer control
Workaround: If switchover fails, you may need to re-provision affected slaves from the new master.
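The day-based setting converts to the seconds-based one (binlog_expire_logs_seconds, MariaDB 10.6+) by multiplying by 86400:

```python
# expire_logs_days -> binlog_expire_logs_seconds: days * 86400
def days_to_binlog_expire_seconds(days: float) -> int:
    return int(days * 86400)

print(days_to_binlog_expire_seconds(5))  # 432000 seconds = 5 days
```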
Reference: Current FAQ
Parameter: rpl_semi_sync_master_wait_point
AFTER_COMMIT (recommended):
AFTER_SYNC:
Recommendation: Use AFTER_COMMIT for safer client experience.
Reference: /pages/07.howto/01.replication-best-practice/docs.md:50
Problem: Applications with SUPER privileges can write to a read-only master during switchover.
Cause: MariaDB does not have MySQL's super_read_only protection. The READ_ONLY flag does not block SUPER users from writing.
Related bug: MDEV-9458
Impact: During switchover:
SUPER users can still write after READ_ONLY is set
Writes are only fully blocked once FLUSH TABLES WITH READ LOCK is taken
Mitigation:
Check for unexpected writes on the READ_ONLY slave
Reduce max_connections during switchover to limit queued connections
Best practice: Don't grant SUPER privileges to application users.
Reference: Current FAQ
Problem: MySQL server hangs during shutdown when using GTID with autocommit=0 and super_read_only=ON.
Affected versions: MySQL 5.7 before 5.7.25 and 8.0 before 8.0.14
Fixed in: MySQL 5.7.25 and 8.0.14
Cause: Transaction attempting to save GTIDs to mysql.gtid_executed table fails because super_read_only=ON prevents the update. With autocommit=0, the transaction never completes, blocking shutdown.
Solution: Upgrade to MySQL 5.7.25/8.0.14 or later.
Workaround (if upgrade not possible): Set autocommit=1 or avoid super_read_only on slaves.
Bug reference: Bug #28183718
Reference: Current FAQ
Problem: Semi-sync timeout causes workload changes and increased failover risk.
Behavior: When rpl_semi_sync_master_timeout (default: 10 seconds) is reached:
Impact before timeout: Semi-sync slows workload to network replication speed, creating backpressure on writes.
Impact after timeout: Replication silently degrades to asynchronous mode, so transactions acknowledged after the timeout may not exist on any slave at failover time.
Monitoring: replication-manager tracks "In Sync" status and SLA metrics to determine when safe failover windows exist.
Reference: /pages/07.howto/01.replication-best-practice/docs.md:46
Problem: Relay slaves cannot automatically reconnect in multi-tier replication when their intermediate master fails.
Cause: replication-manager does not automatically manage relay node failures in multi-tier topologies.
Limitation: If you have:
Master → Relay → Slave
And the Relay node dies, the Slave cannot automatically reconnect to Master.
Workaround: Manually repoint slaves to the new topology after relay node failure.
Design consideration: Multi-tier topologies require additional operational procedures for relay node failures.
Reference: /pages/05.configuration/05.replication/docs.md
Restriction: Do not use server-id = 1000 on any database node in your cluster.
Reason: replication-manager reserves server-id = 1000 for binlog server operations during crash recovery.
Impact: Using server-id 1000 in your cluster will cause:
Solution: Use any server-id except 1000. Common practice is sequential IDs: 3306, 3307, 3308, etc.
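A pre-flight check over your nodes' server-id values is easy to script. A hypothetical sketch (this check is illustrative, not part of replication-manager):

```python
# Reserved by replication-manager for binlog server operations during
# crash recovery; no cluster node may use it.
RESERVED_SERVER_ID = 1000

def conflicting_server_ids(server_ids: list[int]) -> list[int]:
    """Return the ids that collide with the reserved binlog-server id."""
    return [sid for sid in server_ids if sid == RESERVED_SERVER_ID]

print(conflicting_server_ids([3306, 3307, 3308]))  # [] -> safe
print(conflicting_server_ids([1000, 3307]))        # [1000] -> must be changed
```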
Reference: /pages/05.configuration/03.failover/02.crash-recovery/docs.md
Multi-layer protection:
1. Multiple checks - Master failure is verified N times before acting
2. False positive detection:
3. Time-based protection:
failover-time-limit: Prevents repeated failovers within specified timeframe
4. Default manual mode:
failover-mode = "manual" (default)
5. Replication checks:
Replication state is verified before election (disable with check-replication-state = false)
Best practice configuration:
failover-mode = "automatic"
failover-limit = 3
failover-time-limit = 10
failover-at-sync = false
failover-max-slave-delay = 30
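The count- and time-based guards above can be modeled in a few lines. A simplified sketch (not the actual implementation; defaults taken from the example config):

```python
# Simplified model of the flip-flop guard: refuse a failover that would
# exceed failover-limit or fall inside the failover-time-limit window.
# Illustrative only; not replication-manager's actual code.
def failover_allowed(now: float, last_failover: float, count: int,
                     limit: int = 3, time_limit: float = 10.0) -> bool:
    if limit and count >= limit:
        return False              # failover-limit exhausted
    if time_limit and now - last_failover < time_limit:
        return False              # inside the flip-flop protection window
    return True

assert failover_allowed(now=100.0, last_failover=50.0, count=0)       # allowed
assert not failover_allowed(now=100.0, last_failover=95.0, count=0)   # too soon
assert not failover_allowed(now=100.0, last_failover=50.0, count=3)   # limit hit
```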
Reference: /pages/05.configuration/03.failover/01.false-positive-detetection/docs.md
Default behavior: replication-manager waits for the old master to recover.
Reason: First node to restart after total cluster failure could be a delayed slave, leading to significant data loss if promoted.
Configuration parameter: failover-restart-unsafe
When failover-restart-unsafe = false (default):
When failover-restart-unsafe = true:
Recommendation:
Keep the default (false) unless availability is critical
If set to true, ensure automated slave startup procedures
Reference: /pages/04.architecture/02.failover-workflow/docs.md:57
Failover can be rejected for multiple reasons:
No valid slave available:
All slaves exceed failover-max-slave-delay (default: 30 seconds)
Candidates excluded by db-servers-prefered-master requirements
Candidates on the db-servers-ignored-hosts list
Failover limits reached:
failover-limit counter exceeded (default: 5)
Time limit not met:
Previous failover within the failover-time-limit window
Sync status requirement:
failover-at-sync = true but no slaves in SYNC status
Cluster state invalid:
Check status: Review logs and cluster state for specific rejection reason.
Reference: /pages/04.architecture/02.failover-workflow/docs.md:20
Parameter: failover-time-limit
Purpose: Prevents repeated failovers from the same root cause (flip-flop protection).
Recommended value: 10 seconds
failover-time-limit = 10
Behavior: If previous failover occurred within this time window, new failover is canceled.
Why this matters:
Set to 0: Unlimited failovers (not recommended for production)
Reset counter: Use console/API to manually reset if legitimate failover needed within time window.
Reference: /pages/05.configuration/03.failover/docs.md:33
Parameter: failover-at-sync
Purpose: Controls whether failover requires semi-sync SYNC status.
When failover-at-sync = false (default):
When failover-at-sync = true:
Recommendation for automatic failover:
Relaxed: failover-at-sync = false with failover-max-slave-delay = 30
Strict: failover-at-sync = true with failover-max-slave-delay = 0
Trade-off: Strict sync requirements reduce availability during network issues or high load.
Reference: /pages/04.architecture/02.failover-workflow/docs.md:39
Problem: Switchover hangs or takes longer than expected.
Common causes:
1. FLUSH TABLES WITH READ LOCK timeout:
Long-running queries hold the lock beyond switchover-wait-kill (default timeout)
Solution:
Increase the switchover-wait-kill timeout, or identify and kill long-running queries:
SELECT * FROM information_schema.processlist
WHERE command != 'Sleep' AND time > 10;
2. Slave lag:
Check SHOW SLAVE STATUS lag on all slaves
3. Semi-sync timeout:
4. Proxy reconfiguration delays:
Monitoring: Enable higher log verbosity to debug:
log-level = 4
Reference: /pages/05.configuration/04.switchover/docs.md
Failover is prevented when:
Cluster state issues:
All slaves lag beyond failover-max-slave-delay
Configuration limits:
failover-mode = "manual" (default)
failover-limit reached (default: 5 failovers)
Previous failover within the failover-time-limit window
failover-at-sync = true but no slave in SYNC status
Replication constraints:
All candidates on the db-servers-ignored-hosts list
db-servers-prefered-master not at the highest replication position
check-replication-state = true and replication is broken
All cluster down:
failover-restart-unsafe = false (default) and old master not recovered yet
Manual override available: User can force failover via console/API by temporarily disabling checks.
Reference: /pages/04.architecture/02.failover-workflow/docs.md
Limitation: Two-node clusters lack fault tolerance for automatic failover decisions.
Problems with two nodes:
Split-brain risk:
No quorum:
Single point of failure:
Recommendation: Use minimum three nodes:
Alternative: Two-node with external arbitrator to break ties.
Reference: /pages/05.configuration/05.replication/docs.md
Supported but with constraints:
No serialized isolation:
Deadlock risk:
Certification-based replication:
Recommendations:
Reference: /pages/04.architecture/03.topologies/06.multi-master-galera/docs.md
Required configuration for master-master (multi-master) topology:
Critical setting:
read_only = 1
Must be set in MariaDB configuration file (my.cnf), not just dynamically.
How it works:
Both masters start in read_only mode
replication-manager sets read_only = 0 on the active master only
read_only = 1 remains on the standby master
Without read_only = 1 in config:
Additional protection:
Reference: /pages/05.configuration/05.replication/docs.md
Problem: Relay node failures are not automatically managed.
Multi-tier example:
DC1: Master → Relay1
DC2: Relay1 → Slave1, Slave2
If Relay1 crashes:
Limitation: Designed for master failure, not intermediate relay failures.
Workaround options:
Option 1: Manually repoint slaves to master
CHANGE MASTER TO MASTER_HOST='master', ...
Option 2: Use scripts with failover-post-script to handle relay failures
Option 3: Avoid multi-tier topologies in critical paths
When multi-tier makes sense:
Reference: /pages/05.configuration/05.replication/docs.md
Default credentials: admin:repman (INSECURE)
Security risk: Default credentials must be changed before production use.
Configuration parameter:
[Default]
api-credentials = "myuser:mypassword"
Effect:
CLI usage with custom credentials:
replication-manager-cli --user=myuser --password=mypassword status
Best practice: Use encrypted passwords (see password encryption question below).
Reference: /pages/05.configuration/07.security/docs.md:31
Three-step process:
Step 1: Generate encryption key (as root)
replication-manager keygen
This creates an encryption key accessible only to root.
Step 2: Encrypt your password
replication-manager password secretpass
Output:
Encrypted password hash: 50711adb2ef2a959577edbda5cbe3d2ace844e750b20629a9bcb
Step 3: Use encrypted password in config
db-servers-credential = "root:50711adb2ef2a959577edbda5cbe3d2ace844e750b20629a9bcb"
replication-credential = "repl:50711adb2ef2a959577edbda5cbe3d2ace844e750b20629a9bcb"
Automatic decryption: When replication-manager starts and detects the encryption key, passwords are automatically decrypted.
Security: Encryption key is only readable by root, providing basic password obfuscation.
Reference: /pages/05.configuration/07.security/docs.md:6
Two modes available:
Mode 1: config_store_v2 (store credentials in Vault secret)
[mycluster]
vault-server-addr = "http://vault.example.com:8200"
vault-auth = "approle"
vault-role-id = "your-role-id"
vault-secret-id = "your-secret-id"
vault-mode = "config_store_v2"
vault-mount = "kv"
db-servers-credential = "applications/repman"
replication-credential = "applications/repman"
Create Vault secret:
vault kv put kv/applications/repman \
db-servers-credential=root:password \
replication-credential=repl:password
Mode 2: database_engine (automatic password rotation)
[mycluster]
vault-mode = "database_engine"
db-servers-credential = "database/static-creds/repman-monitor"
replication-credential = "database/static-creds/repman-replication"
Configure Vault database role:
vault write database/config/my-mysql-database \
plugin_name=mysql-database-plugin \
connection_url="{{username}}:{{password}}@tcp(127.0.0.1:3306)/" \
allowed_roles="repman-monitor,repman-replication" \
username="vaultuser" \
password="vaultpass"
vault write database/static-roles/repman-monitor \
db_name=my-mysql-database \
username="repman" \
rotation_period=600
Automatic rotation: Vault rotates passwords at specified interval; replication-manager fetches new passwords on authentication errors.
Reference: /pages/05.configuration/07.security/docs.md:123
Yes, for production deployments.
Default behavior: replication-manager ships with self-signed certificates for HTTPS and API access.
Security risk: Self-signed certificates:
Configuration parameters:
[Default]
monitoring-ssl-cert = "/path/to/your/server.crt"
monitoring-ssl-key = "/path/to/your/server.key"
Generate proper certificates:
# RSA certificate
openssl genrsa -out server.key 2048
openssl req -new -x509 -sha256 -key server.key -out server.crt -days 3650
# ECDSA certificate (recommended)
openssl ecparam -genkey -name secp384r1 -out server.key
openssl req -new -x509 -sha256 -key server.key -out server.crt -days 3650
Best practice: Use certificates from your organization's PKI or a trusted CA like Let's Encrypt.
Reference: /pages/05.configuration/07.security/docs.md:40
Problem: Database users not appearing in ProxySQL configuration.
Cause: User bootstrap feature not enabled or configured incorrectly.
Configuration parameter:
proxysql-bootstrap-users = true
Behavior when enabled:
Manual mode (proxysql-bootstrap-users = false):
Troubleshooting:
Verify the proxysql-servers parameter points to the admin interface
Reference: /pages/05.configuration/06.routing/01.proxysql/docs.md
Bootstrap mode (proxysql-bootstrap-users = true):
Manual mode (proxysql-bootstrap-users = false):
Recommendation:
Reference: /pages/05.configuration/06.routing/01.proxysql/docs.md
Problem: MaxScale's internal monitor has built-in delays before detecting topology changes.
Cause: MaxScale uses polling intervals to detect master failures and slave status changes, typically 1-2 seconds.
Impact during failover:
Solution: replication-manager can shortcut MaxScale's monitor by updating it directly via the MaxScale API, bypassing the polling delays.
Configuration:
maxscale-servers = "127.0.0.1:3306"
maxscale-write-port = 3306
maxscale-read-port = 3307
Benefit: Near-instant routing updates during switchover/failover.
Reference: /pages/05.configuration/06.routing/02.maxscale/docs.md
Decision matrix:
HAProxy:
ProxySQL:
MaxScale:
Consul:
Recommendation: Start with HAProxy for simplicity, move to ProxySQL for advanced features.
Reference: Configuration documentation for each proxy type
Recovery methods depend on scenario:
Scenario 1: Master recoverable with GTID consistency
Scenario 2: Master has extra transactions (diverged)
Method A: Flashback (MariaDB with flashback enabled)
autorejoin-flashback = true
Method B: mysqldump
autorejoin-mysqldump = true
Method C: Physical backup
autorejoin-backup-binlog = true
autorejoin-semisync = false
Method D: Manual re-provision
Crash information saved: Check /var/lib/replication-manager/crash*Unixtime*/ for binary logs and election details.
Reference: /pages/05.configuration/03.failover/02.crash-recovery/docs.md
Flashback recovery:
Requirements:
binlog_format = ROW
autorejoin-flashback = true
Advantages:
Disadvantages:
Use when: Small transaction divergence, MariaDB environment, fast recovery priority
mysqldump recovery:
Requirements:
autorejoin-mysqldump = true
Advantages:
Disadvantages:
Use when: Large divergence, MySQL (not MariaDB), guarantee consistency
Physical backup recovery:
Requirements:
autorejoin-backup-binlog = true
Advantages:
Disadvantages:
Use when: Very large databases, storage supports snapshots, fastest recovery critical
Reference: /pages/05.configuration/03.failover/02.crash-recovery/docs.md
replication-manager records crash details in multiple locations:
Location 1: Crash directory
/var/lib/replication-manager/crash*Unixtime*/
Contains:
Location 2: Cluster state file
/var/lib/replication-manager/cluster_name.json
Contains:
{
"crashes": [
{
"URL": "127.0.0.1:3310",
"FailoverMasterLogFile": "bin.000001",
"FailoverMasterLogPos": "459",
"FailoverSemiSyncSlaveStatus": true,
"FailoverIOGtid": [{"DomainID": 0, "ServerID": 3310, "SeqNo": 1}],
"ElectedMasterURL": "127.0.0.1:3311"
}
]
}
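The state file is plain JSON, so crash entries can be pulled out with any JSON tool. A small sketch using the example above (field names taken from that example):

```python
import json

# Example cluster state, copied from the answer above
STATE = """
{
  "crashes": [
    {
      "URL": "127.0.0.1:3310",
      "FailoverMasterLogFile": "bin.000001",
      "FailoverMasterLogPos": "459",
      "FailoverSemiSyncSlaveStatus": true,
      "FailoverIOGtid": [{"DomainID": 0, "ServerID": 3310, "SeqNo": 1}],
      "ElectedMasterURL": "127.0.0.1:3311"
    }
  ]
}
"""

# Summarize each crash: failed node, failover position, elected master
for crash in json.loads(STATE)["crashes"]:
    print(f"{crash['URL']} failed at {crash['FailoverMasterLogFile']}:"
          f"{crash['FailoverMasterLogPos']}, elected {crash['ElectedMasterURL']}")
```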
Location 3: API endpoint
GET /api/clusters/cluster_name/crashes
When crashes are cleared: Crash entries are cleared once the cluster topology returns to an error-free state.
Reference: /pages/07.howto/03.toubleshoot-crashes/docs.md:26
Increase log verbosity dynamically via API:
replication-manager-cli api \
url=https://127.0.0.1:3000/api/clusters/cluster_name/settings/switch/verbosity
Or configure in config file:
[Default]
log-level = 4
Log levels:
Per-module logging (3.1+):
log-level = 2 # Global default
log-level-backup-stream = 4 # Debug backups
log-level-proxy = 1 # Minimal proxy logs
log-level-heartbeat = 4 # Debug heartbeat
Collecting debug information for support:
# Increase verbosity
replication-manager-cli api url=.../verbosity
# Reproduce issue
# Collect internal state
replication-manager-cli show > state.json
# Attach to support ticket:
# - /var/log/replication-manager.log
# - state.json
Reference: /pages/07.howto/03.toubleshoot-crashes/docs.md:7
Change in version 3.x: Performance Schema monitoring enabled by default.
New defaults in 3.x:
monitoring-performance-schema-mutex = true
monitoring-performance-schema-latch = true
monitoring-performance-schema-memory = true
Impact: Increased database load from performance schema queries, especially on:
Solution if overhead is problematic:
Disable specific monitors:
monitoring-performance-schema-mutex = false
monitoring-performance-schema-latch = false
monitoring-performance-schema-memory = false
Or disable performance schema entirely on database:
[mysqld]
performance_schema = OFF
Trade-off: Disabling reduces monitoring visibility into database internals.
Recommendation: Monitor database CPU/load after upgrade; disable only if overhead is measurable.
Reference: /pages/02.installation/04.migration/docs.md:156
Collect internal status:
replication-manager-cli show
Output includes internal state of:
Filter specific class:
replication-manager-cli show --get=servers
replication-manager-cli show --get=crashes
For support tickets, attach:
Output of the show command (JSON format)
/var/log/replication-manager.log
# Enable verbose logging
replication-manager-cli api url=.../verbosity
# Reproduce issue
# Collect logs
Submit to: https://github.com/signal18/replication-manager/issues
Reference: /pages/07.howto/03.toubleshoot-crashes/docs.md:14
MariaDB 10.1+ optimistic parallel replication:
slave_parallel_mode = optimistic
slave_domain_parallel_threads = 4 # Set to number of CPU cores
slave_parallel_threads = 4 # Set to number of CPU cores
expire_logs_days = 5
sync_binlog = 1
log_slave_updates = ON
Benefits:
Why this matters:
Verification:
SHOW VARIABLES LIKE 'slave_parallel%';
Reference: /pages/07.howto/01.replication-best-practice/docs.md:14
MariaDB semi-sync configuration:
plugin_load = "semisync_master.so;semisync_slave.so"
rpl_semi_sync_master_enabled = ON
rpl_semi_sync_slave_enabled = ON
loose_rpl_semi_sync_master_enabled = ON
loose_rpl_semi_sync_slave_enabled = ON
rpl_semi_sync_master_timeout = 10000 # milliseconds (10 seconds)
rpl_semi_sync_master_wait_point = AFTER_COMMIT
Important notes:
Expected warning on slaves:
Timeout value (10 seconds):
Wait point:
Use AFTER_COMMIT (default) for client safety
Avoid AFTER_SYNC despite fewer binlog transactions after crash
Benefits:
Reference: /pages/07.howto/01.replication-best-practice/docs.md:30
Parameter: monitoring-enforce-best-practices
When enabled: replication-manager dynamically adjusts database settings to match best practices.
Warning: Dynamic changes are lost on replication-manager restart unless saved to config.
Recommendation:
DON'T rely on dynamic enforcement - instead:
Permanent settings example (my.cnf):
[mysqld]
sync_binlog = 1
innodb_flush_log_at_trx_commit = 1
slave_parallel_mode = optimistic
slave_parallel_threads = 4
rpl_semi_sync_master_enabled = ON
rpl_semi_sync_slave_enabled = ON
Use dynamic enforcement for: Testing and validation, not production operations.
Reference: /pages/07.howto/02.enforce-best-practice/docs.md
Backup types available:
Logical backups:
Physical backups:
Snapshot backups:
Configuration parameters:
# Storage location
backup-logical-type = "mysqldump" # or "mydumper"
backup-physical-type = "mariabackup"
backup-disk-threshold-warn = 85
backup-disk-threshold-crit = 95
# Restic backups with auto-purge
backup-restic = true
backup-restic-purge-oldest-on-disk-space = true
backup-restic-purge-oldest-on-disk-threshold = 90
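The purge trigger reduces to a percentage comparison. A minimal sketch of the decision (assumed semantics of the threshold parameter, not the daemon's code):

```python
# Purge the oldest restic snapshots once disk usage reaches the configured
# percentage (mirrors backup-restic-purge-oldest-on-disk-threshold = 90).
# Illustrative model only.
def should_purge(used_bytes: int, total_bytes: int,
                 threshold_pct: int = 90) -> bool:
    return 100 * used_bytes / total_bytes >= threshold_pct

assert should_purge(95, 100)        # 95% used -> purge oldest snapshots
assert not should_purge(50, 100)    # 50% used -> keep everything
```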
Recommendation:
Reference: /pages/05.configuration/14.maintenance/02.backups/docs.md