
Logs and Troubleshooting

6.3.1 Logs and Troubleshooting

replication-manager writes several log and diagnostic files. Knowing where each one lives and what it contains is the first step when investigating unexpected behaviour.

6.3.2 replication-manager Daemon Log

The main process log captures everything replication-manager does: topology discovery, failover decisions, API calls, job execution, and internal errors.

Setting	Effect
`log-file`	Path to write the daemon log. Defaults to stdout if not set. Example: `/var/log/replication-manager.log`
`log-level`	Verbosity from 1 (errors only) to 7 (full monitoring loop trace). Levels above 3 print every monitoring tick and are intended for debugging only.
`log-syslog`	Mirror all log output to the local UDP syslog port in addition to the log file. Useful for feeding Graylog, Splunk, or ELK without a file agent.
`log-rotate-max-age`	Rotate logs older than this many days. Default: `7`.
`log-rotate-max-backup`	Keep at most this many rotated log files. Default: `7`.
`log-rotate-max-size`	Rotate when the log file exceeds this size in MB. Default: `5`.
`verbose`	Add extra debugging output on top of `log-level`.

All of these are set in the global (non-cluster) section of the configuration file. See Daemon Monitoring for the full parameter reference.
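
As a minimal sketch, the settings in the table could be combined as shown here. The `[Default]` header is an assumption about how the global section is named in your configuration file, and the boolean type of `log-syslog` and `verbose` is likewise assumed. The values are the defaults or the example path from the table, not recommendations.

```toml
# Global (non-cluster) section. The [Default] header is an assumption;
# use whatever name your configuration file gives the global section.
[Default]
log-file = "/var/log/replication-manager.log"  # example path from the table
log-level = 2               # normal operations including state changes
log-syslog = false          # assumed boolean; true mirrors output to local syslog
log-rotate-max-age = 7      # days
log-rotate-max-backup = 7   # rotated files to keep
log-rotate-max-size = 5     # MB
verbose = false             # assumed boolean; true adds extra debugging output
```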

6.3.2.1 Typical log-level guide

Level	When to use
`1`	Production — errors and critical events only
`2`	Default — normal operations including state changes
`3`	Failover debugging — shows decision logic
`5–7`	Full trace — every SQL sent to monitored servers, every monitoring tick

6.3.3 Per-Cluster Diagnostic Files

These files are written inside the replication-manager data directory, one subdirectory per cluster:

{monitoring-datadir}/{cluster-name}/
├── sql_general.log      All SQL sent to backend servers during monitoring
├── capture_*.log        Snapshot dumps triggered by error codes
└── {host}_{port}/
    └── ...              Per-server config and state files

6.3.3.1 sql_general.log

Enabled by log-sql-in-monitoring = true. Records every SQL query that replication-manager sends to monitored servers (SHOW SLAVE STATUS, SHOW GLOBAL STATUS, etc.).

This log is valuable during failover and rejoin analysis: you can replay exactly what replication-manager queried and in what order to reconstruct the decision path.

log-sql-in-monitoring = true

6.3.3.2 Capture files

When monitoring-capture is enabled, replication-manager automatically dumps a diagnostic snapshot whenever a monitored error code fires. Each capture saves:

SHOW SLAVE STATUS
SHOW PROCESSLIST
SHOW GLOBAL STATUS
SHOW ENGINE INNODB STATUS

Capture is triggered by the error codes listed in monitoring-capture-trigger (default: ERR00076,ERR00041, which cover the max connections threshold and slave delay). Once triggered, the dump runs for 5 monitoring ticks. The oldest files are purged once the count exceeds monitoring-capture-file-keep.

Setting	Default	Effect
`monitoring-capture`	`true`	Enable automatic capture on error
`monitoring-capture-trigger`	`"ERR00076,ERR00041"`	Comma-separated error codes that start a capture
`monitoring-capture-file-keep`	`5`	Maximum number of capture files to retain
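
For illustration, the fragment below writes the three defaults from the table out explicitly. Which section of the configuration file these keys belong in is not stated here, so their placement is an assumption to check against your own setup.

```toml
# Capture settings with the documented defaults made explicit.
# Placement (global vs. cluster section) is an assumption.
monitoring-capture = true
monitoring-capture-trigger = "ERR00076,ERR00041"  # comma-separated error codes
monitoring-capture-file-keep = 5                  # oldest capture files are purged beyond this count
```

To capture on a different condition, the trigger string would be edited to list the error codes of interest, in the same comma-separated form.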

6.3.4 Memory Profiling

memprofile = "/tmp/repmgr.mprof"

When set, replication-manager writes a Go memory profile to this path. Load it with go tool pprof to investigate memory growth:

go tool pprof /tmp/repmgr.mprof

6.3.5 Common Troubleshooting Steps

6.3.5.1 Replication-manager won't connect to a server

Set log-level = 5 temporarily and restart replication-manager
Watch the daemon log for the connection attempt and the error it reports
Verify db-servers-hosts, db-servers-credential, and network access
Make sure log-sql-in-monitoring = true is set so that sql_general.log shows the exact queries being sent
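
The logging changes in the steps above can be sketched as a temporary configuration fragment. The section each key belongs in is an assumption here; only the key names and values come from this page.

```toml
# Temporary debugging settings; revert once the connection problem is understood.
log-level = 5                 # full trace, very verbose
log-sql-in-monitoring = true  # writes monitoring queries to sql_general.log
```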

6.3.5.2 Failover triggered unexpectedly

Check the daemon log around the time of the event — look for the ERR code that fired
Check the capture files in {monitoring-datadir}/{cluster-name}/ — they contain the exact server state at the moment of the error
Look at monitoring-capture-trigger to see which codes arm a capture

6.3.5.3 Alert firing but no notification received

Verify the error code is in monitoring-alert-trigger (see Alerting Configuration Guide)
Check if the code is in monitoring-ignore-errors (suppression list)
Check if scheduler-alert-disable is active (scheduled blackout)
Set log-level = 3 to see alert dispatch in the daemon log

6.3.5.4 High log volume

Lower log-level back to 1 or 2 for production
Set log-rotate-max-size to cap how large the log file grows before it is rotated
Use log-syslog with a log aggregator that handles volume and retention externally
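
As a rough sketch of how the rotation settings bound disk usage, the fragment below uses the documented defaults. The estimate in the comments assumes rotated files are kept uncompressed and that the active file is rotated at roughly the size limit. Neither is confirmed on this page, so treat the figure as an approximation.

```toml
# Production-oriented logging sketch using the documented defaults.
log-level = 2
log-rotate-max-size = 5     # MB per file before rotation
log-rotate-max-backup = 7   # rotated files kept
log-rotate-max-age = 7      # days
log-syslog = true           # assumed boolean; lets an aggregator handle retention

# Approximate upper bound on disk used by the daemon log:
#   max-size * (max-backup + 1 active file) = 5 MB * (7 + 1) = 40 MB
```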