This page lists every file and API endpoint replication-manager writes or exposes during normal operation, failovers, and crashes — and explains how to read them when something goes wrong.
All runtime state is stored under monitoring-datadir (default /var/lib/replication-manager). Each cluster gets its own sub-directory named after the cluster section in the config file.
/var/lib/replication-manager/
├── panic.log # stacktrace on daemon panic (JSON)
├── carbonapi.log # embedded Graphite backend log
└── <cluster-name>/
├── clusterstate.json # full cluster state snapshot (persisted every tick)
├── sla.json # SLA counters (uptime, failover counts)
├── queryrules.json # active ProxySQL query rules
├── agents.json # OpenSVC agent list
├── <cluster-name>.toml # active runtime config (written by API/GUI changes)
├── cache.toml # cached config values
├── immutable.toml # config values locked from API changes
├── overwrite.toml # config values that override the main config
├── failover.YYYYMMDDHHMMSS.json # one file per failover/switchover event
├── sql_error.log # SQL errors from all DB connections
├── sql_general.log # all SQL statements executed by replication-manager
├── ca-cert.pem # generated CA certificate
├── ca-key.pem # generated CA private key
├── server-cert.pem # server TLS certificate
└── server-key.pem # server TLS private key
clusterstate.jsonWritten on every monitoring tick. Contains the complete cluster snapshot including:
ERR00076, WARN0108)dbServersCrashes) — cleared when all servers are back upfailoverHistory) — kept across restarts for PITR referencecat /var/lib/replication-manager/mycluster/clusterstate.json | jq .
# Show current state codes
cat /var/lib/replication-manager/mycluster/clusterstate.json | jq '.crashes'
# Show SLA
cat /var/lib/replication-manager/mycluster/sla.json | jq .
failover.YYYYMMDDHHMMSS.jsonOne file is written per failover or switchover event, named with a timestamp. Contains:
"switchover": true) or failover# List all failover events, most recent first
ls -lt /var/lib/replication-manager/mycluster/failover.*.json
# Inspect the most recent event
ls -t /var/lib/replication-manager/mycluster/failover.*.json | head -1 | xargs cat | jq .
The number of retained failover files is controlled by failover-log-file-keep (default: 5).
sql_error.log and sql_general.logsql_error.log — every SQL error returned when replication-manager queries the database (connection failures, permission errors, invalid queries)sql_general.log — every SQL statement replication-manager executes against the cluster (useful to audit what the daemon does at each step of a failover or rejoin)# Watch SQL errors in real time
tail -f /var/lib/replication-manager/mycluster/sql_error.log
# Find statements executed during a failover window
grep "2024-06-01 03:" /var/lib/replication-manager/mycluster/sql_general.log
panic.logIf replication-manager itself crashes, a JSON-formatted stacktrace is written here. Send this file to Signal18 support when reporting a daemon crash.
cat /var/lib/replication-manager/panic.log | jq .
See the Logs reference for the full list of log files and journalctl commands. For HA troubleshooting the most relevant are:
| File | What to look for |
|---|---|
Main log (log-file) |
STATE, ERR0, WARN0 lines around the incident time |
Maintenance log (*-maintenance.log) |
Rejoin and reseed progress |
sql_error.log |
Permission or connectivity errors that blocked a failover step |
sql_general.log |
Exact SQL sequence replication-manager ran during failover/rejoin |
# Find all state changes in the main log
grep 'STATE' /var/log/replication-manager.log
# Find failover-related lines around a specific time
grep '2024-06-01 03:' /var/log/replication-manager.log | grep -E 'failover|Failover|ERR0|STATE'
# Follow the maintenance log during a rejoin
tail -f /var/log/replication-manager-maintenance.log
All in-memory state is also accessible over the API without reading files:
| Endpoint | Content |
|---|---|
GET /api/clusters/{name}/topology/servers |
All servers with status, GTID, lag, replication health |
GET /api/clusters/{name}/topology/master |
Current primary details |
GET /api/clusters/{name}/topology/slaves |
All replicas with replication thread status |
GET /api/clusters/{name}/topology/alerts |
Active alert and warning state codes |
GET /api/clusters/{name}/topology/crashes |
Crash and failover history (in-memory) |
GET /api/clusters/{name}/topology/logs |
General cluster log buffer |
GET /api/clusters/{name}/topology/http-logs |
HTTP/API log buffer |
# Quick cluster health check
curl -sk -u admin:repman https://localhost:10005/api/clusters/mycluster/topology/servers | jq '.[] | {host: .host, port: .port, state: .state, gtid: .currentGtid, lag: .delay}'
# Active alerts
curl -sk -u admin:repman https://localhost:10005/api/clusters/mycluster/topology/alerts | jq .
# Crash history
curl -sk -u admin:repman https://localhost:10005/api/clusters/mycluster/topology/crashes | jq .
To get more detail without restarting the daemon, bump the relevant module log levels via config reload or API:
# In cluster config — reload with SIGHUP or via API
log-level = 4 # full DEBUG for all modules
log-level-topology = 4 # topology discovery detail
log-level-sql = 4 # every SQL statement with timing
log-level-sst = 4 # state transfer / rejoin steps
# Follow debug output in real time
journalctl -u replication-manager -f -p debug
# Or if log-file is configured
tail -f /var/log/replication-manager.log | grep -E 'DEBUG|topology|rejoin|failover'
| Symptom | Where to look |
|---|---|
| Unexpected failover | failover.*.json + main log STATE lines around the event time |
| Failover did not trigger | Main log — check false-positive check results (ERR00023, grace period messages) |
| Replica not rejoining | Maintenance log + sql_error.log for permission or GTID errors |
| Wrong server promoted | clusterstate.json — check replication health scores at the time |
| Daemon crashed | panic.log — JSON stacktrace |
| Proxy not updated after failover | Main log — search [proxy], [haproxy], [proxysql] module tags |
| Config not taking effect | <cluster>.toml, cache.toml, immutable.toml in the datadir — check which layer wins |