edit this page

Troubleshooting

4.6.1 Troubleshooting High Availability

This page lists every file and API endpoint replication-manager writes or exposes during normal operation, failovers, and crashes — and explains how to read them when something goes wrong.

4.6.2 Data Directory Layout

All runtime state is stored under monitoring-datadir (default /var/lib/replication-manager). Each cluster gets its own sub-directory named after the cluster section in the config file.

/var/lib/replication-manager/
├── panic.log                          # stacktrace on daemon panic (JSON)
├── carbonapi.log                      # embedded Graphite backend log
└── <cluster-name>/
    ├── clusterstate.json              # full cluster state snapshot (persisted every tick)
    ├── sla.json                       # SLA counters (uptime, failover counts)
    ├── queryrules.json                # active ProxySQL query rules
    ├── agents.json                    # OpenSVC agent list
    ├── <cluster-name>.toml            # active runtime config (written by API/GUI changes)
    ├── cache.toml                     # cached config values
    ├── immutable.toml                 # config values locked from API changes
    ├── overwrite.toml                 # config values that override the main config
    ├── failover.YYYYMMDDHHMMSS.json   # one file per failover/switchover event
    ├── sql_error.log                  # SQL errors from all DB connections
    ├── sql_general.log                # all SQL statements executed by replication-manager
    ├── ca-cert.pem                    # generated CA certificate
    ├── ca-key.pem                     # generated CA private key
    ├── server-cert.pem                # server TLS certificate
    └── server-key.pem                 # server TLS private key

4.6.3 Key Files Explained

4.6.3.1 `clusterstate.json`

Written on every monitoring tick. Contains the complete cluster snapshot including:

Current primary and all replicas with their status, GTID positions, and lag
Active error and warning state codes (e.g. ERR00076, WARN0108)
Crash history (dbServersCrashes) — cleared when all servers are back up
Failover history (failoverHistory) — kept across restarts for PITR reference

cat /var/lib/replication-manager/mycluster/clusterstate.json | jq .
# Show current state codes
cat /var/lib/replication-manager/mycluster/clusterstate.json | jq '.crashes'
# Show SLA
cat /var/lib/replication-manager/mycluster/sla.json | jq .

4.6.3.2 `failover.YYYYMMDDHHMMSS.json`

One file is written per failover or switchover event, named with a timestamp. Contains:

Old primary URL and elected new primary URL
Binlog file and position at the moment of the event
GTID positions (IO GTID, executed GTID set)
Semi-sync slave status at failover time
Whether it was a switchover ("switchover": true) or failover

# List all failover events, most recent first
ls -lt /var/lib/replication-manager/mycluster/failover.*.json

# Inspect the most recent event
ls -t /var/lib/replication-manager/mycluster/failover.*.json | head -1 | xargs cat | jq .

The number of retained failover files is controlled by failover-log-file-keep (default: 5).

4.6.3.3 `sql_error.log` and `sql_general.log`

sql_error.log — every SQL error returned when replication-manager queries the database (connection failures, permission errors, invalid queries)
sql_general.log — every SQL statement replication-manager executes against the cluster (useful to audit what the daemon does at each step of a failover or rejoin)

# Watch SQL errors in real time
tail -f /var/lib/replication-manager/mycluster/sql_error.log

# Find statements executed during a failover window
grep "2024-06-01 03:" /var/lib/replication-manager/mycluster/sql_general.log

4.6.3.4 `panic.log`

If replication-manager itself crashes, a JSON-formatted stacktrace is written here. Send this file to Signal18 support when reporting a daemon crash.

cat /var/lib/replication-manager/panic.log | jq .

4.6.4 Log Files

See the Logs reference for the full list of log files and journalctl commands. For HA troubleshooting the most relevant are:

File	What to look for
Main log (`log-file`)	`STATE`, `ERR0`, `WARN0` lines around the incident time
Maintenance log (`*-maintenance.log`)	Rejoin and reseed progress
`sql_error.log`	Permission or connectivity errors that blocked a failover step
`sql_general.log`	Exact SQL sequence replication-manager ran during failover/rejoin

# Find all state changes in the main log
grep 'STATE' /var/log/replication-manager.log

# Find failover-related lines around a specific time
grep '2024-06-01 03:' /var/log/replication-manager.log | grep -E 'failover|Failover|ERR0|STATE'

# Follow the maintenance log during a rejoin
tail -f /var/log/replication-manager-maintenance.log

4.6.5 REST API Endpoints

All in-memory state is also accessible over the API without reading files:

Endpoint	Content
`GET /api/clusters/{name}/topology/servers`	All servers with status, GTID, lag, replication health
`GET /api/clusters/{name}/topology/master`	Current primary details
`GET /api/clusters/{name}/topology/slaves`	All replicas with replication thread status
`GET /api/clusters/{name}/topology/alerts`	Active alert and warning state codes
`GET /api/clusters/{name}/topology/crashes`	Crash and failover history (in-memory)
`GET /api/clusters/{name}/topology/logs`	General cluster log buffer
`GET /api/clusters/{name}/topology/http-logs`	HTTP/API log buffer

# Quick cluster health check
curl -sk -u admin:repman https://localhost:10005/api/clusters/mycluster/topology/servers | jq '.[] | {host: .host, port: .port, state: .state, gtid: .currentGtid, lag: .delay}'

# Active alerts
curl -sk -u admin:repman https://localhost:10005/api/clusters/mycluster/topology/alerts | jq .

# Crash history
curl -sk -u admin:repman https://localhost:10005/api/clusters/mycluster/topology/crashes | jq .

4.6.6 Increasing Verbosity for an Investigation

To get more detail without restarting the daemon, bump the relevant module log levels via config reload or API:

# In cluster config — reload with SIGHUP or via API
log-level           = 4   # full DEBUG for all modules
log-level-topology  = 4   # topology discovery detail
log-level-sql       = 4   # every SQL statement with timing
log-level-sst       = 4   # state transfer / rejoin steps

# Follow debug output in real time
journalctl -u replication-manager -f -p debug

# Or if log-file is configured
tail -f /var/log/replication-manager.log | grep -E 'DEBUG|topology|rejoin|failover'

4.6.7 Common First Steps

Symptom	Where to look
Unexpected failover	`failover.*.json` + main log `STATE` lines around the event time
Failover did not trigger	Main log — check false-positive check results (`ERR00023`, grace period messages)
Replica not rejoining	Maintenance log + `sql_error.log` for permission or GTID errors
Wrong server promoted	`clusterstate.json` — check replication health scores at the time
Daemon crashed	`panic.log` — JSON stacktrace
Proxy not updated after failover	Main log — search `[proxy]`, `[haproxy]`, `[proxysql]` module tags
Config not taking effect	`<cluster>.toml`, `cache.toml`, `immutable.toml` in the datadir — check which layer wins