This page describes what to do when an Icinga alert fires. For more information, search the govuk-puppet repo for the source of the alert.
Last updated: 17 Sep 2020

PostgreSQL: replication too far behind

"replication on the postgres standby is too far behind primary [value in bytes]"

Replication from primary to standby PostgreSQL machines is done using Write Ahead Log (WAL) files, which record a log of changes. The files are streamed to the standby machines, which replay them to stay in sync with the primary.

When this alert fires, the Postgres standby replication process may be struggling to keep up (due to load) or may have stopped altogether.

  • If one node is down (resulting in Graphite having NULL values for one metric), you should see a CRITICAL alert.

  • If both nodes are down then Graphite will return no data and an UNKNOWN alert will be raised.

Replication lag is measured as the difference, in bytes, between the XLOG locations on the primary and the standby. For a quick view of replication, check that the WAL sender and receiver processes are running:

  • on a postgresql-primary: ps -ef | grep sender

  • on a postgresql-standby: ps -ef | grep receiver
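To make the "difference in bytes" concrete, here is a minimal sketch of how two XLOG locations can be compared by hand. The function names lsn_to_bytes and replication_lag_bytes are hypothetical helpers, not part of the alert, and the locations in the example are made up. The psql queries in the comments assume PostgreSQL 9.x function names; PostgreSQL 10 and later use pg_current_wal_lsn() and pg_last_wal_replay_lsn() instead.

```shell
# Convert a PostgreSQL log location such as "16/B374D848" into a byte offset.
# The part before the slash is the high 32 bits and the part after it is the
# low 32 bits, both written in hexadecimal.
lsn_to_bytes() {
  lsn_hi="${1%%/*}"
  lsn_lo="${1##*/}"
  echo $(( (0x$lsn_hi << 32) + 0x$lsn_lo ))
}

# Lag in bytes: primary location ($1) minus standby replay location ($2).
replication_lag_bytes() {
  echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}

# Example with made-up locations. In practice the values might come from
# (assumption: PostgreSQL 9.x function names, run as a user that can query):
#   primary: psql -Atc "SELECT pg_current_xlog_location()"
#   standby: psql -Atc "SELECT pg_last_xlog_replay_location()"
replication_lag_bytes "0/3000060" "0/3000000"
```

A value that keeps growing between runs suggests the standby is not replaying, while a value that shrinks suggests it is catching up.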

Fix after an overload

If the standby has fallen too far behind, or is in an otherwise unrecoverable state, you may need to resync it.

Fix after a restore

If a backup has recently been restored to the primary, it may be that the primary did not keep enough WAL segments around to bring the standby up to date. This can be fixed by restoring the backup to the standby as well, after which normal replication will take over.
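As a rough way to judge whether the primary may still hold enough WAL for the standby to catch up, the lag can be compared with the amount of WAL the primary is configured to retain. This is only a sketch under assumptions: it assumes the default 16 MiB segment size, that retention is governed by the wal_keep_segments setting (PostgreSQL 12 and earlier; later versions use wal_keep_size), and that the standby has no WAL archive to fetch from. The function names and the numbers in the example are hypothetical.

```shell
# Bytes of WAL retained for a given wal_keep_segments value ($1),
# assuming the default segment size of 16 MiB.
wal_retained_bytes() {
  echo $(( $1 * 16 * 1024 * 1024 ))
}

# Succeeds if the lag in bytes ($1) fits within the WAL retained
# for the given wal_keep_segments value ($2).
standby_can_catch_up() {
  [ "$1" -le "$(wal_retained_bytes "$2")" ]
}

# Example with made-up numbers: 500000000 bytes of lag, 32 segments kept.
# The real setting can be read on the primary with: psql -Atc "SHOW wal_keep_segments"
if standby_can_catch_up 500000000 32; then
  echo "lag is within retained WAL"
else
  echo "lag exceeds retained WAL; restoring the backup to the standby is likely needed"
fi
```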

Fix for a new machine

If this is a new machine, or one you have recently resynced, PostgreSQL replication may be set up and running correctly while Graphite is not receiving the metrics it needs. In that case, restart collectd on the primary or the standby:

$ sudo service collectd restart