This page describes what to do in case of an Icinga alert. For more information, search the govuk-puppet repo for the source of the alert.

Icinga alerts

Varnish port not responding

Under high load, the Varnish child process that handles connections can time out on the health check from the parent process. If that happens and the replacement child process also fails to start, Varnish can get into a state where it is not responsive.

The ‘varnish port not responding’ check attempts to contact http://localhost:7999/ and get a response. If it gets no response, it raises an urgent alert, because this might lead to 1/3 of user requests failing.
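
As a rough illustration of what such a check does, the sketch below probes the same URL with curl and maps the result to Nagios-style exit codes. The function name `check_varnish_port` and the 5 second timeout are assumptions made for this example. It is not the actual check defined in govuk-puppet.

```shell
# Hypothetical helper, not the real Icinga check: probe the Varnish
# listener and return a Nagios-style exit code (0 = OK, 2 = CRITICAL).
check_varnish_port() {
    url="${1:-http://localhost:7999/}"
    # --silent and --output hide the response body; --max-time bounds the wait
    # (5 seconds is an assumed value). curl exits non-zero if it cannot
    # connect or the request times out.
    if curl --silent --output /dev/null --max-time 5 "$url"; then
        echo "OK: got a response from $url"
        return 0
    fi
    echo "CRITICAL: no response from $url"
    return 2
}
```

Run on the cache machine itself, `check_varnish_port` with no arguments probes http://localhost:7999/. Any HTTP response, including an error status, counts as OK in this sketch, because it only tests whether the port answers.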

To diagnose, check for messages like this:

$ fab $environment -H cache-3.router sdo:'grep varnishd /var/log/messages'
[cache-3.router] out: Jul  7 00:17:02 cache-3 varnishd[1620]: Child (1630) died signal=3
[cache-3.router] out: Jul  7 00:17:03 cache-3 varnishd[1620]: child (27973) Started
[cache-3.router] out: Jul  7 00:17:25 cache-3 varnishd[1620]: Child (27973) said Child starts
[cache-3.router] out: Jul  7 00:17:25 cache-3 varnishd[1620]: Child (27973) said Child dies
[cache-3.router] out: Jul  7 00:17:25 cache-3 varnishd[1620]: Child (27973) died status=1
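
To make this pattern easier to spot in a long log, a small filter can count the "died" lines. This is a minimal sketch: the function name `count_child_deaths` is hypothetical, and it assumes the log lines have the same shape as the sample above.

```shell
# Hypothetical helper: read log lines on stdin and print how many report
# a Varnish child process dying.
count_child_deaths() {
    # Matches "varnishd[<pid>]: Child (<pid>) died ..." but not the
    # "said Child dies" lines. grep -c prints 0 and exits 1 when nothing
    # matches, so "|| true" keeps the function's exit status at 0.
    grep -cE 'varnishd\[[0-9]+\]: Child \([0-9]+\) died' || true
}
```

For example, `grep varnishd /var/log/messages | count_child_deaths` on the affected machine. Two or more deaths close together, as in the sample above, would suggest that the replacement child also failed.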

You may see only one process called /usr/sbin/varnishd in the process table, owned by root. The child process, if it exists, is owned by nobody:

$ fab $environment -H cache-3.router do:'ps -ef | grep [/]usr/sbin/varnishd'
[cache-3.router] out: root      8273     1  0 06:08 ?        00:00:00 /usr/sbin/varnishd -P /var/run/ -a :7999 -f /etc/varnish/default.vcl -T -t 900 -w 1,1000,120 -S /etc/varnish/secret -s malloc,5985M
[cache-3.router] out: nobody    8277  8273  1 06:08 ?        00:03:52 /usr/sbin/varnishd -P /var/run/ -a :7999 -f /etc/varnish/default.vcl -T -t 900 -w 1,1000,120 -S /etc/varnish/secret -s malloc,5985M

Check whether the current parent process has any children. The example below shows the check succeeding; when Varnish is in the failed state, the check will fail instead:

$ fab $environment -H cache-3.router do:'/usr/lib/nagios/plugins/check_procs -c 1:1 -C 'varnishd' -p `< /var/run/`''
[cache-3.router] out: PROCS OK: 1 process with command name 'varnishd', PPID = 8273
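
The same parent/child relationship can also be read directly from `ps -ef` output. The sketch below is an illustration only: `count_varnish_children` is a hypothetical name, and it assumes plain `ps -ef` columns (UID, PID, PPID, C, STIME, TTY, TIME, CMD) without the fab output prefix.

```shell
# Hypothetical helper: read `ps -ef` output on stdin and print how many
# /usr/sbin/varnishd processes have another varnishd process as parent.
count_varnish_children() {
    awk '
        # Field 8 is the command, field 2 the PID, field 3 the parent PID.
        $8 == "/usr/sbin/varnishd" { is_varnish[$2] = 1; parent_of[$2] = $3 }
        END {
            children = 0
            for (pid in parent_of)
                if (parent_of[pid] in is_varnish) children++
            print children
        }
    '
}
```

For example, `ps -ef | count_varnish_children` on the cache machine. A result of 0 would correspond to the state described above, where only the root-owned parent process is running.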

You can view the number of 5xx errors by logging in to Logit and using this query:

host:cache* AND @fields.status:[500 TO 599]

To resolve Varnish port not responding

Restart Varnish on this machine:

$ fab $environment -H cache-3.router cache.restart
This page was last reviewed on 25 March 2019. It needs to be reviewed again on 25 September 2019 by the page owner #govuk-2ndline.