This page describes what to do in case of an Icinga alert. For more information, search the govuk-puppet repo for the source of the alert.
Warning: This document has not been updated for a while. It may be out of date.
Last updated: 5 Apr 2022

Varnish port not responding

Under high load, the Varnish child process that handles connections can time out on the healthcheck from the parent. If that happens and the replacement child process also fails to start, Varnish can get into a state where it is unresponsive.

The ‘varnish port not responding’ check attempts to contact http://localhost:7999/ and get a response. If it gets no response, it raises an urgent alert, because this might lead to a third of user requests failing.
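To reproduce the check by hand on the affected machine, something like the following sketch may help. It is not the actual Icinga check: the function name `check_varnish_port`, the 5-second default timeout, and the choice to treat any HTTP response as "responding" are all assumptions made for illustration. Only the URL comes from the description above.

```shell
# Probe the Varnish listener and report in Nagios plugin style.
# Assumptions (not taken from the real check): any HTTP response within
# the timeout counts as "responding", and the default timeout is 5 seconds.
check_varnish_port() {
  url="${1:-http://localhost:7999/}"
  timeout_secs="${2:-5}"
  if curl --silent --output /dev/null --max-time "$timeout_secs" "$url"; then
    echo "OK: got a response from $url"
    return 0
  fi
  echo "CRITICAL: no response from $url"
  return 2
}
```

Running `check_varnish_port` with no arguments probes http://localhost:7999/. The exit codes 0 and 2 follow the usual Nagios plugin convention for OK and CRITICAL.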

To diagnose, check for messages like this:

$ grep varnishd /var/log/messages
[cache-3.router] out: Jul  7 00:17:02 cache-3 varnishd[1620]: Child (1630) died signal=3
[cache-3.router] out: Jul  7 00:17:03 cache-3 varnishd[1620]: child (27973) Started
[cache-3.router] out: Jul  7 00:17:25 cache-3 varnishd[1620]: Child (27973) said Child starts
[cache-3.router] out: Jul  7 00:17:25 cache-3 varnishd[1620]: Child (27973) said Child dies
[cache-3.router] out: Jul  7 00:17:25 cache-3 varnishd[1620]: Child (27973) died status=1
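
To see at a glance how often the child has died, the matching messages can be counted. This is a minimal sketch: the function name `count_varnish_child_deaths` is hypothetical, and the pattern assumes the log lines keep the `varnishd[PID]: Child (PID) died` shape shown above.

```shell
# Count "Child (...) died" messages from varnishd in syslog-style input.
# Reads log lines on stdin, so it can be fed from a file or a pipe.
count_varnish_child_deaths() {
  # grep -c exits non-zero when the count is 0; "|| true" keeps the
  # function's exit status clean while still printing the count.
  grep -c 'varnishd\[[0-9]*\]: Child ([0-9]*) died' || true
}
```

For example: `count_varnish_child_deaths < /var/log/messages`. Repeated deaths in a short window are consistent with the failure described above.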

You may see only one process called /usr/sbin/varnishd in the process table, owned by root. The child process, if it exists, is owned by nobody:

$ ps -ef | grep '[/]usr/sbin/varnishd'
[cache-3.router] out: root      8273     1  0 06:08 ?        00:00:00 /usr/sbin/varnishd -P /var/run/varnishd.pid -a :7999 -f /etc/varnish/default.vcl -T 127.0.0.1:6082 -t 900 -w 1,1000,120 -S /etc/varnish/secret -s malloc,5985M
[cache-3.router] out: nobody    8277  8273  1 06:08 ?        00:03:52 /usr/sbin/varnishd -P /var/run/varnishd.pid -a :7999 -f /etc/varnish/default.vcl -T 127.0.0.1:6082 -t 900 -w 1,1000,120 -S /etc/varnish/secret -s malloc,5985M

Check whether the current parent process has any children. In the failed state this check fails; the output below is from a machine where it succeeds:

$ /usr/lib/nagios/plugins/check_procs -c 1:1 -C 'varnishd' -p `< /var/run/varnishd.pid`
[cache-3.router] out: PROCS OK: 1 process with command name 'varnishd', PPID = 8273
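
If the Nagios plugin is not to hand, a rough equivalent on Linux is to count the parent's direct children by reading /proc. This is only a sketch under stated assumptions: the function name `count_children` is hypothetical, it relies on the Linux `/proc/<pid>/status` layout, and unlike the check above it does not filter by command name.

```shell
# Count the direct children of a process by reading /proc (Linux only).
# Each /proc/<pid>/status file has a "PPid:" line naming the parent.
count_children() {
  parent_pid="$1"
  # Processes can exit while we scan, so read errors are discarded and a
  # non-zero grep exit status is ignored.
  { grep -l "^PPid:[[:space:]]*${parent_pid}\$" /proc/[0-9]*/status 2>/dev/null || true; } | wc -l
}
```

For example: `count_children "$(cat /var/run/varnishd.pid)"`. A result of 0 would suggest the parent has no child process to handle connections.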

You can view the number of 5xx errors by logging in to Logit and using this query:

host:cache* AND @fields.status:[500 TO 599]

To resolve Varnish port not responding

Restart Varnish on this machine:

$ sudo /etc/init.d/varnish restart