This page describes what to do in case of an Icinga alert. For more information, search the govuk-puppet repo for the source of the alert.
Warning: This document has not been updated for a while now. It may be out of date.
Last updated: 5 Oct 2022

Jenkins agent not connected to master

If the Jenkins agent is not connected to the master, look at the Jenkins Nodes UI to see whether the problem can be diagnosed and solved from there.

View the log for the relevant agent, for example by clicking on the hostname of the agent and then choosing Log from the menu on the left-hand side.

Logs can also be found on the agent machine under /var/lib/jenkins/remoting/logs.
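As a rough sketch of inspecting those logs from a shell on the agent, the helper below prints the last lines of the most recently modified file in the log directory. The function name `latest_log_tail` is hypothetical, the file names inside the directory are not assumed, and reading the files may require sudo depending on their permissions.

```shell
# Hypothetical helper: show the tail of the newest file in a log directory.
# Defaults to the agent log directory named above; both arguments are optional.
latest_log_tail() {
  log_dir="${1:-/var/lib/jenkins/remoting/logs}"
  lines="${2:-50}"
  # ls -t lists the most recently modified entry first
  latest="$(ls -t "$log_dir" 2>/dev/null | head -n 1)"
  if [ -z "$latest" ]; then
    echo "no log files found in $log_dir" >&2
    return 1
  fi
  tail -n "$lines" "$log_dir/$latest"
}
```

Running `latest_log_tail` with no arguments uses the default directory and prints 50 lines.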

Agent process failing to start

If the agent process is repeatedly starting up and failing with messages like:

deleting obsolete workspace /var/lib/jenkins/workspace/ishing-e2e-tests_govuk-test-TBY2AGKJXBE42MNITWCQM63I6R6XJSSGHBHBPH6P4IOBGXQK5OMA
Unexpected termination of the channel

then try clearing out the workspace directory on the agent:

$ gds govuk connect ssh -e ci ci_agent:0
$ sudo rm -r /var/lib/jenkins/workspace/*
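Before running the rm above, it may help to see which job workspaces exist and how large they are. The following is a minimal sketch, not part of the documented procedure; `workspace_usage` is a hypothetical name, and it may need to run with sudo to read every directory.

```shell
# Hypothetical helper: list each entry in the workspace directory with its
# size in kilobytes, smallest first.
workspace_usage() {
  ws_dir="${1:-/var/lib/jenkins/workspace}"
  # du -sk prints one "size<TAB>path" line per entry; sort -n orders by size
  du -sk "$ws_dir"/* 2>/dev/null | sort -n
}
```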

Agent SSH host key changed

If the master is failing to SSH into the agent with a message like “Host key verification failed”, then delete the relevant entry from the known_hosts file on the master. First, connect to the master and try to SSH to the agent as the jenkins user:

$ gds govuk connect ssh -e integration ci_master
$ sudo -u jenkins ssh ci-agent-<n>

where <n> is the number assigned to the node. The error message from ssh tells you which line of known_hosts to remove. Edit the file accordingly:

$ sudo nano /var/lib/jenkins/.ssh/known_hosts
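As an alternative to editing the file by hand, OpenSSH's `ssh-keygen -R` removes all entries for a given host and leaves a backup with an `.old` suffix. This is a sketch rather than the documented procedure: the function name `forget_host_key` is hypothetical, and it needs to run from a shell that can write to the file (for example, one started as the jenkins user).

```shell
# Hypothetical helper: remove every known_hosts entry for one host.
# Roughly equivalent to running, on the master:
#   sudo -u jenkins ssh-keygen -R HOSTNAME -f /var/lib/jenkins/.ssh/known_hosts
forget_host_key() {
  host="$1"
  known_hosts="${2:-/var/lib/jenkins/.ssh/known_hosts}"
  ssh-keygen -R "$host" -f "$known_hosts"
}
```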

Then restart the master process.

This problem should only occur if the agent machine gets recreated, which will change its SSH host key.

Restarting the master process

If there doesn’t seem to be a way to solve the problem, it may be necessary to restart the master process:

$ gds govuk connect ssh -e integration ci_master
$ sudo service jenkins restart

Restarting the master process will temporarily suspend all CI jobs while the master reconnects to the agents, but it should not cause any jobs to fail.
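To confirm that the service has come back after the restart, a small retry loop can poll a check command until it succeeds. This is only a sketch: `wait_until` is a hypothetical helper, and it assumes that `service jenkins status` exits with status 0 once Jenkins is running, which depends on how the service is set up on the machine.

```shell
# Hypothetical helper: run a command up to ATTEMPTS times, DELAY seconds apart,
# and succeed as soon as the command succeeds.
wait_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
# Example (assumed status check): wait_until 30 10 sudo service jenkins status
```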