Tools: Icinga, Grafana and Graphite, Kibana and Fabric
Icinga is used to monitor alerts that we have set up. It can be a bit hard to navigate but there are only a few views you need to know about (listed in the left-hand navigation):
- Unhandled Services
- Alert History
Alert states are shown by colour:
- If an alert is red, it is critical
- If an alert is yellow, it is a warning
- If an alert is green, it has recently recovered
- If an alert is purple, Icinga cannot retrieve data for it
- If an alert shows the flapping icon, it is repeatedly switching on and off (‘flapping’)
A service may have two additional URLs associated with it which will assist in investigating alerts. These are included in the Icinga interface with these icons:
- Action URL typically links to a graph. If the check uses Graphite as its data source, the graph will also include the warning and critical threshold bands.
- Notes URL links to a page in this manual describing why a given check exists and/or how to go about resolving it.
They will appear next to the service name in the service overview page or on the top-right of the page when viewing a specific service.
If you want to dig a little deeper into the history of a specific alert, click on it in the “Unhandled Services” view. In the top left of the main window there are a few links; “View Alert Histogram For This Service” and “View Trends For This Service” are particularly useful.
Grafana lets us create nice dashboards using Graphite data.
Useful Grafana dashboards:
The full list of Grafana dashboards is stored in the Puppet repo.
Graphite is a graphing tool that allows us to draw graphs of various metrics that we put into it. Graphite has two main views: a composer to build individual graphs and a dashboard to put multiple graphs together.
We are currently locked at version 0.9.13.
To build a graph, add one or more graph targets in the composer by clicking on them in the left frame. Some useful targets are:
- stats.cache-?_router.nginx_logs.www-origin.http_5xx to graph the rate of HTTP errors from all cache machines (note the question mark, which pattern-matches multiple data series; * also works).
- stats.backend-?_backend.nginx_logs.content-store_publishing_service_gov_uk.http_5xx to show HTTP errors for a specific app on all backend machines.
The composer offers tab completion, although it doesn’t handle patterns very well.
To add one of these graphs to a dashboard, you can copy the graph image URL and select Graphs → New Graph → From URL from the dashboard menu.
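The graph image URL is a request to Graphite's render endpoint, so it can also be assembled by hand. The following is a minimal sketch, assuming a hypothetical Graphite host (`graphite.example.com`) and a one-hour relative window; the helper name `render_url` is made up for the example, while `target`, `from`, `width` and `height` are standard render parameters.

```python
from urllib.parse import urlencode

# Hypothetical host, used only for illustration; substitute the real Graphite URL.
GRAPHITE_BASE = "https://graphite.example.com"


def render_url(targets, time_from="-1h", width=800, height=400):
    """Build a Graphite render URL for one or more targets."""
    # Each target becomes its own repeated "target" query parameter.
    params = [("target", t) for t in targets]
    params += [("from", time_from), ("width", width), ("height", height)]
    # urlencode percent-encodes special characters such as "?" in patterns.
    return GRAPHITE_BASE + "/render?" + urlencode(params)


url = render_url(["stats.cache-?_router.nginx_logs.www-origin.http_5xx"])
print(url)
```

Passing several targets in the list draws them on the same graph, which mirrors adding multiple targets in the composer.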
Both Graphite views let you adjust the time range of graphs, though in different ways. The composer view offers two buttons to select absolute and relative time ranges, and the dashboard view has labelled buttons.
Apply Graphite functions to your data to make it more useful.
One particularly useful Graphite function is keepLastValue. If your graphs come out nearly black with a few spots of colour in them, you probably want this one. Both views have an “Apply Function” button.
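To see why keepLastValue helps, consider what it does to a sparse series: points with no data are filled with the most recent known value, so the line is drawn continuously instead of as isolated dots. The following is a simplified pure-Python illustration of that idea, not Graphite's own implementation; the function name and sample data are made up for the example.

```python
def keep_last_value(values):
    """Replace each None with the most recent non-None value before it."""
    filled = []
    last = None  # nothing is known before the first real data point
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


sparse = [None, 3, None, None, 5, None]
print(keep_last_value(sparse))  # [None, 3, 3, 3, 5, 5]
```

In the composer, the equivalent is to wrap a target in the function, for example keepLastValue(stats.cache-?_router.nginx_logs.www-origin.http_5xx).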
Kibana is a log viewer and search engine. Access GOV.UK Kibana through Logit.
In Kibana, you can filter down log messages to show you just the ones you want. Say you’ve spotted a large number of errors coming from the content store related to MongoDB connections, and you want to find out whether the MongoDB logs show anything strange.
You can narrow down which log messages you want using the column browser on the left; fields such as application are particularly useful. The magnifying glass symbol next to each value lets you build up a query string and tinker with it.
You can tweak the time range manually with the drop-down at the top or by dragging on the timeline.
Check out some of the useful Kibana queries to get an idea of what’s possible.
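As a rough sketch of the kind of query string the magnifying glass builds up, the helper below joins field filters into one string, assuming Kibana's Lucene-style syntax where `field:value` terms are combined with `AND`. The helper name is hypothetical, and the values (`content-store`, `mongodb`) are illustrative rather than taken from the real log schema.

```python
def lucene_query(filters, free_text=None):
    """Join field:"value" filters (and optional free text) with AND.

    Values are wrapped in double quotes; quotes inside values are not
    escaped, so this is only suitable for simple values.
    """
    terms = ['{}:"{}"'.format(field, value) for field, value in filters.items()]
    if free_text:
        terms.append(free_text)
    return " AND ".join(terms)


query = lucene_query({"application": "content-store"}, free_text="mongodb")
print(query)  # application:"content-store" AND mongodb
```

The resulting string can be pasted into the Kibana search bar and then narrowed further by hand.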
Logs are sent to Kibana using Filebeat.
The Fabric scripts are useful for running something on a set of machines. For instance, to restart all instances of the content store on backend boxes:
fab $environment class:backend app.reload:content-store
See the app.py class for the different methods you can use. To run more specific commands, you can run the following (sdo stands for sudo):
fab $environment class:backend sdo:"service content-store reload"
For more information, check out the Fabric scripts README.
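When scripting around the Fabric scripts, it can help to assemble the argument list programmatically. The sketch below builds (but does not run) the same command lines as above; the helper name `fab_command` and the `staging` environment value are assumptions for illustration.

```python
import shlex


def fab_command(environment, machine_class, task):
    """Return the argv list for running a Fabric task on one class of machines."""
    return ["fab", environment, "class:{}".format(machine_class), task]


reload_argv = fab_command("staging", "backend", "app.reload:content-store")
print(shlex.join(reload_argv))  # fab staging class:backend app.reload:content-store

# A task argument containing spaces stays a single argv element,
# and shlex.join quotes it when rendering a shell command line.
sdo_argv = fab_command("staging", "backend", "sdo:service content-store reload")
print(shlex.join(sdo_argv))
```

Handing the list to `subprocess.run` directly avoids shell quoting issues; `shlex.join` (Python 3.8+) is used here only to display the command.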