Metrics are measurements of something. GOV.UK uses metrics to monitor the service in real time, and stores these metrics over time, which helps in understanding changes as they occur.
Graphite is the service used on GOV.UK to store metrics. Normally, metrics are sent by applications to another service called statsd, which will run on the local machine, and then the statsd service will forward the metrics to Graphite.
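As an illustration of the first hop described above, here is a minimal sketch of an application sending a metric to a local statsd over UDP using the plain-text statsd line format (`name:value|type`). The host, the port (8125 is statsd's usual default) and the metric name are assumptions for illustration; GOV.UK applications may well use a client library rather than raw sockets.

```python
import socket

STATSD_HOST = "127.0.0.1"  # assumption: statsd runs on the local machine
STATSD_PORT = 8125         # assumption: statsd's usual default UDP port


def format_metric(name, value, metric_type):
    """Build a statsd datagram of the form <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}"


def send_metric(name, value, metric_type="c", host=STATSD_HOST, port=STATSD_PORT):
    """Send one metric to statsd over UDP (fire and forget) and return the payload."""
    payload = format_metric(name, value, metric_type).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload


if __name__ == "__main__":
    # Hypothetical metric name, for illustration only.
    send_metric("example_app.requests", 1, "c")
```

Because UDP is connectionless, the application does not wait for a reply, so sending a metric adds very little overhead to the code being measured.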
Once metrics are stored, the Graphite web interface can then be used to query the metrics. There is another service, Grafana, which doesn’t store any data, but can access Graphite, and is another way in which you can view metrics about GOV.UK.
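Besides the web interface, Graphite exposes a render API over HTTP that can return the same data as JSON. The sketch below only builds such a URL; the hostname and metric name are placeholders, not the real GOV.UK values.

```python
from urllib.parse import urlencode

# Assumption: placeholder hostname, not the real GOV.UK Graphite address.
GRAPHITE_BASE_URL = "https://graphite.example.org"


def render_url(target, time_from="-1h", fmt="json", base_url=GRAPHITE_BASE_URL):
    """Build a URL for Graphite's render API for one target and time range."""
    query = urlencode({"target": target, "from": time_from, "format": fmt})
    return f"{base_url}/render?{query}"


if __name__ == "__main__":
    # Hypothetical metric name, for illustration only.
    print(render_url("stats.example_app.requests"))
```

The `target` parameter takes a dotted metric path, `from` accepts relative times such as `-1h`, and `format=json` asks for data points rather than a rendered image.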
One important difference between Graphite and Grafana for visualising metrics is that Grafana can present data from both Elasticsearch and Graphite, even on the same dashboard, whereas Graphite can only present data from Graphite.
Graphite stores metrics in a hierarchy, and the time intervals, retention periods and aggregation methods it uses are configurable.
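To make the retention settings more concrete: Graphite retention definitions are commonly written as comma-separated `interval:period` pairs, such as `10s:6h,1m:7d`. The sketch below works out how many data points each archive would hold. The values are illustrative only and are not GOV.UK's actual configuration, and the parser handles just single-letter units.

```python
# Seconds per unit for the single-letter suffixes handled by this sketch.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def to_seconds(text):
    """Convert a duration such as '10s' or '7d' to a number of seconds."""
    return int(text[:-1]) * UNIT_SECONDS[text[-1]]


def parse_retentions(definition):
    """Parse 'interval:period,...' into (interval_seconds, points_kept) pairs."""
    archives = []
    for part in definition.split(","):
        interval, period = part.strip().split(":")
        step = to_seconds(interval)
        archives.append((step, to_seconds(period) // step))
    return archives


if __name__ == "__main__":
    # Illustrative values only; not GOV.UK's actual configuration.
    for step, points in parse_retentions("10s:6h,1m:7d"):
        print(f"one point every {step}s, {points} points kept")
```

With this example definition, the most recent 6 hours are kept at 10-second resolution, and older data is aggregated into 1-minute points that are kept for 7 days.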
StatsD Gauges in Graphite
A gauge is one of the types of metrics that statsd can handle. In general, a gauge normally refers to a number that can go up or down.
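On GOV.UK, StatsD is configured to delete gauges. By default, once a metric has been sent to StatsD, StatsD keeps sending it to Graphite on every flush. With gauge deletion enabled, however, StatsD sends a gauge to Graphite only once, and does not send it again until a new value is reported.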
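The effect of deleting gauges can be shown with a toy model. This is a simplified sketch of the behaviour described above, not StatsD's own code; the class, its methods and the metric name are hypothetical.

```python
class GaugeStore:
    """Toy model of how gauges are held between flushes to Graphite."""

    def __init__(self, delete_gauges):
        self.delete_gauges = delete_gauges
        self.gauges = {}

    def receive(self, name, value):
        """Record the latest value reported for a gauge."""
        self.gauges[name] = value

    def flush(self):
        """Return the gauges that would be sent to Graphite in this flush."""
        sent = dict(self.gauges)
        if self.delete_gauges:
            # Forget every gauge once it has been sent.
            self.gauges.clear()
        return sent


if __name__ == "__main__":
    store = GaugeStore(delete_gauges=True)
    store.receive("example_app.queue_size", 42)  # hypothetical metric name
    print(store.flush())  # first flush after the value arrived
    print(store.flush())  # next flush, with no new value reported
```

With deletion enabled, the second flush sends nothing for the gauge, so gaps in a Graphite graph mean that no value was reported in that interval, not that the value was zero.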
Grafana on GOV.UK is used for dashboards. Dashboards can be created directly in the web interface (you must first log in with the username admin and the password admin), or added through govuk-puppet.
Usually you would develop dashboards in Grafana by editing them through the web interface, then export them and add them to govuk-puppet once you are happy with them. Adding a dashboard to govuk-puppet means that it can easily be kept in sync between environments.