Architectural deep-dive of GOV.UK
We can cover a lot of GOV.UK architecture by asking ourselves three questions:
- What happens when a user visits a page on GOV.UK?
- What happens when a publisher hits ‘Publish’?
- What happens when a developer deploys a change to an application?
What happens when a user visits a page on GOV.UK?
The browser queries a local DNS server to turn the domain name into an IP address. The local DNS server might be able to answer immediately from its cache. If not, it will query the authoritative name servers for the domain.
gov.uk. is a country-code second-level domain or ccSLD (like co.uk.) and its authoritative servers are hosted by Jisc. Unusually for a ccSLD, gov.uk is also a website, and hosts the redirect to www.gov.uk. The records for these two domains are within the gov.uk second-level zone hosted by Jisc.
GDS hosts several third-level domains under gov.uk on Amazon Route53, for example service.gov.uk. The configuration for these is in the govuk-dns-config repo and is deployed using code from the govuk-dns repo.
www.gov.uk is a CNAME record which ultimately points to a fastly.net domain name. That domain name is hosted by special nameservers at the Fastly content delivery network, which aim to respond with the IP address of the Fastly cache node which is “closest” to the user. Read more about Fastly in the next section, or read more about gov.uk DNS.
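The lookup described above can be sketched as a toy resolver that answers from its cache when it can and otherwise follows the CNAME chain until it reaches an address record. This is an illustration only: the zone data, hostnames and the `CachingResolver` class are hypothetical, and real resolvers also honour record TTLs, which this sketch ignores.

```python
from typing import Dict, Tuple

# Hypothetical zone data: names and addresses are placeholders, not real records.
AUTHORITATIVE: Dict[str, Tuple[str, str]] = {
    "www.example.test.": ("CNAME", "www.cdn.example.test."),
    "www.cdn.example.test.": ("A", "192.0.2.10"),
}


class CachingResolver:
    """Toy resolver: answers from cache, otherwise queries the authoritative data."""

    def __init__(self, authoritative: Dict[str, Tuple[str, str]]) -> None:
        self.authoritative = authoritative
        self.cache: Dict[str, str] = {}
        self.upstream_queries = 0

    def resolve(self, name: str, max_hops: int = 8) -> str:
        if name in self.cache:
            return self.cache[name]  # answered immediately from cache
        current = name
        for _ in range(max_hops):
            self.upstream_queries += 1
            record_type, value = self.authoritative[current]
            if record_type == "A":
                self.cache[name] = value
                return value
            current = value  # CNAME: continue with the name it points to
        raise RuntimeError("CNAME chain too long")


resolver = CachingResolver(AUTHORITATIVE)
first = resolver.resolve("www.example.test.")   # two upstream queries
second = resolver.resolve("www.example.test.")  # served from cache
```

The second call costs no upstream queries, which is the behaviour the local DNS server's cache provides.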
CDN and caching
GOV.UK uses the Fastly CDN to handle the majority of requests, which - as well as reducing load on GOV.UK (‘origin’) by around 70% - provides ‘edge nodes’ (servers) that are closer to our end users (particularly those outside the UK).
Fastly uses Varnish for caching, with a default cache time of 1 hour. (Read “Our content delivery network” for more information). If Fastly doesn’t have a page in its cache, it fetches the page from origin.
Caches can be purged using the cache-clearing-service, which tells Fastly to soft-purge (i.e. only remove the cached version once it has received the new version from origin). This cache clearing service is triggered automatically when pages are updated - more on that later.
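As a rough sketch of the soft-purge behaviour described above, the following toy cache marks an entry as stale instead of deleting it, and only replaces it once a fresh copy has been fetched from origin. It is not Fastly's implementation: the `EdgeCache` class, its method names and the injected clock are all assumptions made for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict

DEFAULT_TTL_SECONDS = 3600  # the 1 hour default cache time mentioned above


@dataclass
class Entry:
    body: str
    expires_at: float
    stale: bool = False


class EdgeCache:
    """Toy edge cache with a TTL and a soft purge (hypothetical API)."""

    def __init__(self, fetch_from_origin: Callable[[str], str],
                 clock: Callable[[], float] = time.time) -> None:
        self.fetch_from_origin = fetch_from_origin
        self.clock = clock
        self.entries: Dict[str, Entry] = {}

    def soft_purge(self, path: str) -> None:
        # Mark the entry stale but keep it until a fresh copy arrives.
        if path in self.entries:
            self.entries[path].stale = True

    def get(self, path: str) -> str:
        entry = self.entries.get(path)
        now = self.clock()
        if entry and not entry.stale and now < entry.expires_at:
            return entry.body  # cache hit
        try:
            body = self.fetch_from_origin(path)
        except Exception:
            if entry and entry.stale:
                return entry.body  # origin failed: keep serving the old copy
            raise
        self.entries[path] = Entry(body, now + DEFAULT_TTL_SECONDS)
        return body


origin_pages = {"/example": "v1"}
origin_calls = []


def fetch(path: str) -> str:
    origin_calls.append(path)
    return origin_pages[path]


cache = EdgeCache(fetch, clock=lambda: 0.0)
a = cache.get("/example")     # miss: fetched from origin
b = cache.get("/example")     # hit
origin_pages["/example"] = "v2"
c = cache.get("/example")     # still the cached v1
cache.soft_purge("/example")
d = cache.get("/example")     # stale entry replaced by v2
```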
If a Fastly request to origin returns a 5xx response, Fastly will request content from the mirror, which is static HTML hosted in an S3 bucket on AWS. The contents of the mirror are updated daily via a govuk-crawler-worker, which recursively crawls GOV.UK URLs from a message queue, visiting the pages and saving the output to disk.
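The failover rule above amounts to a small conditional. In this minimal sketch, `origin` and `mirror` are hypothetical callables standing in for the two backends; in practice this logic lives in the CDN configuration, not in application code.

```python
from typing import Callable, Tuple

Response = Tuple[int, str]  # (status code, body)


def fetch_with_mirror_fallback(
    path: str,
    origin: Callable[[str], Response],
    mirror: Callable[[str], Response],
) -> Response:
    """Ask origin first; on any 5xx response, serve the static mirror instead."""
    status, body = origin(path)
    if 500 <= status <= 599:
        return mirror(path)
    return status, body


def healthy_origin(path: str) -> Response:
    return 200, f"live page for {path}"


def broken_origin(path: str) -> Response:
    return 503, "service unavailable"


def static_mirror(path: str) -> Response:
    return 200, f"mirrored page for {path}"


ok = fetch_with_mirror_fallback("/example", healthy_origin, static_mirror)
fallback = fetch_with_mirror_fallback("/example", broken_origin, static_mirror)
```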
Routing on the CDN
As well as for caching, Varnish is used for the redirection to www.gov.uk, which is configured in Varnish Configuration Language (VCL) and uploaded directly to Fastly via govuk-cdn-config.
Other redirects that happen at the Fastly level include bouncer: a GOV.UK application responsible for redirecting traffic from old pre-GOV.UK websites. This is configured via transition and transition-config. Read Transition architecture for more detail.
Routing on GOV.UK
Getting to the ‘router’ application
Some requests make it through the CDN and cache layers to ‘origin’. Origin is a stack of computers in the cloud - in this case, AWS - and its entry point is a load balancer.
The load balancer knows based on the hostname which machine ‘class’ to route to. Different classes of machine run different sets of GOV.UK applications. How many machines are allocated to a class - and how big those machines are - is configured using Terraform, in govuk-aws. What runs on each machine class is configured in govuk-puppet, a file and process management system we’ll cover in more detail later.
External requests are routed to a ‘cache’ machine, where the request is received by an Nginx web server running on the machine. Nginx proxies some routes directly to other apps, such as asset URLs being routed to asset-manager - this is configured with govuk-puppet. However, Nginx proxies most requests to Varnish. If Varnish has the route response in its cache, that is returned, otherwise it proxies the request to router, which is a GOV.UK maintained application running on the cache machines.
Routing via ‘router’
‘Router’ is a reverse proxy app written in Go. It is designed to be fast, storing all known routes in memory using a prefix trie, which it loads from a MongoDB database of all known routes.
The MongoDB database is written to via router-api whenever a route is added or changed: the Content Store (which we’ll cover later) talks to Router API directly to do this. Router API then sends POST requests to every instance of Router, whereupon each Router then reloads all routes from MongoDB.
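To illustrate the idea of an in-memory prefix trie that is rebuilt from stored routes, here is a minimal Python sketch (Router itself is written in Go). It assumes every route is a prefix match on path segments, which is a simplification. The `RouteTrie` and `load_routes` names and the example app names are hypothetical, and a plain list stands in for the database read.

```python
from typing import Dict, Iterable, List, Optional, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.handler: Optional[str] = None


class RouteTrie:
    """Prefix trie keyed on path segments; lookup returns the longest match."""

    def __init__(self) -> None:
        self.root = TrieNode()

    @staticmethod
    def _segments(path: str) -> List[str]:
        return [segment for segment in path.split("/") if segment]

    def insert(self, path: str, handler: str) -> None:
        node = self.root
        for segment in self._segments(path):
            node = node.children.setdefault(segment, TrieNode())
        node.handler = handler

    def lookup(self, path: str) -> Optional[str]:
        node = self.root
        best = node.handler
        for segment in self._segments(path):
            child = node.children.get(segment)
            if child is None:
                break
            node = child
            if node.handler is not None:
                best = node.handler  # remember the deepest route seen so far
        return best


def load_routes(rows: Iterable[Tuple[str, str]]) -> RouteTrie:
    """Build a fresh trie from all stored routes, as a reload would."""
    trie = RouteTrie()
    for path, handler in rows:
        trie.insert(path, handler)
    return trie


stored_routes = [("/government", "app-a"), ("/government/publications", "app-b")]
trie = load_routes(stored_routes)

# A route is added, then every router instance reloads everything.
stored_routes.append(("/guidance", "app-c"))
reloaded = load_routes(stored_routes)
```

Rebuilding the whole trie on each reload keeps the reload logic simple, at the cost of re-reading every route whenever one changes.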
Routes have different handlers. Routes marked as gone return a 410 Gone response. Routes marked as redirect serve a 301 Moved Permanently response. These handlers are useful for when content is deleted or superseded. Most publishing apps provide a way of deleting or redirecting their content, but it’s worth noting the short-url-manager app, whose sole purpose is to create special redirect routes to allow the creation of short, memorable URLs that redirect to longer URLs.
Routes with a backend handler are routed to the relevant rendering app, based on the backend_id of the route, which is derived from the rendering_app field in the corresponding content item in the Content Store - we’ll cover this later. For example, if the route has a backend_id of frontend, it will forward the request to the frontend app.
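The three handlers can be summarised as a dispatch function. This is a sketch only: the `Route` dataclass, the `dispatch` function and the shape of its return value are assumptions for illustration, not Router's actual data model.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Route:
    handler: str                       # "gone", "redirect" or "backend"
    redirect_to: Optional[str] = None  # only used by redirect routes
    backend_id: Optional[str] = None   # only used by backend routes


def dispatch(route: Route) -> Tuple[str, Optional[int], Optional[str]]:
    """Return (action, status code, target) for a matched route."""
    if route.handler == "gone":
        return ("respond", 410, None)
    if route.handler == "redirect":
        return ("respond", 301, route.redirect_to)
    if route.handler == "backend":
        # The status code is decided by the rendering app, not by the router.
        return ("proxy", None, route.backend_id)
    raise ValueError(f"unknown handler: {route.handler}")


gone = dispatch(Route(handler="gone"))
redirect = dispatch(Route(handler="redirect", redirect_to="/new-location"))
backend = dispatch(Route(handler="backend", backend_id="frontend"))
```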
Once Router has forwarded the request to the right rendering app, the
rendering app itself has to do some routing. Most GOV.UK front-end apps are
built in Rails, which means typically there is a
routes.rb file that
maps the route to a controller. The controller takes the URL path and any
parameters and decides how to render the page.
Many pages require the application to make a request to the Content Store to retrieve the corresponding content. Some pages are associated with collections of content, rather than simply one content item. If it is a static collection, such as a homepage which references several news stories, then this remains just one content item that has been expanded via “link expansion” (which we’ll cover later) to ‘include’ the other content items within it. If it is a dynamic collection, such as a search results page, then content items are retrieved via the search-api.
Static JS/CSS is delivered over https://assets.publishing.service.gov.uk. Custom assets, such as images, are delivered over the same domain and uploaded by content designers via asset-manager. Under the hood, all of these assets live in an AWS S3 bucket.
The request is resolved through DNS, more often than not hitting the CDN/cache layers. Some requests make it through to origin, where they’re routed to the machine running the (usually Rails-based) rendering application that knows how to handle the request.
What happens when someone hits ‘Publish’?
Draft and live stacks
Everything you’ve just read about in the first section exists in two stacks: draft and live. These are very similar to each other: each is a collection of machines in the cloud, running GOV.UK applications. Everything that runs in the live stack also runs in the draft stack, in order to have a way of previewing content in a non-public-facing way. However, the draft stack also has additional machines that run the publishing apps.
Applications shouldn’t know what stack they’re in - they’re simply configured to talk to other applications in their stack.
The live stack entry point is the ‘router’ app. You can use the www-origin hostname to bypass Fastly and view the live stack at origin. This is only available to office IPs / VPN, and to Fastly IP addresses.
The draft stack entry point is authenticating-proxy, which sits in front of ‘router’. You can use the draft-origin hostname to view the draft stack at origin. The draft stack is not IP-restricted, as we need to be able to share links to be reviewed (“2i’d”) or fact-checked by non-Government departments. It is, however, only visible to users who have been verified through Authenticating Proxy, by signing into signon or by providing a valid auth_bypass_id (as a URI parameter or session cookie). Read more about previews in “How the draft stack works”.
Signon doubles up as an authorisation platform, as it associates users with arbitrary permissions, so a publishing app can query if the current user has the necessary permissions to perform a given action, such as publishing content.
Publishing API vs Content Store
At this point it’s probably worth summarising what “content” is. Almost every piece of content on GOV.UK lives in a database called “content-store”, which stores only the latest “edition” of that content. Internally the content is referred to as a “document”, even if it is not itself a document. Content is retrieved via the “Content API”, which lives in the content-store repo.
Content is published to the Content Store via the Publishing API, which stores all of the editions of the document, and performs validation checks whenever it receives a new edition. Every piece of content corresponds to a particular JSON schema defined in govuk-content-schemas. Most backend apps have their own databases modelling documents in their own way; at the point of sending the document to Publishing API, they transform the document to a JSON payload conforming to the appropriate schema.
When a new edition is sent to Publishing API, it is automatically published to the draft Content Store, replacing whatever contents existed for that document beforehand. An edition must be explicitly published for it to go to the live Content Store, where it becomes visible to the outside world.
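The relationship between editions, the draft Content Store and the live Content Store can be modelled in a few lines. This is an in-memory sketch under stated assumptions: the class and method names are hypothetical, and the check for a title is a stand-in for real schema validation.

```python
from typing import Dict, List


class PublishingApiSketch:
    """Toy model: keeps every edition, while each store keeps only the latest."""

    def __init__(self) -> None:
        self.editions: Dict[str, List[dict]] = {}  # full history per document
        self.draft_store: Dict[str, dict] = {}     # latest edition
        self.live_store: Dict[str, dict] = {}      # latest published edition

    def put_content(self, content_id: str, edition: dict) -> None:
        # Stand-in for validating the payload against its JSON schema.
        if "title" not in edition:
            raise ValueError("edition failed validation: missing title")
        self.editions.setdefault(content_id, []).append(edition)
        # Every new edition replaces the draft automatically.
        self.draft_store[content_id] = edition

    def publish(self, content_id: str) -> None:
        # Publishing is an explicit step that copies the draft to live.
        self.live_store[content_id] = self.draft_store[content_id]


api = PublishingApiSketch()
api.put_content("doc-1", {"title": "First edition"})
live_before_publish = "doc-1" in api.live_store
api.publish("doc-1")
api.put_content("doc-1", {"title": "Second edition"})
```

After the second `put_content`, the draft store holds the second edition while the live store still holds the first, until `publish` is called again.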
Link expansion, mentioned earlier under rendering, is the process of joining related content items (such as the title and details of a document’s parent, used for navigational breadcrumbs) into a single JSON payload, so that rendering apps don’t need to handle the complexity of pulling all of that data together manually. Link expansion happens in Publishing API at the point of sending an edition downstream to the Content Store.
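A minimal sketch of the idea, assuming content items are plain dictionaries with a `links` mapping from link type to a list of content IDs. The field names, the `expand_links` function and the single level of expansion are illustrative assumptions, not the Publishing API's actual rules.

```python
from typing import Dict, Tuple


def expand_links(
    content_id: str,
    items: Dict[str, dict],
    fields: Tuple[str, ...] = ("title", "base_path"),
) -> dict:
    """Return a copy of the item with selected fields of its linked items embedded."""
    item = items[content_id]
    expanded = {}
    for link_type, target_ids in item.get("links", {}).items():
        expanded[link_type] = [
            {field: items[target_id][field] for field in fields}
            for target_id in target_ids
            if target_id in items
        ]
    return {**item, "expanded_links": expanded}


items = {
    "child": {
        "title": "Child page",
        "base_path": "/parent/child",
        "links": {"parent": ["parent"]},
    },
    "parent": {"title": "Parent page", "base_path": "/parent", "links": {}},
}

payload = expand_links("child", items)
```

A rendering app receiving `payload` can build a breadcrumb from `payload["expanded_links"]["parent"]` without making a second request.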
Downstream Sidekiq background processing triggered by publishing
The Publishing API could update the Content Store directly, but the scale of GOV.UK means we’re safer offloading that call to a background process to be processed when resources become available. In addition, when we publish a new edition, we often want to trigger some other actions as a result. For example, we want to send an email to anyone subscribed to that content.
We use Sidekiq to manage the background processing. When each Sidekiq process is evaluated, a message is put onto a RabbitMQ queue (which runs on its own machines in Carrenza and AWS). RabbitMQ is a message broker: when a message is broadcast to a RabbitMQ exchange, it forwards the message to its consumers. These consumers retrieve the content item and do something in response, such as:
- Clear the page’s cache, via the cache-clearing-service
- Send emails to users subscribed to that content. (Exceptions to this include specialist-publisher, which communicates directly with email-alert-api to ensure emails go out immediately.)
Content Store registers the route for the content item via Router API. This happens inside the Sidekiq job directly, rather than in a downstream process.
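The broadcast pattern described above can be sketched with an in-memory stand-in for an exchange. This does not use a RabbitMQ client library: the `Exchange` class, the consumer functions and the message fields are hypothetical.

```python
from typing import Callable, List

Message = dict
Consumer = Callable[[Message], None]


class Exchange:
    """In-memory stand-in for a message exchange that fans out to consumers."""

    def __init__(self) -> None:
        self.consumers: List[Consumer] = []

    def bind(self, consumer: Consumer) -> None:
        self.consumers.append(consumer)

    def publish(self, message: Message) -> None:
        # Every bound consumer receives its own copy of the message.
        for consumer in self.consumers:
            consumer(dict(message))


purged_paths: List[str] = []
emails_sent: List[str] = []


def clear_cache(message: Message) -> None:
    purged_paths.append(message["base_path"])


def send_email_alerts(message: Message) -> None:
    emails_sent.append(f"Updated: {message['title']}")


exchange = Exchange()
exchange.bind(clear_cache)
exchange.bind(send_email_alerts)
exchange.publish({"base_path": "/example-page", "title": "Example page"})
```

Because consumers are bound independently, a new downstream action can be added without changing the publisher.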
Sidekiq queues: high and low priority
Updating one content item often requires updating other pieces of content. For example, if a content item’s title has been changed, then content items which refer to that content item will need to be updated to use the new title. Sometimes a single change can trigger changes in thousands of items.
Putting both the directly changed and indirectly changed content items on the same queue would mean it would take a long time to see the changes in a document you’ve edited. Generally, it is less important to see a quick change to the indirectly changed content than it is to see a change in the directly changed content items. Therefore we have a concept of ‘high’ and ‘low’ priority queues.
The main content item is processed in the high priority queue. Exactly the same things happen to the low priority content items as the high priority content items; it just tends to take longer as there are more items to process.
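One way to picture the two queues is sketched below. It assumes strict priority, where the low priority queue is only read when the high priority queue is empty. That ordering is an assumption for illustration, as are the class and method names.

```python
from collections import deque
from typing import Deque, Iterable, Optional


class DownstreamQueues:
    """Two queues: directly changed items go to high, dependents go to low."""

    def __init__(self) -> None:
        self.high: Deque[str] = deque()
        self.low: Deque[str] = deque()

    def enqueue_change(self, changed_id: str, dependent_ids: Iterable[str]) -> None:
        self.high.append(changed_id)
        self.low.extend(dependent_ids)

    def next_job(self) -> Optional[str]:
        # Strict priority: drain the high queue before touching the low queue.
        if self.high:
            return self.high.popleft()
        if self.low:
            return self.low.popleft()
        return None


queues = DownstreamQueues()
queues.enqueue_change("edited-a", ["dependent-x", "dependent-y"])
queues.enqueue_change("edited-b", ["dependent-z"])

processed = []
job = queues.next_job()
while job is not None:
    processed.append(job)
    job = queues.next_job()
```

Both directly edited items are processed before any of the dependents, so editors see their own changes quickly even when a change fans out widely.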
The process for finding the content items affected by a content change is known as dependency resolution. Content items can be associated with other content items in a number of ways. For example, you may provide an array of organisation IDs in your payload when sending to Publishing API, to indicate that those organisations are responsible for the content (this is stored on the content item in the Content Store).
Content can also be tagged to taxonomies, which are used to describe where in the site hierarchy the content sits. These are stored on the content item under links.taxons. Some apps have their own interface for tagging, or you can tag content independently using content-tagger.
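At its simplest, dependency resolution is a reverse lookup over links: given a changed item, find every item that links to it. The sketch below handles direct dependents only and assumes content items are dictionaries with a `links` mapping. The function name and the data are hypothetical.

```python
from typing import Dict, List


def dependents_of(changed_id: str, items: Dict[str, dict]) -> List[str]:
    """Return the IDs of items that link directly to the changed item."""
    return sorted(
        content_id
        for content_id, item in items.items()
        if any(changed_id in targets for targets in item.get("links", {}).values())
    )


items = {
    "org-1": {"title": "Example organisation", "links": {}},
    "taxon-1": {"title": "Example taxon", "links": {}},
    "guide": {"title": "A guide", "links": {"organisations": ["org-1"], "taxons": ["taxon-1"]}},
    "news": {"title": "A news story", "links": {"organisations": ["org-1"]}},
}

affected_by_org = dependents_of("org-1", items)
affected_by_taxon = dependents_of("taxon-1", items)
```

If the organisation's title changes, both `guide` and `news` need to be re-sent downstream, which is the kind of fan-out handled by the low priority queue.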
The publishing app uses the Publishing API to create and synchronise a new edition of a document, which consolidates related content items into it prior to sending to Content Store. All affected content items are added to a publishing queue, which triggers downstream actions such as cache clearing and email alerts.
What happens when a developer deploys a change to an application?
Everything you’ve read about the live and draft stacks, you can now multiply threefold, as they each exist in the following environments:
- Integration
- Staging
- Production
Data is copied from Production to Staging - and from Staging to Integration - every 24 hours via automated Jenkins jobs. This way our environments are always roughly in sync, although it’s worth noting that email addresses are anonymised and access-limited documents are obfuscated before data is copied. The data copying is mostly configured in govuk-puppet, although apps not hosted on AWS are configured in env-sync-and-backup.
We have detailed docs on how to deploy an application. But what happens under the hood?
When a PR is opened against a GOV.UK repository, the corresponding Jenkins job on CI Jenkins runs the tests (although we’re gradually moving to GitHub Actions). The Jenkins jobs are created in the first place by being added to govuk-puppet, and configured to use the govuk-jenkinslib library to build and run the tests.
The tests report back to the PR as a GitHub check, though other checks may also be required before the PR can be merged (govuk-saas-config defines things such as whether branches must be up to date with ‘master’ before merging).
On merge, the same Jenkins job that ran the tests runs the tests again, then creates and pushes a git tag to GitHub, then sends a message to the deploy Jenkins environment to build the govuk-app-deployment job. This clones the repository, checks out the tag and deploys the code to the corresponding nodes on Integration using Capistrano. By default Capistrano performs a plain deployment, but it can also deploy ‘with migration’ or ‘with hard restart’, etc, depending on the nature of the change.
A developer must manually trigger a deployment to Staging and Production through the release app. This uses the same Jenkins/Capistrano pipeline as for Integration, but on the Staging and Production Jenkins environments respectively.
Some apps require extra care when deploying. Static, for example, requires a 30 minute wait between environment deploys. This is because Static is consumed directly by GOV.UK applications at runtime - not a gem version like a normal dependency - and responses from Static are cached for half an hour, so problems may not be visible until after this period. There are instructions in the release app UI for this.
Puppet on GOV.UK
As discussed in the routing section, we have different ‘classes’ of machines running in the cloud; to recap, the “cache” machines run the Router application. These classes are configured in govuk-puppet, which uses Puppet under the hood: tooling which configures resources such as files and processes.
Puppet runs in a master/agent setup. There is a single “puppetmaster” running on its own class of machine, whereas the agents run on all the other machines (irrespective of class). The Puppet master is in charge of keeping all of the Puppet agents in sync with itself.
Icinga alerts us to problems with our machines and apps. There is a wide variety of different alerts, all of which are configured in Puppet. Each puppet agent is responsible for configuring Icinga alerts using the Icinga Puppet module. Alerts might be triggered by an application’s health check endpoint being unavailable, or by a machine having low available disk space, or a number of other reasons.
For monitoring, ‘filebeat’ is used to send logs to Logit, and ‘statsd’ exports most monitoring metrics, which can be viewed in Grafana. This is configured in govuk_app_config, which is included in most GOV.UK apps. Read more about tooling for monitoring.
If an app release contains a major change such as a renamed environment variable, then it will require an application restart, which would bring the application offline on that machine. The load balancer would begin serving traffic from different nodes of the same class: for this reason it is important that all machines are updated at different times. Therefore, each instance runs puppet every half hour, with the Puppet agents configured to run after the puppetmaster at randomised times.
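The effect of staggering runs can be sketched by giving each machine an offset within the half-hour window. This sketch derives the offset from a hash of the hostname, so it is stable per machine but not truly random. That choice, the function name and the hostnames are assumptions for illustration, not how govuk-puppet schedules runs.

```python
import hashlib

RUN_INTERVAL_MINUTES = 30  # each instance runs Puppet every half hour


def run_offset_minutes(hostname: str) -> int:
    """Map a hostname to a stable minute offset within the run interval."""
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % RUN_INTERVAL_MINUTES


hosts = [f"cache-{n}.example.test" for n in range(1, 6)]
offsets = {host: run_offset_minutes(host) for host in hosts}
```

Because the offset depends only on the hostname, a machine keeps the same slot across runs, while different machines of the same class tend to land in different slots.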
When deploying Puppet, the latest versions of govuk-puppet and govuk-secrets are copied to the puppetmaster. On each Puppet run, the Puppet agent compares what it sees on its machine with what the puppetmaster says should be there, to check whether the two have diverged (“configuration drift”) and whether it should reset itself against the master. Only after all the Puppet agents have updated can you be confident that your Puppet change hasn’t broken anything, which is why you must wait 30 minutes between Puppet deploys.
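Drift detection can be pictured as a comparison between desired and actual state. In this sketch, state is a flat dictionary of resource names to values. The function names and the resources are hypothetical, and real Puppet catalogues are far richer than this.

```python
from typing import Dict, Tuple


def find_drift(desired: Dict[str, str], actual: Dict[str, str]) -> Dict[str, Tuple]:
    """Return {resource: (actual value, desired value)} for each mismatch."""
    drift = {}
    for resource in desired.keys() | actual.keys():
        want, have = desired.get(resource), actual.get(resource)
        if want != have:
            drift[resource] = (have, want)
    return drift


def converge(desired: Dict[str, str]) -> Dict[str, str]:
    """Reset the machine's state to exactly what the master describes."""
    return dict(desired)


desired_state = {"/etc/app.conf": "port=3000", "service:app": "running"}
actual_state = {"/etc/app.conf": "port=8080", "service:app": "running", "/tmp/stray": "x"}

drift = find_drift(desired_state, actual_state)
after_run = converge(desired_state)
```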
When code gets merged into the
master branch, it is automatically deployed
to Integration and a release tag is created in GitHub. Another Jenkins job uses
Capistrano to deploy the release to the relevant machines in the cloud. The
same process for Staging and Production is manually triggered by a developer.
Puppet is used to keep each node’s environment consistent, and to monitor the
health of each application.