Warning: This document has not been updated for a while. It may be out of date.
Last updated: 29 Jul 2024

GOV.UK's sitemap

GOV.UK’s sitemap is available at https://www.gov.uk/sitemap.xml. GOV.UK has too many pages to fit into a single sitemap file, so this file is a ‘sitemap index’ that references around 30 other XML files, such as https://www.gov.uk/sitemaps/sitemap_1.xml.
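As a rough illustration of what a sitemap index looks like to a consumer, the sketch below parses a small hand-written index and lists the child sitemaps it references. The sample XML is an assumption made for illustration (in particular the sitemap_2.xml entry); only the namespace comes from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hand-written sample for illustration only; the real index lists many more files.
SAMPLE_INDEX = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.gov.uk/sitemaps/sitemap_1.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.gov.uk/sitemaps/sitemap_2.xml</loc>
  </sitemap>
</sitemapindex>
"""


def child_sitemap_urls(index_xml: str) -> list[str]:
    """Return the <loc> of every <sitemap> entry in a sitemap index."""
    # Encode to bytes so the XML declaration's encoding is honoured.
    root = ET.fromstring(index_xml.encode("utf-8"))
    return [
        loc.text.strip()
        for loc in root.findall("sm:sitemap/sm:loc", SITEMAP_NS)
        if loc.text
    ]


if __name__ == "__main__":
    for url in child_sitemap_urls(SAMPLE_INDEX):
        print(url)
```

Each child file would then be a regular sitemap (a urlset) that can be parsed in the same way.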

How the sitemap is generated

Every morning, a search-api-generate-sitemap cronjob runs to generate a fresh sitemap.

The cronjob runs the sitemap:generate_and_upload rake task in search-api. The task iterates over all documents in Search API and generates a sitemap in the format specified at https://www.sitemaps.org/protocol.html. It also creates the sitemap index.
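The general shape of this kind of job can be sketched as follows: split the URLs into batches, write one sitemap per batch, then write an index that points at each file. This is not the search-api implementation (which is a Ruby rake task); the function names are hypothetical, and the 50,000-URL batch size is the per-file limit from the sitemaps.org protocol rather than a value taken from search-api.

```python
import xml.etree.ElementTree as ET

SITEMAP_XMLNS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# The sitemaps.org protocol allows at most 50,000 URLs per sitemap file.
MAX_URLS_PER_SITEMAP = 50_000


def chunk(urls, size):
    """Yield consecutive batches of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_urlset(urls):
    """Serialise one sitemap file containing the given page URLs."""
    root = ET.Element("urlset", xmlns=SITEMAP_XMLNS)
    for url in urls:
        ET.SubElement(ET.SubElement(root, "url"), "loc").text = url
    return ET.tostring(root, encoding="unicode")


def build_index(sitemap_urls):
    """Serialise the sitemap index that references each sitemap file."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_XMLNS)
    for url in sitemap_urls:
        ET.SubElement(ET.SubElement(root, "sitemap"), "loc").text = url
    return ET.tostring(root, encoding="unicode")


def generate(urls, base_url, max_per_file=MAX_URLS_PER_SITEMAP):
    """Return (index_xml, {filename: sitemap_xml}) for the given URLs."""
    files = {}
    for number, batch in enumerate(chunk(urls, max_per_file), start=1):
        files[f"sitemap_{number}.xml"] = build_urlset(batch)
    index = build_index([f"{base_url}/sitemaps/{name}" for name in files])
    return index, files


if __name__ == "__main__":
    pages = [f"https://www.gov.uk/page-{n}" for n in range(5)]
    index_xml, sitemap_files = generate(pages, "https://www.gov.uk", max_per_file=2)
    print(index_xml)
    print(sorted(sitemap_files))
```

A small max_per_file is used in the example only to make the batching visible; a real job would also need to upload the resulting files.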

The sitemap generator is configured to search for documents across all of Search API’s indexes.

How content gets into Search API

The preferred pattern is for content to be published via Publishing API. After an edition is changed, Publishing API publishes a message to the published_documents topic exchange it configured on startup. Interested parties, such as Search API, can subscribe to this exchange to perform post-publishing actions.

Search API listens to the publishing queue using the govuk_message_queue_consumer gem. Its MessageProcessor processes the indexing of the content.

However, message queues aren’t the only way to get content into Search API. Whitehall calls Search API directly, via Whitehall::SearchIndex, which is called by any model that includes the Searchable module. This legacy behaviour is recognised tech debt and should be removed.

Note that Whitehall should not end up submitting content to Search API both directly and via Publishing API. The Search API’s ‘migrated formats’ file controls which document types Search API expects from each source. It has a non_indexable section at the bottom that includes all of the Whitehall document types. When processing a message from Publishing API, Search API checks whether the document type is indexable and ignores the message if it is not.
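The check described above can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the dictionary stands in for the ‘migrated formats’ file, the "indexable" key and the listed document types are hypothetical, and the real logic lives in search-api’s Ruby code. Only the non_indexable name comes from the text above.

```python
# Hypothetical stand-in for the 'migrated formats' configuration.
# The document types listed are illustrative, not copied from search-api.
MIGRATED_FORMATS = {
    "indexable": {"guide", "answer"},
    "non_indexable": {"news_article", "publication"},
}


def should_index(message: dict, formats: dict = MIGRATED_FORMATS) -> bool:
    """Decide whether a Publishing API message should be indexed.

    Assumption: messages carry a "document_type" key, and any type that is
    not explicitly listed as indexable is ignored.
    """
    document_type = message.get("document_type")
    if document_type in formats["non_indexable"]:
        # Expected to arrive via the direct (legacy) route instead, so skip it.
        return False
    return document_type in formats["indexable"]


if __name__ == "__main__":
    print(should_index({"document_type": "guide"}))         # indexable type
    print(should_index({"document_type": "news_article"}))  # non_indexable type
```

Keeping the decision in one place like this is what prevents the same document being indexed through both routes.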

Indexes

Documents are spread across three ‘indexes’ in Search API:

  • govuk: the index populated by Publishing API, intended to encapsulate all GOV.UK content
  • government and detailed: the remaining legacy ‘content indexes’, encapsulating some Whitehall content and Detailed Guides respectively

There are two Search API ADRs documenting the decision to move to one govuk index: ADR-04 and ADR-06. Some legacy indexes (e.g. mainstream) have been fully migrated into it, but the two legacy indexes listed above remain.

To find out which index a piece of content is stored in, use Search API’s API: see "index": "government" in this example.
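A minimal sketch of reading that field from a response body is shown below. The response shape is an assumption (a top-level "results" list whose items carry "link" and "index" keys), and the sample payload and links are hand-written for illustration rather than taken from a real Search API response.

```python
import json

# Hand-written sample payload; the shape and the links are assumptions.
SAMPLE_RESPONSE = json.dumps(
    {
        "results": [
            {"link": "/government/example-page", "index": "government"},
            {"link": "/example-guide", "index": "govuk"},
        ]
    }
)


def index_by_link(response_body: str) -> dict[str, str]:
    """Map each result's link to the index it is stored in."""
    payload = json.loads(response_body)
    return {
        result["link"]: result["index"]
        for result in payload.get("results", [])
        if "link" in result and "index" in result
    }


if __name__ == "__main__":
    for link, index_name in index_by_link(SAMPLE_RESPONSE).items():
        print(f"{link} is stored in the {index_name} index")
```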