GOV.UK's sitemap
GOV.UK’s sitemap is available at https://www.gov.uk/sitemap.xml.
GOV.UK is far too big to fit into one sitemap, so this file is really a ‘sitemap index’ that references around 30 other XML files, such as https://www.gov.uk/sitemaps/sitemap_1.xml.
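As a rough illustration of what a sitemap index contains, the sketch below parses one and lists the child sitemaps it references. The sample XML is a hypothetical, trimmed-down stand-in for the real file (the second entry is invented for illustration), and `child_sitemap_urls` is a helper name made up for this example. Only the namespace comes from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical, trimmed-down sitemap index used purely for illustration.
SAMPLE_INDEX = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.gov.uk/sitemaps/sitemap_1.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.gov.uk/sitemaps/sitemap_2.xml</loc>
  </sitemap>
</sitemapindex>
"""


def child_sitemap_urls(index_xml: str) -> List[str]:
    """Return the <loc> of every <sitemap> entry in a sitemap index."""
    # Parse from bytes so the XML declaration's encoding is honoured.
    root = ET.fromstring(index_xml.encode("utf-8"))
    return [
        loc.text.strip()
        for loc in root.findall("sm:sitemap/sm:loc", SITEMAP_NS)
        if loc.text
    ]


urls = child_sitemap_urls(SAMPLE_INDEX)
for url in urls:
    print(url)
```

To run this against the live file, the same function could be given the body fetched from https://www.gov.uk/sitemap.xml.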
How the sitemap is generated
Every morning, a search-api-generate-sitemap cronjob runs to generate a fresh sitemap.
The cronjob runs the sitemap:generate_and_upload rake task in search-api. The task iterates over all documents in Search API and generates a sitemap in the format specified at https://www.sitemaps.org/protocol.html. It also creates the sitemap index.
The sitemap generator is configured to search for documents across all of Search API’s indexes.
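The general shape of this kind of job can be sketched as follows. This is not the search-api implementation (which is a Ruby rake task); it is a minimal Python illustration of splitting a list of URLs into several sitemap files plus an index, using the 50,000-URL-per-file limit from the sitemaps.org protocol. The function names, the `sitemap_N.xml` naming and the `base_url` parameter are assumptions made for this example.

```python
from typing import Dict, Iterator, List, Tuple
from xml.etree import ElementTree as ET

SITEMAP_XMLNS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file limit set by the sitemaps.org protocol


def chunk(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_urlset(urls: List[str]) -> str:
    """Build one sitemap file: a <urlset> with a <url><loc> per page."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_XMLNS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


def build_index(sitemap_urls: List[str]) -> str:
    """Build the sitemap index: a <sitemap><loc> per sitemap file."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_XMLNS)
    for url in sitemap_urls:
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(index, encoding="unicode")


def generate(
    urls: List[str], base_url: str, per_file: int = MAX_URLS_PER_FILE
) -> Tuple[str, Dict[str, str]]:
    """Return (index_xml, {filename: sitemap_xml}) for the given URLs."""
    files: Dict[str, str] = {}
    for number, batch in enumerate(chunk(urls, per_file), start=1):
        files[f"sitemap_{number}.xml"] = build_urlset(batch)
    index_xml = build_index([f"{base_url}/{name}" for name in files])
    return index_xml, files


# Tiny demonstration with an artificially small per-file limit.
index_xml, files = generate(
    ["https://www.gov.uk/a", "https://www.gov.uk/b", "https://www.gov.uk/c"],
    base_url="https://www.gov.uk/sitemaps",
    per_file=2,
)
```

With three URLs and a limit of two per file, the sketch produces two sitemap files and an index referencing both.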
Indexes
search-api-v2 has no concept of an ‘index’. search-api, on the other hand…
Documents are spread across three ‘indexes’ in Search API:
govuk: the index populated by Publishing API, intended to encapsulate all GOV.UK content
government and detailed: the remaining legacy ‘content indexes’, encapsulating some Whitehall content and Detailed Guides respectively
There are two Search API ADRs documenting the decision to move to one govuk index: ADR-04 and ADR-06. Some legacy indexes (e.g. mainstream) have been fully migrated into it, but the two legacy indexes listed above remain.
To find out which index a piece of content is saved under, use Search API’s API: see "index": "government" on this example.
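A lookup of this kind might be sketched as below. The endpoint URL and the `filter_link` query parameter are assumptions about Search API's public interface and are not taken from this page. The sample response is a hypothetical, heavily trimmed payload that keeps only the "index" field mentioned above, and both function names are invented for the example.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Assumed public Search API endpoint (not stated on this page).
SEARCH_ENDPOINT = "https://www.gov.uk/api/search.json"


def index_lookup_url(base_path: str) -> str:
    """Build a search URL restricted to a single page.

    Assumption: `filter_link` filters results by the page's base path.
    """
    return f"{SEARCH_ENDPOINT}?{urlencode({'filter_link': base_path})}"


def index_of(response: dict) -> Optional[str]:
    """Return the "index" value of the first result, or None if no results."""
    results = response.get("results", [])
    return results[0].get("index") if results else None


# Hypothetical, heavily trimmed response body for illustration only.
SAMPLE_RESPONSE = json.loads(
    '{"results": [{"link": "/example-path", "index": "government"}]}'
)

lookup_url = index_lookup_url("/example-path")
found_index = index_of(SAMPLE_RESPONSE)
print(lookup_url)
print(found_index)
```

In practice the JSON would come from fetching `lookup_url` rather than from a hard-coded sample.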