Skip to main content
Warning This document has not been updated for a while now. It may be out of date.
Last updated: 7 Nov 2023

Content Data architecture

Content Data is a service that allows organisations to view performance and content metrics about GOV.UK pages they maintain, so that they can manage their content effectively over time. The data helps them decide what content to improve and prioritise.

Behind the scenes it is made up of:

Overview of the elements involved in Content Data

Content Data Admin

Screenshot of the Content Data app

This is a simple Rails app that provides the frontend service that users interact with. It has a database used solely for storing Signon users (required by the gds-sso gem), but mainly relies on metrics stored in Content Data API’s data warehouse.

Users of the app can:

  • Search for content (filter by time period, title, URL, document type, organisation)
  • View performance and content metrics for pages
  • Export data to CSV

This can be accessed at https://content-data.publishing.service.gov.uk/content/.

CSV Exporter

This is a feature that allows users to export search results and associated metrics to CSV. It uses a Sidekiq worker to generate the CSV in the background and stores it in an AWS S3 bucket. An email is then sent to the user which contains the downloadable link that is valid for 7 days. The expiry time is configured in the S3 bucket settings under Lifecycle configuration.

Overview of the CSV Exporter process

Content Data API

This is the data warehouse used by Content Data Admin where all the metrics are stored. The data is collected from various sources:

It can be directly accessed at, e.g, https://content-data-api.publishing.service.gov.uk/api/v1/metrics.

Data sources

Google Analytics (GA)

Metrics collected from GA for individual GOV.UK pages include:

  • Views and navigation
  • User feedback
  • Internal searches

This data is fed into content-data-api using the ETL process.

Support API (Feedback Explorer)

Feedback Explorer is a feature provided by the Support app, which collects information via forms on GOV.UK, for example:

  • The GOV.UK contact form
  • “Was this page useful?” form (at the bottom of every GOV.UK page)

These forms are managed and rendered by the Feedback app. When a user fills in the form on GOV.UK and submits the information, the anonymous data is then sent to Support API to be stored and processed further. Users of the Support app can then query the information in the Feedback Explorer.

Overview of Feedback Explorer

Content Data API only collects metrics on the number of feedback comments for pages on GOV.UK from Support API. This data is fed into its data warehouse using the Extract, Transform, Load (ETL) process.

Publishing API

Content Data API processes messages from the Publishing API in order to detect and track changes to content.

It stores information such as:

  • Word count
  • PDF count
  • Reading time
  • Information about editions and relationships with other documents

This data is fed into Content Data API via AmazonMQ and the Streams Processor.

How data is fed into the Content Data API warehouse

Data is fed directly into the data warehouse through two methods:

  • The Extract, Transform, Load (ETL) Processor
  • The Streams Processor

ETL Processor

Content Data API runs an ETL (Extract, Transform, Load) process daily as a rake task via a Kubernetes CronJob, which copies the data from GA and Support API into its data warehouse.

Overview of the ETL Processor

For more information see the What is the ETL process developer doc.

Streams Processor

This process is responsible for updating information about GOV.UK content from the Publishing API. Note that this data is different from the data collected from the nightly ETL process.

An overview of the process is as follows:

Overview of the Streams Processor