Application: govuk-knowledge-graph-gcp

GOV.UK content data and cloud infrastructure for the GovSearch app

Ownership
#data-engineering
Category
Data science

README

  • GOV.UK content data in BigQuery, for analytical workloads.
  • Cloud infrastructure for the GovSearch app.

Documentation

Most documentation is in README.md files and the docs directory of this repository. There is also the GOV.UK Data Community Technical Documentation.

Data pipeline overview

  1. A workflow subscribes to notifications from the GOV.UK S3 Mirror that a new database backup of the Publishing API is available. The workflow creates an instance of a virtual machine.
  2. The virtual machine fetches the database backup file, extracts its data, and uploads that into BigQuery.
  3. Scheduled SQL queries run daily and call other SQL routines to refresh various tables from the newly uploaded data.

Access and permissions

People are granted access by membership of Google Groups. Other Google Cloud Platform projects are granted access via service accounts. Access is granted by editing each environment’s tfvars file, such as terraform-dev/environment.auto.tfvars.

Google Groups

  • govgraph-private-data-viewers has roles/bigquery.dataViewer in relation to each BigQuery dataset except ‘test’, and roles/bigquery.jobUser to be able to run queries that are billed to the billing account of the govuk-knowledge-graph* projects.
  • govgraph-developers has the roles/owner role in relation to each govuk-knowledge-graph* project.

Tests

There are hardly any tests.

SQL

The most likely cause of an error in GovSearch queries is a change to the data and document schemas in the Publishing API.

It is difficult, in general, to test chains of SQL statements. dbt is a popular tool for doing so, but it adds considerable abstraction and requires Python, which is discouraged in GOV.UK.

A scheduled query runs every hour, and raises an error if any tables have zero rows or have not been updated in the past 25 hours. The error is automatically detected in the logs, and an alert is raised, which sends an email to the govsearch-developers Google Group. Once the problem has been addressed, close the issue.
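
The rule applied by that check can be sketched in Ruby as follows. This is only an illustration of the logic: the real check is a scheduled SQL query, and the names TableStatus, stale_or_empty, row_count and last_modified are hypothetical rather than taken from this repository.

```ruby
# Hypothetical in-memory model of the metadata for one table.
TableStatus = Struct.new(:name, :row_count, :last_modified, keyword_init: true)

# Maximum age in seconds before a table counts as stale (25 hours, as above).
MAX_AGE_SECONDS = 25 * 60 * 60

# Returns the tables that are empty or have not been updated recently enough.
def stale_or_empty(tables, now: Time.now)
  tables.select do |table|
    table.row_count.zero? || (now - table.last_modified) > MAX_AGE_SECONDS
  end
end

# Example data, invented for illustration.
now = Time.utc(2024, 1, 2, 12, 0, 0)
tables = [
  TableStatus.new(name: "fresh", row_count: 10, last_modified: now - 3600),
  TableStatus.new(name: "empty", row_count: 0, last_modified: now - 3600),
  TableStatus.new(name: "stale", row_count: 10, last_modified: now - (26 * 3600)),
]

puts stale_or_empty(tables, now: now).map(&:name).inspect
```

Passing the current time in as an argument keeps the rule deterministic, which makes it easy to unit test.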

Ruby

Two of the BigQuery Remote Functions are implemented in Ruby and have unit tests. They are parse-html and html-to-text. Other BigQuery Remote Functions are somewhat trivial.
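
For orientation, here is a minimal sketch of the shape of a BigQuery Remote Function written in Ruby. BigQuery sends a JSON body whose "calls" array holds one list of arguments per row, and expects a JSON body whose "replies" array holds one result per call, in the same order. The method names handle_remote_function and naive_html_to_text are hypothetical, and the regex-based tag stripping is an assumption made to keep the sketch dependency-free. It is not the implementation of html-to-text in this repository.

```ruby
require "json"

# Naive tag stripper, used only to keep this sketch free of dependencies.
# A nil argument (SQL NULL) becomes an empty string here.
def naive_html_to_text(html)
  html.to_s.gsub(/<[^>]*>/, " ").gsub(/\s+/, " ").strip
end

# Takes the JSON request body sent by BigQuery and returns a JSON response body.
# Each element of "calls" is the argument list of one invocation.
def handle_remote_function(request_body)
  calls = JSON.parse(request_body).fetch("calls")
  replies = calls.map { |(html)| naive_html_to_text(html) }
  JSON.generate("replies" => replies)
rescue JSON::ParserError, KeyError => e
  # On failure, report the reason in "errorMessage" instead of "replies".
  JSON.generate("errorMessage" => e.message)
end

request = JSON.generate("calls" => [["<p>Hello <b>world</b></p>"], ["<h1>Title</h1>"]])
puts handle_remote_function(request)
```

Keeping the handler as a plain method from string to string means it can be unit tested without an HTTP server. Setting the HTTP status code on errors is left to whatever framework serves the function.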

Maintainers

This project is maintained by the GOV.UK team, which is part of the Government Digital Service.

Common tasks

Import data from somewhere new

Look at https://github.com/alphagov/govuk-knowledge-graph-gcp/pull/594, which derives data from the Publisher app database and puts it into BigQuery.

Troubleshooting

Outdated or empty BigQuery tables

If GovSearch gives unexpected results, the tables in BigQuery might not have been updated correctly. Usually that means a table either hasn’t been updated at all within the last 24 hours, or has been updated and is now empty. You can quickly check every table by querying the view test.tables-metadata, with a query like SELECT * FROM test.tables-metadata;. The view is checked automatically every hour. If the check finds old or empty tables, an ‘incident’ is created and an email is sent to govgraph-developers@digital.cabinet-office.gov.uk.

Source data glitch

Check that the database backup files in the govuk-s3-mirror are the expected size (many gigabytes) by looking in the bucket.

Check that the Publishing API hasn’t changed its schemas.

Other representations of GOV.UK content

There are several different representations of GOV.UK content, including:

  • Publishing API
  • Content Store
  • Search API
  • CDN cache (content delivery network)
  • Mirror (HTML pages crawled nightly)
  • National Archives (snapshots of content over time)

None of these representations met a need for advanced searching and filtering for content designers, or a need for low-level structured data for developing data science applications. Hence the Knowledge Graph was developed.

Technical debt

See Technical debt.

Contributing

You are welcome to:

  • ask a question by opening an issue or by contacting the maintainers
  • open an issue
  • submit a pull request

Licence

Unless stated otherwise, the codebase is released under the MIT License. This covers both the codebase and any sample code in the documentation.

The documentation is © Crown copyright and available under the terms of the Open Government Licence v3.0.