Application: govuk-knowledge-graph-gcp
GOV.UK content data and cloud infrastructure for the GovSearch app
- GitHub: govuk-knowledge-graph-gcp
- Ownership: #data-engineering
- Category: Data science
- Links
README
- GOV.UK content data in BigQuery, for analytical workloads.
- Cloud infrastructure for the GovSearch app.
Documentation
Most documentation is in the README.md files and the docs directory in this repository. There is also the GOV.UK Data Community Technical Documentation.
Data pipeline overview
- A workflow subscribes to notifications from the GOV.UK S3 Mirror that a new database backup of the Publishing API is available. The workflow creates an instance of a virtual machine.
- The virtual machine fetches the database backup file, extracts its data, and uploads that into BigQuery.
- Some SQL queries are scheduled to run daily, which call other SQL routines to refresh various tables from the newly uploaded data.
Access and permissions
People are granted access by membership of Google Groups. Other Google Cloud Platform projects are granted access via service accounts. Access is granted by editing each environment’s tfvars file, such as terraform-dev/environment.auto.tfvars.
Google Groups
- govgraph-private-data-viewers has roles/bigquery.dataViewer in relation to each BigQuery dataset except ‘test’, and roles/bigquery.jobUser to be able to run queries that are billed to the billing account of the govuk-knowledge-graph* projects.
- govgraph-developers has the roles/owner role in relation to each govuk-knowledge-graph* project.
Tests
There are hardly any tests.
SQL
The most likely cause of an error in GovSearch queries is a change to the data and document schemas in the Publishing API.
It is difficult, in general, to test chains of SQL statements. A popular tool for doing so is dbt, but it adds a considerable layer of abstraction and requires Python, which is discouraged in GOV.UK.
A scheduled query runs every hour, and raises an error if any tables have zero rows or have not been updated in the past 25 hours. The error is automatically detected in the logs, and an alert is raised, which sends an email to the govsearch-developers Google Group. Once the problem has been addressed, close the issue.
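The rule that the scheduled query applies can be illustrated outside SQL. The following is a minimal Ruby sketch of the same logic, not the project's actual query: the hash keys (:table, :row_count, :last_modified) and the table names are hypothetical, and only the 25-hour threshold comes from the description above.

```ruby
# Threshold from the prose above: a table is stale if it has not been
# updated in the past 25 hours.
MAX_AGE_SECONDS = 25 * 60 * 60

# rows is an Array of Hashes with hypothetical keys:
#   :table         => String name of the table
#   :row_count     => Integer number of rows
#   :last_modified => Time of the most recent update
# Returns one Hash per unhealthy table, listing why it was flagged.
def unhealthy_tables(rows, now: Time.now.utc)
  rows.filter_map do |row|
    reasons = []
    reasons << "empty" if row[:row_count].zero?
    reasons << "stale" if now - row[:last_modified] > MAX_AGE_SECONDS
    { table: row[:table], reasons: reasons } unless reasons.empty?
  end
end

# Example with made-up table names.
now = Time.utc(2024, 1, 2, 12, 0, 0)
rows = [
  { table: "example.fresh", row_count: 10, last_modified: now - 3600 },
  { table: "example.empty", row_count: 0, last_modified: now - 3600 },
  { table: "example.stale", row_count: 5, last_modified: now - (26 * 3600) }
]
flagged = unhealthy_tables(rows, now: now)
```

Passing the current time as a keyword argument keeps the check deterministic, which makes this kind of rule straightforward to unit test.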
Ruby
Two of the BigQuery Remote Functions are implemented in Ruby and have unit tests. They are parse-html and html-to-text. Other BigQuery Remote Functions are somewhat trivial.
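For readers unfamiliar with BigQuery Remote Functions, the sketch below shows the general request and response shape that such a function handles: BigQuery sends a JSON body with a "calls" array (one inner array of arguments per row) and expects a JSON body with a "replies" array of the same length, or an "errorMessage". This is not the repository's implementation. The method names are hypothetical, and the regex-based tag stripping is a deliberately naive stand-in for whatever HTML parsing the real html-to-text function performs.

```ruby
require "json"

# Naive stand-in for HTML-to-text conversion (assumption: the real function
# is likely to use a proper HTML parser rather than a regular expression).
def naive_html_to_text(html)
  html.gsub(/<[^>]*>/, " ").gsub(/\s+/, " ").strip
end

# Takes the JSON request body sent by BigQuery and returns a JSON response
# body. Each element of "calls" is the argument list for one row; the reply
# at the same index is the result for that row. NULL inputs stay NULL.
def handle_remote_function_request(body)
  calls = JSON.parse(body).fetch("calls")
  replies = calls.map do |args|
    html = args.first
    html.nil? ? nil : naive_html_to_text(html)
  end
  JSON.generate("replies" => replies)
rescue JSON::ParserError, KeyError => e
  # A malformed request is reported back through "errorMessage".
  JSON.generate("errorMessage" => e.message)
end
```

Keeping the envelope handling separate from the conversion itself means each part can be unit tested without an HTTP server.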
Maintainers
This project is maintained by the GOV.UK team, which is part of the Government Digital Service.
Common tasks
Import data from somewhere new
Look at https://github.com/alphagov/govuk-knowledge-graph-gcp/pull/594, which derives data from the Publisher app database and puts it into BigQuery.
Troubleshooting
Outdated or empty BigQuery tables
If GovSearch gives unexpected results, then the tables in BigQuery might not have been updated correctly. Usually that means a table either hasn’t been updated at all within the last 24 hours, or it has been updated and is now empty. You can quickly check every table by querying a view called test.tables-metadata, with a query like SELECT * FROM test.tables-metadata;. The view is checked automatically every hour, and if the check finds old or empty tables then an ‘incident’ is created, and an email is sent to govgraph-developers@digital.cabinet-office.gov.uk.
Source data glitch
Check that the database backup files in the govuk-s3-mirror are the expected size (many gigabytes) by looking in the bucket.
Check that the Publishing API hasn’t changed its schemas.
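The size check on the backup files can also be expressed as a small script. The following Ruby sketch assumes you already have a listing of file names and sizes (for example, copied from the bucket view). The hash keys and the 1 GiB floor are hypothetical: the text above only says "many gigabytes", so adjust the threshold to whatever size is normal for the backups.

```ruby
# Illustrative floor only; not a documented value.
MIN_BACKUP_BYTES = 1024**3

# files is an Array of Hashes with hypothetical keys :name and :size_bytes.
# Returns the names of files that look too small to be a complete backup.
def suspiciously_small_backups(files, min_bytes: MIN_BACKUP_BYTES)
  files.select { |file| file[:size_bytes] < min_bytes }
       .map { |file| file[:name] }
end
```

An empty result suggests the backups are at least plausibly sized. It does not confirm that their contents are valid.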
Other representations of GOV.UK content
There are several different representations of GOV.UK content, including:
- Publishing API
- Content Store
- Search API
- CDN cache (content delivery network)
- Mirror (HTML pages crawled nightly)
- National Archives (snapshots of content over time)
None of these representations met a need for advanced searching and filtering for content designers, or a need for low-level structured data for developing data science applications. Hence the Knowledge Graph was developed.
Technical debt
See Technical debt.
Contributing
You are welcome to:
- ask a question by opening an issue or by contacting the maintainers
- open an issue
- submit a pull request
Licence
Unless stated otherwise, the codebase is released under the MIT License. This covers both the codebase and any sample code in the documentation.
The documentation is © Crown copyright and available under the terms of the Open Government Licence v3.0.