Application: govuk-knowledge-graph-gcp
GOV.UK content data and cloud infrastructure for the GovSearch app
- GitHub: govuk-knowledge-graph-gcp
- Ownership: #insights-and-analytics-alerts
- Category: Data engineering
README
- GOV.UK content data in BigQuery, for analytical workloads.
- Cloud infrastructure for the GovSearch app.
Documentation
Most documentation is in README.md files and the docs directory in this repository. There is also the GOV.UK Data Community Technical Documentation.
Documentation specific to the Terraform code lives alongside that code in govuk-infrastructure.
Data pipeline overview
- A workflow subscribes to notifications from the GOV.UK S3 Mirror that a new database backup of the Publishing API is available. The workflow creates an instance of a virtual machine.
- The virtual machine fetches the database backup file, extracts its data, and uploads that into BigQuery.
- Some SQL queries are scheduled to run daily, which call other SQL routines to refresh various tables from the newly uploaded data.
Google Groups
- govgraph-private-data-readers has roles/bigquery.dataViewer in relation to each BigQuery dataset except ‘test’, and roles/bigquery.jobUser to be able to run queries that are billed to the billing account of the govuk-knowledge-graph* projects.
- govgraph-developers has the roles/owner role in relation to each govuk-knowledge-graph* project.
IAM roles/Permissions required in other projects
govuk-s3-mirror
Search for govuk-knowledge-graph in https://github.com/alphagov/govuk-s3-mirror to see what permissions are granted there. That project also publishes to Pub/Sub topics in this project.
gds-bq-reporting
The service accounts that this project uses to publish logs to the gds-bq-reporting project must be given the roles/logging.bucketWriter role in that project.
Tests
There are hardly any tests.
SQL
The most likely cause of an error in GovSearch queries is a change to the data and document schemas in the Publishing API.
It is difficult, in general, to test chains of SQL statements. dbt is popular for doing so, but it adds a considerable layer of abstraction and requires Python, which is discouraged in GOV.UK.
A scheduled query runs every hour, and raises an error if any tables have zero rows or have not been updated in the past 25 hours. The error is automatically detected in the logs, and an alert is raised, which sends an email to the govsearch-developers Google Group. Once the problem has been addressed, close the issue.
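The rule applied by that hourly check can be sketched in a few lines of Ruby. This is only an illustration of the logic described above, not the scheduled query itself (which is SQL). The row shape (name, row_count, last_modified) and the function name failing_tables are assumptions made for the example.

```ruby
# Hypothetical row shape: { name:, row_count:, last_modified: }.
# These keys are assumptions for illustration, not the real view's columns.
MAX_AGE_SECONDS = 25 * 60 * 60 # the 25-hour threshold described above

# Returns the names of tables that are empty or stale as of `now`.
def failing_tables(rows, now: Time.now)
  rows.select { |row|
    row[:row_count].zero? || (now - row[:last_modified]) > MAX_AGE_SECONDS
  }.map { |row| row[:name] }
end

now = Time.utc(2024, 1, 2, 12, 0, 0)
rows = [
  { name: "fresh", row_count: 10, last_modified: now - 3600 },
  { name: "empty", row_count: 0, last_modified: now - 3600 },
  { name: "stale", row_count: 10, last_modified: now - (26 * 3600) },
]
p failing_tables(rows, now: now)
```

A non-empty result corresponds to the situation in which the scheduled query raises its error.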
Ruby
Two of the BigQuery Remote Functions are implemented in Ruby and have unit tests. They are parse-html and html-to-text. Other BigQuery Remote Functions are somewhat trivial.
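For orientation, a BigQuery Remote Function receives a JSON body containing a "calls" array (one inner array of arguments per row) and must answer with a JSON body containing a "replies" array of the same length. The sketch below shows that shape only. The tag-stripping helper naive_html_to_text is a deliberately simplistic, hypothetical stand-in and is not the implementation of html-to-text in this repository.

```ruby
require "json"

# Hypothetical, naive stand-in for html-to-text: strips tags with a regex
# and collapses whitespace. Real HTML needs a proper parser.
def naive_html_to_text(html)
  html.gsub(/<[^>]*>/, " ").gsub(/\s+/, " ").strip
end

# Takes the JSON request body sent by BigQuery and returns the JSON
# response body, with one reply per call, in the same order.
def handle_remote_function(body)
  calls = JSON.parse(body).fetch("calls")
  replies = calls.map { |args| naive_html_to_text(args.first.to_s) }
  JSON.generate({ "replies" => replies })
end

request = JSON.generate({ "calls" => [["<p>Hello <b>world</b></p>"], ["<h1>Title</h1>"]] })
puts handle_remote_function(request)
```

Keeping the handler a pure function from request body to response body is one way to make such functions easy to unit test without any HTTP server.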
Maintainers
This project is maintained by the GOV.UK Insights and Analytics team, which is part of the Government Digital Service. They can be reached on Slack - #govuk-insights-and-analytics-team.
Common tasks
Import data from somewhere new
Look at https://github.com/alphagov/govuk-knowledge-graph-gcp/pull/594, which derives data from the Publisher app database and puts it into BigQuery.
Troubleshooting
Outdated or empty BigQuery tables
If GovSearch gives unexpected results, then the tables in BigQuery might not have been updated correctly. Usually that means a table either hasn’t been updated at all within the last 24 hours, or it has been updated and is now empty. You can quickly check every table by querying a view called test.tables-metadata, with a query like SELECT * FROM test.tables-metadata;. The view is checked automatically every hour, and if the check finds old or empty tables then an ‘incident’ is created, and an email is sent to govgraph-developers@digital.cabinet-office.gov.uk.
Source data glitch
Check that the database backup files in the govuk-s3-mirror are the expected size (many gigabytes) by looking in the bucket.
Check that the Publishing API hasn’t changed its schemas.
Other representations of GOV.UK content
There are several different representations of GOV.UK content, including:
- Publishing API
- Content Store
- Search API
- CDN cache (content delivery network)
- Mirror (HTML pages crawled nightly)
- National Archives (snapshots of content over time)
None of these representations met a need for advanced searching and filtering for content designers, or a need for low-level structured data for developing data science applications. Hence the Knowledge Graph was developed.
Technical debt
See Technical debt.
Contributing
You are welcome to:
- ask a question or suggest a change by contacting the maintainers.
- submit a pull request.
Deployment
Docker
Any changes to code in directories under the docker and src folders are deployed to all environments by GitHub Actions CI on merging to main. A few key things to be aware of:
- Each folder corresponds to an image
- The name of the folder will be the name of the image
- The deployment pipeline assumes the image is to be deployed as a Cloud Run function. If that is not the case (e.g. the image is pulled at runtime by a VM) then you will have to exclude the image in the ‘Cloud Run deploy’ step explicitly in the build-push-deploy.yml workflow.
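The folder-to-image convention in the list above can be illustrated with a short Ruby sketch. It is not the workflow's actual logic (that lives in build-push-deploy.yml). The function name cloud_run_images and the folder names are hypothetical.

```ruby
# Hypothetical helper: the image name is the folder's own name, and any
# image listed in `excluded` is skipped for the Cloud Run deploy step.
def cloud_run_images(folders, excluded: [])
  folders
    .map { |path| File.basename(path) }
    .reject { |name| excluded.include?(name) }
end

# Example folder names are invented for illustration.
folders = ["docker/example-vm-image", "src/example-function"]
p cloud_run_images(folders, excluded: ["example-vm-image"])
```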
Testing changes in dev environment
You can deploy your code changes into the govuk-knowledge-graph-dev GCP project. To test changes to source code or docker images in the development environment:
- Create a feature branch, make your changes and push to GitHub
- Go to Actions and find the “Manual Dev-Only Build” workflow (manual-dev-build.yml)
- Select “Run workflow” and choose your branch from the dropdown. Click “Run workflow”
Once the workflow has completed, a new image will be available in the Artifact Registry, tagged with the name of your branch.
Licence
Unless stated otherwise, the codebase is released under the MIT License. This covers both the codebase and any sample code in the documentation.
The documentation is © Crown copyright and available under the terms of the Open Government Licence v3.0.