Warning This document has not been updated for a while now. It may be out of date.
Last updated: 27 Jul 2023

Environment data sync

The govuk_env_sync data sync is used for backup and sync tasks in the GOV.UK AWS environments. It replaces the now archived env-sync-and-backup data sync.

What it does

In this context:

  • Push means database dump and upload.
  • Pull means download and database restore.

The govuk_env_sync data sync works by pushing a source database to S3 and then pulling it down from there to a destination. The Elasticsearch databases use a different process: they are not pushed and pulled via S3, but are copied directly from one Elasticsearch to another.

The environment synchronisation is achieved by granting cross-account access to the govuk-<environment>-database-backups S3 buckets.

  • In production, databases push to the govuk-production-database-backups S3 bucket.
  • In staging, databases are replaced by pulling data from govuk-production-database-backups.
  • In integration, databases are generally replaced by pulling data from govuk-production-database-backups. However, some databases have a data sanitisation process, which happens in staging. In those cases, databases are replaced by pulling from govuk-staging-database-backups.
  • In integration, databases are pushed to govuk-integration-database-backups nightly. These backups let developers use real data on their local machines.
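The pull rules above can be summarised as a small lookup. This is a minimal sketch only: the function name pull_source_bucket and its "sanitised" argument are hypothetical and are not part of govuk_env_sync.sh. In practice the bucket for each task is set in its configuration.

```shell
# Sketch only: pull_source_bucket and its second argument are hypothetical.
# Prints the S3 bucket that an environment pulls a database backup from.
pull_source_bucket() {
  local environment="$1"    # staging or integration
  local sanitised="${2:-no}" # "yes" if the database is sanitised in staging

  case "$environment" in
    staging)
      echo "govuk-production-database-backups"
      ;;
    integration)
      if [ "$sanitised" = "yes" ]; then
        # Sanitised databases come from the staging bucket.
        echo "govuk-staging-database-backups"
      else
        echo "govuk-production-database-backups"
      fi
      ;;
    *)
      echo "no pull source for environment: $environment" >&2
      return 1
      ;;
  esac
}
```

For example, pull_source_bucket integration yes prints govuk-staging-database-backups.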

Data sanitisation

Data sanitisation (removal of sensitive data) is done by SQL scripts in the files/transformation_sql directory, which are created under /etc/govuk_env_sync/transformation_sql/ on the target machine.

Data sanitisation is applied to the Integration environment for a number of data sources, including Publishing API and Email Alert API, as good security practice. It is not applied to Staging, which can only be accessed by those who have Production access and therefore have access to the Production equivalents anyway. Recreating the Production data on Staging allows us to test queries on real world data before we apply them in a Production environment.

How it works

The code is in the govuk_env_sync module in the govuk-puppet repository. Most of the logic is in the govuk_env_sync.sh file, provided as a Puppet file resource in the files directory. It is created as /usr/local/bin/govuk_env_sync.sh. Changes to this code are rolled out as part of the Deploy_puppet Jenkins job in each environment.

A govuk_env_sync::task resource creates a configuration file and a cron job, parameterising the govuk_env_sync.sh file with the values passed to govuk_env_sync::task. A create_resources(govuk_env_sync::task) invocation creates one govuk_env_sync::task resource for each entry under the govuk_env_sync::tasks: key in the hieradata.

The configuration for these tasks is found in the node hieradata for the db_admin machines, for example: hieradata_aws/class/integration/db_admin.yaml. All can be found via git grep govuk_env_sync::tasks within the govuk-puppet repo.

There are separate backup and restore tasks for each database in each environment, all with different start times, so it is difficult to pinpoint exactly when the govuk_env_sync data sync starts and ends. However, the data sync is generally expected to run from around 10pm until 8am. For this reason:

Traffic replay using GoReplay is disabled between 22:00 and 08:00 UTC daily whilst the data sync pull jobs take place. This is to prevent lots of errors while we are dropping databases.
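If a script needs to know whether it is running inside this window, the check could look like the following. This is a minimal sketch: in_sync_window is a hypothetical helper and is not part of govuk-puppet or the GoReplay configuration.

```shell
# Sketch only: in_sync_window is a hypothetical helper.
# Succeeds if the given hour (0-23, UTC) falls inside the 22:00-08:00 window.
# With no argument it uses the current UTC hour.
in_sync_window() {
  local hour="${1:-$(date -u +%H)}"
  # Strip one leading zero so values such as "08" are not read as octal.
  hour="${hour#0}"
  [ "$hour" -ge 22 ] || [ "$hour" -lt 8 ]
}

# Example usage:
#   if in_sync_window; then echo "data sync window: expect databases to be dropped"; fi
```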

Pausing a data sync

If you need to temporarily pause one of the data syncs, it can be done manually:

  1. On the db_admin machine for the environment where you want to pause the sync, disable Puppet so that it doesn't automatically restore the removed cron jobs:
   $ govuk_puppet --disable "paused for data sync"
  2. List the cron jobs to find the pull job for the app(s), then edit the crontab and remove the corresponding line(s):
   # list
   $ sudo crontab -lu govuk-backup
   # edit
   $ sudo crontab -u govuk-backup -e
  3. Repeat these steps in each environment where the jobs need to be paused.
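When the crontab is long, it can help to list only the pull jobs before editing. The following is a minimal sketch: list_pull_jobs is a hypothetical helper, and it assumes every entry is preceded by a "# Puppet Name: <job>" comment and that pull jobs are named with a pull_ prefix, as in the crontab example in this document.

```shell
# Sketch only: list_pull_jobs is a hypothetical helper.
# Reads crontab text on stdin and prints the names of the pull jobs,
# taken from the "# Puppet Name: ..." comment above each entry.
list_pull_jobs() {
  sed -n 's/^# Puppet Name: \(pull_.*\)$/\1/p'
}

# Example usage on a db_admin machine (not run here):
#   sudo crontab -lu govuk-backup | list_pull_jobs
```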

Resuming a data sync

To resume the jobs again you can re-enable Puppet and run it manually:

$ govuk_puppet --enable
$ govuk_puppet --test

Resources managed by Puppet

Configuration files on machines

Puppet creates a configuration file in /etc/govuk_env_sync/ for each job in hieradata govuk_env_sync::tasks:. These files consist of simple source-able variable assignments of the form:

action="pull"
dbms="postgresql"
storagebackend="s3"
temppath="/tmp/content_data_admin_production_pull"
database="content_data_admin_production"
url="govuk-production-database-backups"
path="postgresql-backend"
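Because these files are plain variable assignments, a shell script can load one with the . (source) builtin. The following is a minimal sketch of that idea: describe_task is a hypothetical helper, and it does not claim to show how govuk_env_sync.sh itself reads its configuration.

```shell
# Sketch only: describe_task is a hypothetical helper.
# Sources a task configuration file and prints a one-line summary of it.
describe_task() {
  local config_file="$1"
  (
    # Source in a subshell so the variables do not leak into the caller.
    . "$config_file" || exit 1
    echo "${action} ${database} (${dbms}) url=${url} path=${path}"
  )
}

# Example usage (not run here):
#   describe_task /etc/govuk_env_sync/pull_content_data_admin_production_daily.cfg
```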

Lock

The govuk_env_sync cron jobs prevent automated reboots by unattended-upgrades by running under /usr/local/bin/with_reboot_lock, which creates the file /etc/unattended-reboot/no-reboot/govuk_env_sync and removes it when the process exits.
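The create-then-remove behaviour can be illustrated with a shell trap. This is a minimal sketch of the pattern, not the contents of the real /usr/local/bin/with_reboot_lock: the function name run_with_lock and the LOCK_DIR variable are hypothetical, and the lock directory defaults to a scratch location so the sketch can be tried safely.

```shell
# Sketch only: NOT the real with_reboot_lock script.
# Creates a lock file, runs the given command, and removes the lock on exit.
run_with_lock() {
  # Hypothetical default; the real lock lives in /etc/unattended-reboot/no-reboot/.
  local lock_dir="${LOCK_DIR:-${TMPDIR:-/tmp}/no-reboot-demo}"
  local lock_file="$lock_dir/govuk_env_sync"
  mkdir -p "$lock_dir"
  (
    # The EXIT trap removes the lock whether the command succeeds or fails.
    trap 'rm -f "$lock_file"' EXIT
    touch "$lock_file"
    "$@"
  )
}

# Example usage:
#   run_with_lock sleep 5
```

The subshell returns the exit status of the wrapped command, so a failing job is still reported as a failure.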

Cron jobs and Icinga checks

If you get an Icinga alert about a failing task, check /var/log/syslog and /var/log/syslog.1 on the machine which runs the job. If the logs don’t help, you can try re-running the sync job.
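To narrow the logs down, you could filter them for lines that mention the data sync. This is a minimal sketch: sync_log_lines is a hypothetical helper, and it assumes the relevant lines contain the string govuk_env_sync (cron's own log entries typically do, because they record the command that was run). Reading the syslog files may require sudo.

```shell
# Sketch only: sync_log_lines is a hypothetical helper.
# Prints log lines mentioning govuk_env_sync from the given files,
# defaulting to the current and previous syslog.
sync_log_lines() {
  if [ "$#" -eq 0 ]; then
    set -- /var/log/syslog /var/log/syslog.1
  fi
  grep -h -- 'govuk_env_sync' "$@"
}

# Example usage (not run here):
#   sync_log_lines | tail -n 50
```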

The data sync operations are executed as cron jobs belonging to the govuk-backup user. Run the following command to get an overview of the jobs being run on a machine.

$ sudo crontab -lu govuk-backup

# Puppet Name: pull_content_data_admin_production_daily
18 0 * * * /usr/bin/ionice -c 2 -n 6 /usr/local/bin/with_reboot_lock /usr/bin/envdir /etc/govuk_env_sync/env.d /usr/local/bin/govuk_env_sync.sh -f /etc/govuk_env_sync/pull_content_data_admin_production_daily.cfg
...

The cron job command does the following:

  1. Runs the data sync job at low I/O priority: /usr/bin/ionice -c 2 -n 6. This only really matters when running on a database server, as opposed to a db_admin bastion host, but the command is the same in both cases.
  2. Prevents reboot by unattended-upgrades while the sync job is running: /usr/local/bin/with_reboot_lock
  3. Runs the data sync job with the appropriate configuration file: /usr/local/bin/govuk_env_sync.sh -f /etc/govuk_env_sync/pull_content_data_admin_production_daily.cfg

To re-run a given sync job, copy the part of the cron job corresponding to step 3, run it as the govuk-backup user, and examine the output for any errors:

sudo -u govuk-backup /usr/local/bin/govuk_env_sync.sh -f /etc/govuk_env_sync/pull_content_data_admin_production_daily.cfg
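Rather than copying the command by hand, you could extract it from the crontab entry. This is a minimal sketch: rerun_command is a hypothetical helper, and it assumes the entry has the same shape as the example crontab line in this document.

```shell
# Sketch only: rerun_command is a hypothetical helper.
# Reads crontab entries on stdin and prints only the govuk_env_sync.sh
# invocation, dropping the schedule and the wrapper commands.
rerun_command() {
  sed -n 's|^.*\(/usr/local/bin/govuk_env_sync\.sh -f [^ ]*\.cfg\).*$|\1|p'
}

# Example usage (not run here):
#   sudo crontab -lu govuk-backup | grep pull_content_data_admin_production_daily | rerun_command
```

The printed command can then be run with sudo -u govuk-backup, as shown above.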