This page describes what to do in case of an Icinga alert. For more information, you can search the govuk-puppet repo for the source of the alert.
Last updated: 23 May 2022

content-data-api app healthcheck not ok

See also: how healthcheck alerts work on GOV.UK

If there is a health check error showing for Content Data API, click on the alert to find out more details about what’s wrong.

You can visit the healthcheck endpoints here:

  • https://content-data-api.publishing.service.gov.uk/healthcheck/metrics
  • https://content-data-api.publishing.service.gov.uk/healthcheck/search
  • https://content-data-api.publishing.service.gov.uk/healthcheck/live
  • https://content-data-api.publishing.service.gov.uk/healthcheck/ready
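For example, you can query one of these endpoints from the command line (assuming `curl` is available; a healthy app typically responds with HTTP 200):

```shell
# Inspect the liveness endpoint, including the response status line
curl -i https://content-data-api.publishing.service.gov.uk/healthcheck/live
```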

What is the ETL process

ETL stands for Extract, Transform, Load. Every day, data is copied from multiple sources (the publishing platform, user feedback, and Google Analytics) into the Content Data API warehouse.

A rake task called etl:master calls the Etl::Master::MasterProcessor which processes all the data. This rake task is run daily (see “When does ETL run” section below) so that the Content Data app has up to date figures.
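Run by hand, the invocation would look like this (a sketch, assuming the app's standard bundler setup):

```shell
# Run the full daily ETL process manually
bundle exec rake etl:master
```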

The Jenkins job that calls this rake task is content_data_api (configured here).

There is also a special ‘re-run’ task called etl:rerun_master, which takes an inclusive range of dates as arguments, and runs the same task as above but overriding the previously held data. We can run this if we have reason to believe the historical data is no longer accurate.

The Jenkins job for this rake task is content_data_api_re_run (configured here).

When does ETL run

The content_data_api job runs at a different time in each environment; in production it runs at 7am.

These jobs are spread out for rate-limiting reasons, and the production run happens outside of normal hours so as not to impact database performance during the day.

The content_data_api_re_run job runs at 3am in every environment. This job was added because of a delay in results showing up in Google Analytics (GA): results can take between 24 and 48 hours to appear. The production ETL running at 7am allowed only 7 hours for the data to appear in GA. The re-run job collects the data after 2 days, leaving time for the data to appear correctly in GA.

Troubleshooting

Below are the possible problems you may see. Note that the rake tasks should be run with content-data-api as the TARGET_APPLICATION and backend as the MACHINE_CLASS.

All dates for the rake tasks below are inclusive. In other words, if you only need to reprocess data for a specific day, you’ll need to use the same date for both the ‘from’ and ‘to’ parameters (for example: etl:repopulate_aggregations_month["2019-12-15","2019-12-15"]).
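To illustrate the inclusive range, a minimal Ruby sketch (dates are illustrative):

```ruby
require "date"

from = Date.parse("2019-12-15")
to   = Date.parse("2019-12-15")

# An inclusive range whose endpoints are equal covers exactly one day
(from..to).count # => 1

# Widening the 'to' date by two days covers three days in total
(from..(to + 2)).count # => 3
```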

ETL :: no monthly aggregations of metrics for yesterday

This means that the ETL master process that runs daily to create aggregations of the metrics has failed.

To fix this problem run the following rake task:
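For example, to rebuild aggregations for a single missing day, using the etl:repopulate_aggregations_month task shown earlier on this page (dates are illustrative):

```shell
# Rebuild monthly aggregations; 'from' and 'to' dates are inclusive
bundle exec rake 'etl:repopulate_aggregations_month["2019-12-15","2019-12-15"]'
```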

ETL :: no searches updated from yesterday

This means that the ETL process that runs daily to refresh the materialized views failed to update those views.

To fix this problem run the following rake task:
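A hedged sketch only: assuming a hypothetical task named etl:repopulate_searches exists to rebuild the search materialized views (check the Content Data API Rakefile for the real task name), the invocation would follow the inclusive from/to date convention above:

```shell
# Hypothetical task name – confirm against the Content Data API Rakefile
bundle exec rake 'etl:repopulate_searches["2019-12-15","2019-12-15"]'
```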

ETL :: no daily metrics for yesterday

This means that the ETL master process that runs daily to retrieve metrics for content items has failed.

To fix this problem, re-run the master process.
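A sketch of the re-run, using the etl:rerun_master task described earlier, and assuming it takes the same bracketed inclusive date arguments as etl:repopulate_aggregations_month (dates are illustrative):

```shell
# Re-run the master process for a single day, overriding previously held data
bundle exec rake 'etl:rerun_master["2019-12-15","2019-12-15"]'
```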

Note: this will first delete any metrics that had been successfully retrieved, before re-running the task to regather all metrics.

ETL :: no pviews for yesterday

This means that the ETL master process that runs daily has failed to collect pageview metrics from Google Analytics. The issue may originate from the ETL processor responsible for collecting core metrics.

To fix this problem run the following rake task:
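A hedged sketch, assuming a hypothetical task name etl:repopulate_pviews (check the Content Data API Rakefile for the real name) that follows the inclusive from/to date convention above:

```shell
# Hypothetical task name – confirm against the Content Data API Rakefile
bundle exec rake 'etl:repopulate_pviews["2019-12-15","2019-12-15"]'
```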

ETL :: no upviews for yesterday

This means that the ETL master process that runs daily has failed to collect unique pageview metrics from Google Analytics. The issue may originate from the ETL processor responsible for collecting core metrics.

To fix this problem run the following rake task:
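A hedged sketch, assuming a hypothetical task name etl:repopulate_upviews (check the Content Data API Rakefile for the real name) that follows the inclusive from/to date convention above:

```shell
# Hypothetical task name – confirm against the Content Data API Rakefile
bundle exec rake 'etl:repopulate_upviews["2019-12-15","2019-12-15"]'
```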

ETL :: no searches for yesterday

This means that the ETL master process that runs daily has failed to collect the number-of-searches metric from Google Analytics. The issue may originate from the ETL processor responsible for collecting Internal Searches.

To fix this problem run the following rake task:
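A hedged sketch, assuming a hypothetical task name etl:repopulate_internal_searches (check the Content Data API Rakefile for the real name) that follows the inclusive from/to date convention above:

```shell
# Hypothetical task name – confirm against the Content Data API Rakefile
bundle exec rake 'etl:repopulate_internal_searches["2019-12-15","2019-12-15"]'
```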

ETL :: no feedex for yesterday

This means that the ETL master process that runs daily has failed to collect Feedex metrics from support-api. The issue may originate from the ETL processor responsible for collecting Feedex comments.

To fix this problem run the following rake task:
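A hedged sketch, assuming a hypothetical task name etl:repopulate_feedex (check the Content Data API Rakefile for the real name) that follows the inclusive from/to date convention above:

```shell
# Hypothetical task name – confirm against the Content Data API Rakefile
bundle exec rake 'etl:repopulate_feedex["2019-12-15","2019-12-15"]'
```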

Other troubleshooting tips

For problems in the ETL process, you can check the output in Jenkins.

You can also check for any errors in Sentry or the logs in Kibana.

sidekiq_retry_size is above the warning threshold

We have an ongoing issue where occasionally a Whitehall document is not successfully updated in the database, failing with a database conflict error. In the content-data-api you may find that the first content item is (incorrectly) still considered `live: true`, and that the second content item doesn’t exist.

We are notified of this error through Sentry, as well as through a warning Icinga alert on the content-data-api healthcheck — specifically `sidekiq_retry_size - warning - 3 is above the warning threshold (1)` — because content-data-api is unable to pull in data.

To fix this and get the data successfully pulled into content-data-api you can apply this manual fix:

# Mark the stale edition as no longer live so the new edition can be written
Dimensions::Edition.where(content_id: "b6a2a286-8669-4cbe-a4ad-7997608598d2")
  .last
  .update!(live: false, schema_name: "gone", document_type: "gone")

You can get the base URL from the Sentry error and query the database for the content_id.

Then the next time the Sidekiq worker runs, it should successfully be able to add the new content item.