# content-data-api app healthcheck not ok
See also: how healthcheck alerts work on GOV.UK
If there is a health check error showing for Content Data API, click on the alert to find out more details about what’s wrong.
You can visit the healthcheck endpoints here:
- https://content-data-api.publishing.service.gov.uk/healthcheck/metrics
- https://content-data-api.publishing.service.gov.uk/healthcheck/search
- https://content-data-api.publishing.service.gov.uk/healthcheck/live
- https://content-data-api.publishing.service.gov.uk/healthcheck/ready
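You can also check an endpoint from the command line, for example (a minimal sketch; `/healthcheck/ready` stands in for any of the endpoints above):

```sh
curl -s https://content-data-api.publishing.service.gov.uk/healthcheck/ready
```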
## What is the ETL process?
ETL stands for Extract, Transform, Load. Every day, data is copied from multiple sources (the publishing platform, user feedback, and Google Analytics) into the Content Data API warehouse.
A rake task called `etl:master` calls the `Etl::Master::MasterProcessor`, which processes all the data. This rake task is run daily (see the "When does ETL run?" section below) so that the Content Data app has up-to-date figures.
The Jenkins job that calls this rake task is `content_data_api` (configured here).
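If you need to run the master process by hand, the invocation is the standard rake form (a sketch, assuming a shell on a backend machine with the app's environment loaded):

```sh
bundle exec rake etl:master
```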
There is also a special 're-run' task called `etl:rerun_master`, which takes an inclusive range of dates as arguments and runs the same task as above, overriding the previously held data. We can run this if we have reason to believe the historical data is no longer accurate. The Jenkins job for this rake task is `content_data_api_re_run` (configured here).
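For example, to rebuild the data for 15 to 17 December 2019 inclusive (a sketch; the dates are hypothetical, and the argument format is assumed to match the `repopulate` examples later in this document):

```sh
bundle exec rake 'etl:rerun_master["2019-12-15","2019-12-17"]'
```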
## When does ETL run?
The `content_data_api` job runs at a different time in each environment. The jobs are spread out for rate-limiting reasons, and the production run is scheduled outside of normal hours so as not to impact database performance during the day.
The `content_data_api_re_run` job runs at 3am in every environment. This job was added because of a delay in results showing up in Google Analytics: results can take between 24 and 48 hours to appear in GA. The production ETL running at 7am allowed only 7 hours for the data to appear in GA. The re-run job collects the data after 2 days, leaving time for the data to appear correctly in GA.
## Troubleshooting
Below are the possible problems you may see. Note that the rake tasks should be run with the `content-data-api` TARGET_APPLICATION and the `backend` MACHINE_CLASS.
All dates for the rake tasks below are inclusive. In other words, if you only need to reprocess data for a specific day, use the same date for both the 'from' and 'to' parameters (for example: `etl:repopulate_aggregations_month["2019-12-15","2019-12-15"]`).
### ETL :: no monthly aggregations of metrics for yesterday
This means that the daily ETL master process that creates aggregations of the metrics has failed.
To fix this problem, run the following rake task:
- Run `etl:repopulate_aggregations_month[YYYY-MM-DD,YYYY-MM-DD]` on Integration
- Run `etl:repopulate_aggregations_month[YYYY-MM-DD,YYYY-MM-DD]` on Staging
- ⚠️ Run `etl:repopulate_aggregations_month[YYYY-MM-DD,YYYY-MM-DD]` on Production ⚠️
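For example, to reprocess 15 December 2019 only (a minimal sketch, assuming the task is run with `bundle exec rake`; the other dated `etl:repopulate_*` tasks below follow the same pattern):

```sh
bundle exec rake 'etl:repopulate_aggregations_month["2019-12-15","2019-12-15"]'
```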
### ETL :: no searches updated from yesterday
This means that the daily ETL process that refreshes the materialized views has failed to update those views.
To fix this problem, run the following rake task:
- Run `etl:repopulate_aggregations_search` on Integration
- Run `etl:repopulate_aggregations_search` on Staging
- ⚠️ Run `etl:repopulate_aggregations_search` on Production ⚠️
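This task takes no arguments, so the invocation is simply (same assumptions as above):

```sh
bundle exec rake etl:repopulate_aggregations_search
```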
### ETL :: no daily metrics for yesterday
This means that the ETL master process that runs daily to retrieve metrics for content items has failed.
To fix this problem, re-run the master process.
Note: this will first delete any metrics that had been successfully retrieved, before re-running the task to regather all metrics.
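One way to do this may be the `etl:rerun_master` task described above, with the same date for both arguments (a sketch; 2019-12-15 is a hypothetical stand-in for yesterday's date):

```sh
bundle exec rake 'etl:rerun_master["2019-12-15","2019-12-15"]'
```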
### ETL :: no pviews for yesterday
This means the ETL master process that runs daily has failed to collect pageview metrics from Google Analytics. The issue may originate from the ETL processor responsible for collecting core metrics.
To fix this problem, run the following rake task:
- Run `etl:repopulateviews[YYYY-MM-DD,YYYY-MM-DD]` on Integration
- Run `etl:repopulateviews[YYYY-MM-DD,YYYY-MM-DD]` on Staging
- ⚠️ Run `etl:repopulateviews[YYYY-MM-DD,YYYY-MM-DD]` on Production ⚠️
### ETL :: no upviews for yesterday
This means the ETL master process that runs daily has failed to collect unique pageview metrics from Google Analytics. The issue may originate from the ETL processor responsible for collecting core metrics.
To fix this problem, run the following rake task:
- Run `etl:repopulateviews[YYYY-MM-DD,YYYY-MM-DD]` on Integration
- Run `etl:repopulateviews[YYYY-MM-DD,YYYY-MM-DD]` on Staging
- ⚠️ Run `etl:repopulateviews[YYYY-MM-DD,YYYY-MM-DD]` on Production ⚠️
### ETL :: no searches for yesterday
This means the ETL master process that runs daily has failed to collect the number-of-searches metric from Google Analytics. The issue may originate from the ETL processor responsible for collecting Internal Searches.
To fix this problem, run the following rake task:
- Run `etl:repopulate_searches[YYYY-MM-DD,YYYY-MM-DD]` on Integration
- Run `etl:repopulate_searches[YYYY-MM-DD,YYYY-MM-DD]` on Staging
- ⚠️ Run `etl:repopulate_searches[YYYY-MM-DD,YYYY-MM-DD]` on Production ⚠️
### ETL :: no feedex for yesterday
This means the ETL master process that runs daily has failed to collect Feedex metrics from `support-api`. The issue may originate from the ETL processor responsible for collecting Feedex comments.
To fix this problem, run the following rake task:
- Run `etl:repopulate_feedex[YYYY-MM-DD,YYYY-MM-DD]` on Integration
- Run `etl:repopulate_feedex[YYYY-MM-DD,YYYY-MM-DD]` on Staging
- ⚠️ Run `etl:repopulate_feedex[YYYY-MM-DD,YYYY-MM-DD]` on Production ⚠️
### Other troubleshooting tips
For problems in the ETL process, you can check the output in Jenkins.
You can also check for any errors in Sentry, or the logs in Kibana.
## sidekiq_retry_size is above the warning threshold
We have an ongoing issue where occasionally a Whitehall document is not successfully updated in the database, returning a database conflict error. In the content-data-api you may find that the first content item is (incorrectly) still considered `live: true`, and that the second content item doesn't exist.
We are notified of this error through Sentry, as well as through a warning Icinga alert on the content-data-api healthcheck, specifically that `sidekiq_retry_size - warning - 3 is above the warning threshold (1)`, because content-data-api is unable to pull in data.
To fix this and get the data successfully pulled into content-data-api, you can apply this manual fix:

```ruby
# Mark the stale edition for the affected content item as no longer live,
# so the next Sidekiq run can add the replacement content item cleanly.
Dimensions::Edition.where(content_id: "b6a2a286-8669-4cbe-a4ad-7997608598d2")
  .last
  .update!(live: false, schema_name: "gone", document_type: "gone")
```
You can get the base URL from the Sentry error and query the database for the `content_id`.
Then the next time the Sidekiq worker runs, it should successfully be able to add the new content item.