# Query CDN logs
Previously, the log files were sent via syslog and were available in real time in the `/var/log/cdn` directory of the monitoring server. Now that we have Athena, these were removed due to low use, but the behaviour can be restored if necessary.
Amazon Athena provides the means to query the logs using SQL syntax. The logs appear in a database named `fastly_logs`, and there are 3 logs from the CDN which are available as tables:

- `bouncer` - legacy government websites that are now redirected to GOV.UK
- `govuk_www` - content served from www.gov.uk, mainly HTML pages with atom feeds and web services
- `govuk_assets` - assets served via the CDN (see the asset queries below)
Athena is available through the AWS control panel and is only useful in the govuk_infrastructure_production account (as we barely use the ones for integration and staging). To access it, log into AWS, navigate to Athena and select the `fastly_logs` database.
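To confirm what is available before writing a query, you can list the tables in the database and the partitions on a table from the Athena query editor. This is a quick sketch using standard Athena statements:

```sql
-- List the tables available in the fastly_logs database
SHOW TABLES IN fastly_logs;

-- List the date partitions that exist for the www logs
SHOW PARTITIONS fastly_logs.govuk_www;
```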
A basic query could be:

```sql
SELECT *
FROM fastly_logs.govuk_www
WHERE date = 20 AND month = 8 AND year = 2018
ORDER BY request_received DESC
LIMIT 10
```
Take note of the use of a date to restrict the data: this utilises a data partition so that the query only traverses a subset of the data.
## Always query with a partition
Unless you have a good reason to do otherwise, all queries to the CDN logs should be restricted to just the dates that you care about (see the example query after this list). Doing this makes the queries:
- substantially faster - far less data needs to be traversed and there is a lower chance of a query timeout
- substantially cheaper - the cost of using Athena is based on the amount of data that is traversed
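For example, a query that needs to cover more than one day can still stay within partitions by listing each date explicitly. This is only a sketch; the partition columns (`date`, `month`, `year`) follow the same pattern as the example queries below:

```sql
-- Count responses by status across 20-21 August 2018,
-- touching only the two partitions for those dates
SELECT status, COUNT(*) AS count
FROM fastly_logs.govuk_www
WHERE year = 2018
  AND month = 8
  AND date IN (20, 21)
GROUP BY status
ORDER BY count DESC
```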
This is a selection of queries put together to show some of the ways to query the CDN logs. You are encouraged to add to this list if you write useful queries for particular scenarios.
## How many errors were served to users in a timeframe
```sql
SELECT status, COUNT(*) AS count
FROM fastly_logs.govuk_www
WHERE status >= 500 AND status <= 599
  AND request_received >= TIMESTAMP '2018-08-20 08:00'
  AND request_received < TIMESTAMP '2018-08-20 12:00'
  AND date = 20 AND month = 8 AND year = 2018
GROUP BY status
ORDER BY count DESC
```

To see which URLs returned a particular error status (here a 500) most often on that day:

```sql
SELECT url, COUNT(*) AS count
FROM fastly_logs.govuk_www
WHERE status = 500
  AND date = 20 AND month = 8 AND year = 2018
GROUP BY url
ORDER BY count DESC
```
## Which GOV.UK pages changed from a 200 to a 410 status code
```sql
SELECT response_200.url,
       MAX(response_200.request_received) AS last_200_response,
       MIN(response_non_200.request_received) AS first_non_200_response
FROM fastly_logs.govuk_www AS response_200
INNER JOIN fastly_logs.govuk_www AS response_non_200
  ON response_200.url = response_non_200.url
  AND response_non_200.method = response_200.method
  AND response_non_200.status = 410
  AND response_non_200.request_received > response_200.request_received
  AND response_non_200.date = response_200.date
  AND response_non_200.month = response_200.month
  AND response_non_200.year = response_200.year
WHERE response_200.status = 200
  AND response_200.date = 21
  AND response_200.month = 8
  AND response_200.year = 2018
GROUP BY response_200.url
LIMIT 100;
```
## Finding out how frequently a GOV.UK URL is accessed
```sql
SELECT date_trunc('hour', request_received) AS hour,
       COUNT(*) AS count
FROM fastly_logs.govuk_www
WHERE url = '/vat-rates'
  AND date = 20 AND month = 8 AND year = 2018
GROUP BY date_trunc('hour', request_received)
ORDER BY date_trunc('hour', request_received) ASC;
```
## Which assets are being served most frequently
```sql
SELECT url, COUNT(*) AS count
FROM fastly_logs.govuk_assets
WHERE date = 20 AND month = 8 AND year = 2018
GROUP BY url
ORDER BY count DESC
LIMIT 50
```
## What are the largest assets that were served
```sql
SELECT url, bytes, COUNT(*) AS served
FROM fastly_logs.govuk_assets
WHERE date = 20 AND month = 8 AND year = 2018
GROUP BY url, bytes
ORDER BY bytes DESC
LIMIT 50
```
## Which user agents and hosts is bouncer serving to the most
```sql
SELECT user_agent, host, COUNT(*) AS count
FROM fastly_logs.bouncer
WHERE date = 20 AND month = 8 AND year = 2018
GROUP BY user_agent, host
ORDER BY count DESC
LIMIT 50
```
## Adding a new field to the CDN logs
Adding a new field to the CDN logs is a half manual, half automated process and is tracked as tech debt.
You should do this in the Integration environment first and wait until
you see the logs coming through to Athena from Fastly. Only then
should you change Staging and then Production. This helps us catch
syntax errors and incidents early. Note that the
bouncer logs are
only available in Production.
Firstly, make the Terraform changes via a PR. This ensures that you’ve had two pairs of eyes on both the Athena/AWS Glue config changes and the eventual JSON you will manually put into the Fastly web UI:
- Edit the [
- Find the `aws_glue_catalog_table` resource for the Fastly logs you want to add a column to.
- Copy the VCL for the log from Fastly’s list of available logs and add it to the list of commented (`//`) columns. It’s important that these are in the same format as the preceding commented lines (that is, JSON, with quotes), otherwise Athena won’t parse the logs correctly.
- Add your new column name, type and description to the JSON below the commented columns.
- Raise a PR, get a review, merge and deploy.
The manual steps to make Fastly send the log data:
- Log in to Fastly using the credentials from `fastly/deployment_shared_credentials` in the 2ndline password store.
- Search for the environment, for example “staging”.
- On the environment page, click “Real Time Stats”.
- Click “Configure” on the page with the graphs.
- Click the blue “Edit Configuration” button on the far right to reveal a dropdown menu.
- Click “Clone version xxx (active) to edit”.
- Click “Logging” in the left hand menu.
- Find and click the link for the relevant log, for example “GOV.UK Fastly Logs S3 Bucket”.
- Paste the new log format (found in the list of available Fastly logs) into the “Log Format” text box.
- Click “Update”.
- Click the purple “ACTIVATE” button in the top right corner to make the new configuration live.
Both these sets of steps must be done! Check the S3 bucket and query Athena to see the added column and confirm that there’s data for it. As the crawler runs on a schedule every four hours, it can take a while for Athena to recognise that there have been changes to the configuration. Due to this, it’s advisable to manually run the AWS Glue Crawler for the logs config you have changed once you know that Fastly is correct.
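For example, once the crawler has run you could confirm the column is populated with a query along these lines. This is only a sketch; `new_field` is a placeholder for whichever column you added:

```sql
-- 'new_field' is a placeholder for the column you have just added
SELECT new_field, COUNT(*) AS count
FROM fastly_logs.govuk_www
WHERE date = 20 AND month = 8 AND year = 2018
  AND new_field IS NOT NULL
GROUP BY new_field
ORDER BY count DESC
LIMIT 10
```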
Once you’re happy that the Integration configuration works, you can deploy Terraform to Staging and make the same manual changes in the Fastly UI. Then do Production.
## HIVE_CURSOR_ERROR: Row is not a valid JSON Object
This error indicates that a row in the logs is invalid JSON. It can be very difficult to determine which file this error came from. It could either be caused by a change in the formatting of data sent from Fastly - which would make every record after a timestamp invalid - or by something unexpected in the input which isn’t properly escaped.
There is a tedious process that can be used to identify where there is a
problem in a particular JSON file. You can sync all the log files from a day
to your local machine and then parse each JSON blob in each of the files with a
JSON tool (such as `JSON.parse` in Ruby) until you find the problem. Once identified, you may need to contact Fastly about the problem or update the log formatting in Fastly to resolve the issue.