Email Alert API analytics
Email Alert API currently has few automated means of generating analytics and data insights. This document is a guide to producing that information manually.
How data is organised
Most of the data in this application is stored in the database in perpetuity, so in most cases querying the database through the Rails console or through PostgreSQL is appropriate. However, because this application creates such a large volume of emails, these are instead archived with minimal data in Amazon S3, where they can be queried with Amazon Athena.
Once an email has been sent, or will no longer be sent (for example after a permanent failure), the record is archived in a reduced form and sent to Amazon S3 in hourly batches. After 14 days the email is removed from the database. Thus for recent information the emails database table can be used; otherwise the email archive should be queried.
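As an illustration of this split, deciding where to look for a given email could be sketched as below. This is a sketch only: the 14-day retention window comes from this document, but the helper name and return values are hypothetical, not part of the application.

```ruby
require "date"

# Emails older than the retention window have been removed from the
# database and must be looked up in the S3 archive via Athena.
RETENTION_DAYS = 14

# Hypothetical helper: pick a query target based on when the email
# finished sending.
def query_target(email_finished_on, today: Date.today)
  if today - email_finished_on <= RETENTION_DAYS
    :emails_table   # still present in the database
  else
    :email_archive  # archived to S3, query via Athena
  end
end

query_target(Date.new(2018, 7, 4), today: Date.new(2018, 7, 10))
# => :emails_table
```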
Analytics through Rails console
A large number of analytic insights can be discovered by working with the database schema and querying the data. The intention is that, as useful common queries are identified, they will be made accessible via API endpoints or rake tasks.
Here are some example queries to pull out particular insights.
How many subscribers does a list have at a particular point in time
```ruby
Subscription.where(
  "created_at <= ? AND (ended_at IS NULL OR ended_at > ?)",
  "2018-07-01",
  "2018-07-01",
).where(subscriber_list_id: 4194).count
```
Lists with most new subscriptions in a time frame
```ruby
pp Subscription.where(
  "created_at BETWEEN ? AND ?",
  "2018-06-01",
  "2018-07-01",
).group(:subscriber_list_id)
  .having("count(*) > 10")
  .order(count: :desc)
  .pluck(:subscriber_list_id, Arel.sql("COUNT(*) AS count"))
```
Lists with most ended subscriptions in a time frame
```ruby
pp Subscription.where(
  "ended_at BETWEEN ? AND ?",
  "2018-06-01",
  "2018-07-01",
).group(:subscriber_list_id)
  .having("count(*) > 10")
  .order(count: :desc)
  .pluck(:subscriber_list_id, Arel.sql("COUNT(*) AS count"))
```
How many subscribers were emails generated for in digest runs
```ruby
pp DigestRun.where(
  "date >= ? AND date <= ?",
  "2018-06-04",
  "2018-06-10",
).where(range: :daily)
  .order(date: :asc)
  .pluck(:date, :subscriber_count)
```
Which hours are content_changes being created at
```ruby
pp ContentChange.group("EXTRACT(HOUR FROM created_at)").count
```
Which days of the week have the most content changes
```ruby
pp ContentChange.group("to_char(created_at, 'Day')").count
```
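Outside the Rails console, the grouping these two queries perform can be sketched in plain Ruby. This is purely illustrative: the timestamps below are made up, and the real queries do the grouping in PostgreSQL rather than in Ruby.

```ruby
require "time"

# Stand-in for ContentChange.group("EXTRACT(HOUR FROM created_at)").count:
# group a collection of created_at timestamps by hour of day and count them.
created_ats = [
  Time.parse("2018-06-01 09:15:00"),
  Time.parse("2018-06-01 09:45:00"),
  Time.parse("2018-06-02 17:30:00"),
]

counts_by_hour = created_ats.group_by(&:hour).transform_values(&:count)
# counts_by_hour => {9 => 2, 17 => 1}
```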
How many content changes per subscriber list in a timeframe
```ruby
pp MatchedContentChange.joins(:content_change).joins(:subscriber_list)
  .where("content_changes.created_at BETWEEN ? AND ?", "2018-06-01", "2018-07-01")
  .group(:subscriber_list_id, "subscriber_lists.title")
  .order("COUNT(*) DESC")
  .limit(10)
  .pluck(:subscriber_list_id, "subscriber_lists.title", Arel.sql("COUNT(*)"))
```
Analytics through Athena
Athena is accessible through the AWS control panel, which can be accessed by following the instructions provided in the developer docs. To access the production data you will need to use the govuk-infrastructure-production account; once there you can head to Athena and select the email_alert_api_archive database.
The data stored in Athena is: the email id; when the sending process completed; whether the email sent; the subject of the email; the associations to content changes and digests; and timestamps for when the email was created and archived. The data is arranged in partitions by the date the email finished sending, and it is vastly cheaper and faster to query with partitions.
Always query with partitions
You should always query with a WHERE condition that defines the partitions to be used in your result set, e.g. `WHERE year=2018 AND month=7 AND date=4`, unless you are sure you need a wider date range.
The data is stored in directories which separate it by year, month and date values. By applying a partition condition to the query, such as `WHERE year=2018 AND month=7 AND date=4`, you reduce the data traversed by the query to just the files from that single day, which naturally makes the query perform substantially quicker.
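The directory layout can be sketched as a small helper that builds the partition prefix for a given day. The exact bucket name and base path here are assumptions for illustration; only the `year=`/`month=`/`date=` partition columns come from this document.

```ruby
require "date"

# Build an S3 key prefix for a single day's partition, following the
# Hive-style year=/month=/date= layout described above. The
# "email-archive" base path is hypothetical.
def partition_prefix(day)
  format("email-archive/year=%d/month=%d/date=%d", day.year, day.month, day.day)
end

partition_prefix(Date.new(2018, 7, 4))
# => "email-archive/year=2018/month=7/date=4"
```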
Each query against Athena has a monetary cost (at the time of writing, $5 per TB of data scanned), and by using partitions you massively reduce the amount of data that needs to be scanned.
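For a rough sense of the saving, the cost of a scan at the quoted $5 per TB can be computed directly. This is a sketch: the per-day byte counts below are made-up figures for illustration, not measurements of the real archive.

```ruby
# Athena pricing at the time of writing: $5 per TB of data scanned.
COST_PER_TB_USD = 5.0
BYTES_PER_TB = 1024.0**4

def scan_cost_usd(bytes_scanned)
  (bytes_scanned / BYTES_PER_TB) * COST_PER_TB_USD
end

# Assuming (hypothetically) ~200 MiB of archive per day, compare scanning
# a whole year against a single day's partition:
scan_cost_usd(365 * 200 * 1024**2) # whole year  => ~$0.35
scan_cost_usd(200 * 1024**2)       # one partition => ~$0.001
```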
Look up emails sent/failed on a particular day
```sql
SELECT sent, count(*) AS count
FROM "email_alert_api_archive"."email_archive"
WHERE year = 2018 AND month = 7 AND date = 4
GROUP BY sent;
```
Look up an email by id
```sql
SELECT *
FROM "email_alert_api_archive"."email_archive"
WHERE id = 'fc294e14-3b09-4869-ab07-a5c72ed04a01'
  AND year = 2018 AND month = 7 AND date = 5;
```
Look up emails associated with a content change
```sql
SELECT *
FROM "email_alert_api_archive"."email_archive"
WHERE CONTAINS(content_change.content_change_ids, 'bfd76384-1a1d-4da8-bc65-a79d9cb270d6')
  AND year = 2018 AND month = 7
LIMIT 10;
```
Count emails a subscriber was sent per subscription for a time period
```sql
SELECT content_change.subscription_ids, COUNT(*) AS count
FROM "email_alert_api_archive"."email_archive"
WHERE subscriber_id = 4840
  AND content_change.digest_run_id IS NULL
  AND sent = true
  AND year = 2018
GROUP BY content_change.subscription_ids
ORDER BY count DESC
LIMIT 10;