# Extract data from Google Analytics and process it for use by site search
This code extracts analytics data from Google Analytics and processes it so that it can be used by site search on GOV.UK to improve search result quality.
For developing and testing locally, it is recommended that you use the same Python version as the one set in `nightly-run.sh`; currently this is 3.8.13.
If you do not already have this version available, the easiest way to get it is with `pyenv`. `pyenv` should be familiar to anyone who has used `rbenv`; in fact it is a clone of `rbenv`, so it shares many of the same commands and concepts.
Once `pyenv` is installed and the above version is installed (`pyenv install 3.8.13`), you should be able to `cd` into the root of the project, where `pyenv` will read the `.python-version` file and load the correct version.
From there you can issue the following commands to load a Python virtual environment and install the dependencies. Run `deactivate` at any time to exit the virtual environment.

```
$ python -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
```
## Testing, coverage and linting
```
$ python -m unittest discover
$ coverage run -m unittest discover
$ coverage report -m
$ pylint --recursive=y ./analytics_fetcher
```
## Authentication with Google Analytics
To make the data fetch from Google Analytics work, you’ll need to fetch a `client_secrets.json` file from Google containing credentials, and use that to generate a refresh token. This refresh token must then be passed to the script via an environment variable.
Some details on generating these credentials are given in the GA tutorial, but in summary:

- Create (or use an existing) Google account with access to the Google Analytics profile for www.gov.uk.
- Create a project in the Google Developers Console.
- For the project, go to the “APIs & auth” section of the dashboard, and ensure that the “Analytics API” is turned on.
- Go to the “Credentials” section of the dashboard, and click the “Create New Client ID” button to create a new OAuth 2.0 client ID.
- Pick the “Installed Application” option, with a type of “other”.
- Download the JSON for the newly created client (using the “Download JSON” button underneath it).
Run the following command to generate the refresh token:

```
PYTHONPATH=. python scripts/setup_auth.py /path/to/client_secrets.json
```
It will display a URL which you’ll need to open in a browser that’s signed in to the Google account that the client JSON was downloaded from; paste the result into the prompt. The command will then output (to stdout) a `GAAUTH` environment variable value, which needs to be set when calling the fetching script.
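For reference, the exchange behind this step is the standard OAuth 2.0 installed-application flow. Below is a minimal sketch using the `google-auth-oauthlib` library; this is an assumption for illustration, since `setup_auth.py` may use a different client library and wraps the token in its own `GAAUTH` encoding.

```python
from google_auth_oauthlib.flow import InstalledAppFlow

# Standard installed-app OAuth flow; a sketch, not the code in setup_auth.py.
flow = InstalledAppFlow.from_client_secrets_file(
    "client_secrets.json",
    scopes=["https://www.googleapis.com/auth/analytics.readonly"],
)
credentials = flow.run_local_server(port=0)  # opens a browser for consent
print(credentials.refresh_token)  # the long-lived token that GAAUTH is built from
```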
Delete the `client_secrets.json` file after use; it shouldn’t be needed again, and this ensures it doesn’t get leaked (eg, by committing it to git).
Run the fetch script (`scripts/fetch.py`) with environment variables set. See below for details.
Don’t commit any of the generated secrets to this git repo! For regular runs from Jenkins, pass the environment variables in from the Jenkins jobs.
Ensure that the virtualenv is activated, and then run:
```
GAAUTH='...' PYTHONPATH=. python scripts/fetch.py page-traffic.dump 14
```
(`GAAUTH` is the value obtained from the `setup_auth.py` script, and the final argument, `14`, is the number of days to fetch analytics data for.)
This will generate a file called `page-traffic.dump`, which is in Elasticsearch bulk load format and can be loaded into the search index using the script in search-api. It contains information on the amount of traffic each page on GOV.UK got (after some normalisation).
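If you ever need to load the dump by hand rather than via search-api, any client that can POST newline-delimited JSON to Elasticsearch’s `_bulk` endpoint will do. A minimal sketch using `requests` (the URL is an assumption; the target index comes from the dump’s own action lines):

```python
import requests

def bulk_load(dump_path, es_url="http://localhost:9200"):
    """POST an Elasticsearch bulk-format file to the _bulk endpoint."""
    with open(dump_path, "rb") as f:
        resp = requests.post(
            f"{es_url}/_bulk",
            data=f,  # requests streams the file body
            headers={"Content-Type": "application/x-ndjson"},
        )
    resp.raise_for_status()
    return resp.json()

bulk_load("page-traffic.dump")
```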
The fetching script fetches data from GA by making requests for each day’s data. It caches the results for each day, so that it doesn’t need to repeat all the requests when run on a subsequent day. By default, the cache is placed in a directory called `cache` at the top level of a checkout; the location can be controlled by passing a path in an environment variable. Entries older than 30 days are removed from the cache at the end of each run of the fetch script.
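A sketch of that caching behaviour (the file names and on-disk layout here are assumptions, not the repo’s actual cache format):

```python
import json
import time
from pathlib import Path

CACHE_DIR = Path("cache")  # default location, per the description above
MAX_AGE_DAYS = 30

def cached_fetch(day, fetch_fn):
    """Return one day's data from the cache, fetching and storing on a miss."""
    CACHE_DIR.mkdir(exist_ok=True)
    entry = CACHE_DIR / f"{day.isoformat()}.json"
    if entry.exists():
        return json.loads(entry.read_text())
    data = fetch_fn(day)
    entry.write_text(json.dumps(data))
    return data

def expire_old_entries():
    """Remove cache entries older than 30 days, as done at the end of a run."""
    cutoff = time.time() - MAX_AGE_DAYS * 24 * 3600
    for entry in CACHE_DIR.glob("*.json"):
        if entry.stat().st_mtime < cutoff:
            entry.unlink()
```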
## Running the popularity update without retrieving GA data
When running the full script on integration and staging, it is desirable that we don’t retrieve new data from GA.
This can be achieved with the following command:
## The dump format
The dump is in Elasticsearch bulk load format. It looks like this:
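An illustrative pair of lines (the index name and values here are made up for illustration; the real dump’s action lines determine where documents land):

```
{"index": {"_index": "page-traffic", "_id": "/vat-rates"}}
{"path_components": ["/", "/vat-rates"], "rank_14": 42, "vc_14": 900, "vf_14": 0.0003}
```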
There are `vc_%i`, `vf_%i` and `rank_%i` entries for each range of days, though the nightly load script only uses 14-day ranges.
The fields are:
- `path_components`: the path and all of its prefixes.
- `rank_%i`: the position of that page after sorting all pages by view count.
- `vc_%i`: the number of page views in the day range.
- `vf_%i`: the `vc_%i` of the page divided by the sum of the `vc_%i` values for all pages (see the sketch below).
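To make the normalisation concrete, here is a small sketch computing these fields from a mapping of page path to raw view counts (a hypothetical helper, not code from this repo):

```python
def dump_fields(view_counts, days=14):
    """Compute vc_%i, vf_%i and rank_%i for every page in view_counts."""
    total = sum(view_counts.values())
    # Rank pages by view count, highest-traffic page first.
    ranked = sorted(view_counts, key=view_counts.get, reverse=True)
    fields = {}
    for rank, path in enumerate(ranked, start=1):
        vc = view_counts[path]
        fields[path] = {
            f"vc_{days}": vc,
            f"vf_{days}": vc / total,  # this page's share of all traffic
            f"rank_{days}": rank,
        }
    return fields

print(dump_fields({"/": 2100, "/vat-rates": 900}))
```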