Warning: This document has not been updated for a while. It may be out of date.

Last updated: 9 Sep 2021

govuk-data-science-workshop: Pre-commit hooks

This repository uses the Python package pre-commit to manage pre-commit hooks. Pre-commit hooks are actions that run automatically, typically on each commit, to perform a common set of tasks. For example, a pre-commit hook might run code linting automatically before code is committed, ensuring a consistent standard of code quality.

Purpose

For this repository, we are using pre-commit for a number of purposes:

checking for secrets being committed accidentally (there is a strict definition of a "secret");
checking for any large files (over 5 MB) being committed; and
cleaning Jupyter notebooks, which means removing all outputs, execution counts, Python kernels, and, for Google Colaboratory (Colab), stripping out user information.
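
To make the large-file check concrete, here is a minimal sketch of the idea in Python. It is not the hook's implementation: the function name `find_large_files` is hypothetical, and reading "5 MB" as 5 × 1024 × 1024 bytes is an assumption.

```python
import os

# Assumption: "5 MB" is read as 5 * 1024 * 1024 bytes; the real hook's
# threshold and units may be configured differently.
MAX_BYTES = 5 * 1024 * 1024


def find_large_files(paths, max_bytes=MAX_BYTES):
    """Return the paths whose size on disk is greater than max_bytes."""
    return [path for path in paths if os.path.getsize(path) > max_bytes]


if __name__ == "__main__":
    # List every file in the current directory that exceeds the threshold.
    candidates = [name for name in os.listdir(".") if os.path.isfile(name)]
    for path in find_large_files(candidates):
        print(f"{path} is larger than {MAX_BYTES} bytes")
```

A check like this only looks at file size, so it cannot tell whether a large file is data, a model artefact, or something that is reasonable to commit.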

We have configured pre-commit to run automatically on every commit. By running on each commit, we ensure that pre-commit will be able to detect all contraventions and keep our repository in a healthy state.

No pre-commit hooks will be run on Google Colab notebooks pushed directly to GitHub.
For security reasons, it is recommended that you manually download your notebook and
commit it locally, so that the pre-commit hooks run on your changes.

Installation

For pre-commit to run, you need to set it up on your system:

install the pre-commit package into your Python environment from requirements.txt; and
run pre-commit install in your terminal to set up pre-commit to run when code is committed.
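
If you want to confirm that the second step worked, one option is to check for the hook script that `pre-commit install` writes. The sketch below assumes the default Git layout, where hooks live in `.git/hooks/`; it will give the wrong answer if `core.hooksPath` is set or if `.git` is a file (as in worktrees and submodules). The function name `hook_installed` is hypothetical.

```python
from pathlib import Path


def hook_installed(repo_root):
    """Return True if a pre-commit hook script exists in the default hooks folder.

    Assumption: hooks are stored in <repo_root>/.git/hooks, the Git default.
    """
    return (Path(repo_root) / ".git" / "hooks" / "pre-commit").is_file()


if __name__ == "__main__":
    # Run from the root of the repository.
    if hook_installed(Path.cwd()):
        print("pre-commit hook found")
    else:
        print("pre-commit hook not found; try running: pre-commit install")
```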

Using the `detect-secrets` pre-commit hook

The `detect-secrets` package does its best to prevent accidental committing of secrets,
but it may miss things. Instead, focus on good software development practices! See the
[definition of a secret for further
information](#definition-of-a-secret-according-to-detect-secrets).

We use detect-secrets to check that no secrets are accidentally committed. This hook requires you to generate a baseline file if one is not already present within the root directory. To create the baseline file, run the following at the root of the repository:

detect-secrets scan > .secrets.baseline

Next, audit the baseline that has been generated by running:

detect-secrets audit .secrets.baseline

When you run this command, you'll enter an interactive console. This will present you with a list of high-entropy strings and anything else that could be a secret, and ask you to verify whether each one really is a secret. This allows the hook to remember false positives in the future, and to alert you to new secrets.
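
For readers unfamiliar with the term, "high-entropy" means the characters in a string look close to random, as generated keys and tokens often do. The sketch below computes Shannon entropy in bits per character. It illustrates the concept only; the exact calculation and thresholds used by detect-secrets are not reproduced here, and the function name `shannon_entropy` is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Return the Shannon entropy of text, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # Sum -p * log2(p) over the observed frequency p of each distinct character.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    for candidate in ["aaaaaaaa", "Password123", "9f8a7c6e5d4b3a21"]:
        print(candidate, round(shannon_entropy(candidate), 2))
```

Strings with many repeated characters score low, while strings that use many different characters evenly score high. A memorable password can therefore score lower than a harmless hash, which is one reason the audit step asks a human to confirm each finding.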

Definition of a "secret" according to `detect-secrets`

The detect-secrets documentation, as of January 2021, says it works:

...by running periodic diff outputs against heuristically crafted [regular expression] statements, to identify whether any new secret has been committed.

This means it uses regular expression patterns to scan your code changes for anything that looks like a secret according to the patterns. By definition, there are only a limited number of patterns, so the detect-secrets package cannot detect every conceivable type of secret.
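
The following sketch shows the general idea of pattern-based scanning, and why it is limited to the patterns it knows about. The three patterns are hypothetical examples written for this illustration; they are not the patterns that detect-secrets ships, and `scan_lines` is not part of its API.

```python
import re

# Hypothetical patterns, for illustration only.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "keyword_assignment": re.compile(
        r"(?i)\b(?:password|secret|api_key)\b\s*=\s*['\"][^'\"]+['\"]"
    ),
}


def scan_lines(lines):
    """Return (line_number, pattern_name) for every pattern match found."""
    findings = []
    for number, line in enumerate(lines, start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((number, name))
    return findings


if __name__ == "__main__":
    sample = [
        'secret = "Password123"',
        'colour = "blue"',
        'token = "Password123"',  # same value, but no pattern matches this name
    ]
    print(scan_lines(sample))
```

In this sketch the third sample line goes undetected even though it holds the same value as the first, because none of the patterns recognise the variable name `token`. Any finite set of patterns has gaps of this kind.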

To understand what types of secrets will be detected, read the detect-secrets documentation on caveats, and the list of supported plugins. Also, you should use secret variable names with words that will trip the KeywordDetector plugin; see the [DENYLIST variable for the full list of words][detect-secrets-keyword-detector].

If `pre-commit` detects secrets during commit

If pre-commit detects any secrets when you try to create a commit, it will detail what it found and where to go to check the secret.

If the detected secret is a false positive, there are two options to resolve this, and prevent your commit from being blocked:

inline allowlisting of false positives (recommended); or
updating the .secrets.baseline to include the false positives.

In either case, if an actual secret is detected (or a combination of actual secrets and false positives), first remove the actual secret. Then follow either of these processes.

Inline allowlisting (recommended)

To exclude a false positive, add a pragma comment either on the same line as the false positive or on the line above it:

secret = "Password123"  # pragma: allowlist secret

#  pragma: allowlist nextline secret
secret = "Password123"

If the detected secret is actually a secret (or other sensitive information), remove the secret and re-commit; there is no need to add any pragma comments.

If your commit contains a mixture of false positives and actual secrets, remove the actual secrets first before adding pragma comments to the false positives.
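
To illustrate how the two pragma forms differ, the sketch below works out which line numbers each comment would exempt. It is a simplification that looks for the pragma text anywhere in a line; detect-secrets has its own parsing, and the function name `allowlisted_line_numbers` is hypothetical.

```python
SAME_LINE = "pragma: allowlist secret"
NEXT_LINE = "pragma: allowlist nextline secret"


def allowlisted_line_numbers(lines):
    """Return the 1-based line numbers exempted by allowlist pragma comments."""
    allowed = set()
    for number, line in enumerate(lines, start=1):
        if NEXT_LINE in line:
            # The "nextline" form exempts the line after the comment.
            allowed.add(number + 1)
        elif SAME_LINE in line:
            # The plain form exempts the line the comment sits on.
            allowed.add(number)
    return allowed


if __name__ == "__main__":
    source = [
        'secret = "Password123"  # pragma: allowlist secret',
        "",
        "#  pragma: allowlist nextline secret",
        'secret = "Password123"',
    ]
    print(sorted(allowlisted_line_numbers(source)))
```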

Updating `.secrets.baseline`

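To exclude a false positive, you can also update the .secrets.baseline file by repeating the two commands from the initial setup: first detect-secrets scan > .secrets.baseline to regenerate the baseline, then detect-secrets audit .secrets.baseline to mark the false positives.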

During auditing, if the detected secret is actually a secret (or other sensitive information), remove the secret and re-commit. There is no need to update the .secrets.baseline file in this case.

If your commit contains a mixture of false positives and actual secrets, remove the actual secrets first before updating and auditing the .secrets.baseline file.

Keeping specific Jupyter notebook outputs

It may be necessary or useful to keep certain output cells of a Jupyter notebook, for example charts or graphs visualising some set of data. To do this, according to the documentation for the nbstripout package, either:

add a keep_output tag to the desired cell; or
add "keep_output": true to the desired cell's metadata.

You can access cell tags or metadata in Jupyter by enabling the "Tags" or "Edit Metadata" toolbar (View > Cell Toolbar > Tags; View > Cell Toolbar > Edit Metadata).

For the tags approach, enter keep_output in the text field for each desired cell, and press the "Add tag" button. For the metadata approach, press the "Edit Metadata" button on each desired cell, and edit the metadata to look like this:

{
  "keep_output": true
}

This will tell the hook not to strip the resulting output of the desired cell(s), allowing the output(s) to be committed.
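
The sketch below shows the effect of the two options on a notebook loaded as a Python dictionary. It is an illustration of the behaviour described above, not the nbstripout implementation: it only handles cell outputs (not execution counts or kernel information), and the function name `strip_outputs` is hypothetical.

```python
import copy


def strip_outputs(notebook):
    """Return a copy of a notebook dict with code-cell outputs removed,
    except for cells marked with keep_output as a tag or in metadata."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        metadata = cell.get("metadata", {})
        keep = (
            metadata.get("keep_output") is True
            or "keep_output" in metadata.get("tags", [])
        )
        if not keep:
            cell["outputs"] = []
    return cleaned


if __name__ == "__main__":
    notebook = {
        "cells": [
            {"cell_type": "code", "metadata": {}, "outputs": ["a table"]},
            {"cell_type": "code", "metadata": {"tags": ["keep_output"]}, "outputs": ["a chart"]},
            {"cell_type": "code", "metadata": {"keep_output": True}, "outputs": ["a graph"]},
        ]
    }
    for cell in strip_outputs(notebook)["cells"]:
        print(cell["outputs"])
```

In a notebook file, tags are stored as a list under the cell's metadata, which is why both options end up in the same place.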

Currently (March 2020) there is no way to add tags and/or metadata to Google Colab
notebooks.

If you want to keep some outputs, it is strongly suggested that you download the Colab
notebook as a .ipynb file, and edit its tags and/or metadata using Jupyter before
committing it.
