govuk-data-science-workshop: Pre-commit hooks
This repository uses the Python package `pre-commit` to manage pre-commit hooks.
Pre-commit hooks are actions which are run automatically, typically on each commit, to
perform some common set of tasks. For example, a pre-commit hook might be used to run
code linting automatically before code is committed, ensuring consistent code quality.
Purpose
For this repository, we are using `pre-commit` for a number of purposes:
- checking for secrets being committed accidentally (there is a strict definition of a "secret");
- checking for any large files (over 5 MB) being committed; and
- cleaning Jupyter notebooks, which means removing all outputs, execution counts, Python kernels, and, for Google Colaboratory (Colab), stripping out user information.
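As a rough illustration of the large-file check, the sketch below flags files above a size limit. It is not the hook this repository uses; the function name `find_large_files` and the reading of "5 MB" as 5 × 1024 × 1024 bytes are assumptions made for the example.

```python
from pathlib import Path

# Assumption: "5 MB" is read here as 5 * 1024 * 1024 bytes.
MAX_BYTES = 5 * 1024 * 1024


def find_large_files(paths, max_bytes=MAX_BYTES):
    """Return the paths that point to files larger than max_bytes."""
    large = []
    for path in paths:
        candidate = Path(path)
        # Skip anything that is not a regular file (directories, missing paths).
        if candidate.is_file() and candidate.stat().st_size > max_bytes:
            large.append(path)
    return large
```

A hook built along these lines would block the commit whenever the returned list is not empty.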
We have configured `pre-commit` to run automatically on every commit. Running on each
commit lets `pre-commit` catch contraventions before they enter the repository history,
which keeps our repository in a healthy state.
No pre-commit hooks will be run on Google Colab notebooks pushed directly to GitHub.
For security reasons, it is recommended that you manually download your notebook and
commit it locally, to ensure pre-commit hooks are run on your changes.
Installation
In order for `pre-commit` to run, you need to configure it on your system:
- install the `pre-commit` package into your Python environment from `requirements.txt`; and
- run `pre-commit install` in your terminal to set up `pre-commit` to run when code is committed.
Using the `detect-secrets` pre-commit hook
The `detect-secrets` package does its best to prevent accidental committing of secrets,
but it may miss things. Do not rely on it alone; focus on good software development
practices! See the [definition of a secret for further
information](#definition-of-a-secret-according-to-detect-secrets).
We use `detect-secrets` to check that no secrets are accidentally committed. This hook
requires you to generate a baseline file if one is not already present within the root
directory. To create the baseline file, run the following at the root of the repository:
detect-secrets scan > .secrets.baseline
Next, audit the baseline that has been generated by running:
detect-secrets audit .secrets.baseline
When you run this command, you'll enter an interactive console. This will present you with a list of high-entropy strings and/or anything else which could be a secret, and ask you to verify whether each one really is a secret. This allows the hook to remember false positives in the future, and alert you to new secrets.
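To give a feel for what "high-entropy" means here, the sketch below computes the Shannon entropy of a string in bits per character. It illustrates the general idea only and is not the package's implementation; the function name `shannon_entropy` is hypothetical, and no detection threshold is implied.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of text, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # Sum -p * log2(p) over the observed character frequencies.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Strings with many distinct, evenly spread characters (as random tokens tend to have) score higher than strings with repeated characters, which is why they are surfaced for review.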
Definition of a "secret" according to detect-secrets
The `detect-secrets` documentation, as of January 2021, says it works:
...by running periodic diff outputs against heuristically crafted [regular expression] statements, to identify whether any new secret has been committed.
This means it uses regular expression patterns to scan your code changes for anything
that looks like a secret according to the patterns. By definition, there are only a
limited number of patterns, so the `detect-secrets` package cannot detect every
conceivable type of secret.
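The limitation can be seen in a minimal sketch of pattern-based scanning. The two patterns and the `scan_lines` function below are hypothetical, written for this illustration; they are not the rules `detect-secrets` ships with.

```python
import re

# Hypothetical patterns for illustration only; not detect-secrets' own rules.
PATTERNS = {
    "aws-style access key id": re.compile(r"AKIA[0-9A-Z]{16}"),
    "password assignment": re.compile(r"(?i)\bpassword\s*=\s*['\"][^'\"]+['\"]"),
}


def scan_lines(lines):
    """Return (line_number, pattern_name) for each line that matches a pattern."""
    findings = []
    for number, line in enumerate(lines, start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((number, name))
    return findings
```

A line such as `token = "tok_live_abc"` matches neither pattern, so this scanner would let it through. The same is true of any secret whose shape no pattern anticipates.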
To understand what types of secrets will be detected, read the `detect-secrets`
documentation on caveats, and the list of supported plugins. You should also name
variables that hold secrets using words that will trip the `KeywordDetector` plugin; see the
[`DENYLIST` variable for the full list of words][detect-secrets-keyword-detector].
If `pre-commit` detects secrets during commit
If `pre-commit` detects any secrets when you try to create a commit, it will detail
what it found and where to go to check the secret.
If the detected secret is a false positive, there are two options to resolve this and prevent your commit from being blocked:
- inline allowlisting of false positives (recommended); or
- updating the `.secrets.baseline` to include the false positives.
In either case, if an actual secret is detected (or a combination of actual secrets and false positives), first remove the actual secret. Then follow either of these processes.
Inline allowlisting (recommended)
To exclude a false positive, add a `pragma` comment such as:
secret = "Password123" # pragma: allowlist secret
or
# pragma: allowlist nextline secret
secret = "Password123"
If the detected secret is actually a secret (or other sensitive information), remove
the secret and re-commit; there is no need to add any `pragma` comments.
If your commit contains a mixture of false positives and actual secrets, remove the
actual secrets first before adding `pragma` comments to the false positives.
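To make the two comment forms concrete, here is a rough sketch of how a scanner could work out which lines are allowlisted. The regular expressions and the `allowlisted_line_numbers` function are hypothetical, written for this illustration; `detect-secrets` has its own implementation, which may recognise more comment styles.

```python
import re

# Hypothetical matchers for the two pragma forms shown above.
ALLOW_SAME_LINE = re.compile(r"pragma:\s*allowlist secret")
ALLOW_NEXT_LINE = re.compile(r"pragma:\s*allowlist nextline secret")


def allowlisted_line_numbers(lines):
    """Return the 1-based numbers of lines exempted by a pragma comment."""
    allowed = set()
    for number, line in enumerate(lines, start=1):
        if ALLOW_NEXT_LINE.search(line):
            # The comment exempts the line that follows it.
            allowed.add(number + 1)
        elif ALLOW_SAME_LINE.search(line):
            # The comment exempts the line it sits on.
            allowed.add(number)
    return allowed
```

Only the marked lines are exempted, so a new secret added elsewhere in the same file is still reported.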
Updating .secrets.baseline
To exclude a false positive, you can also update the `.secrets.baseline` by repeating
the same two commands as in the initial setup: `detect-secrets scan > .secrets.baseline`,
followed by `detect-secrets audit .secrets.baseline`.
During auditing, if the detected secret is actually a secret (or other sensitive
information), remove the secret and re-commit. There is no need to update the
`.secrets.baseline` file in this case.
If your commit contains a mixture of false positives and actual secrets, remove the
actual secrets first before updating and auditing the `.secrets.baseline` file.
Keeping specific Jupyter notebook outputs
It may be necessary or useful to keep certain output cells of a Jupyter notebook, for
example charts or graphs visualising some set of data. To do this, according to the
documentation for the `nbstripout` package, either:
- add a `keep_output` tag to the desired cell; or
- add `"keep_output": true` to the desired cell's metadata.
You can access cell tags or metadata in Jupyter by enabling the "Tags" or "Edit Metadata" toolbar (View > Cell Toolbar > Tags; View > Cell Toolbar > Edit Metadata).
For the tags approach, enter `keep_output` in the text field for each desired cell, and
press the "Add tag" button. For the metadata approach, press the "Edit Metadata" button
on each desired cell, and edit the metadata to look like this:
{
"keep_output": true
}
This will tell the hook not to strip the resulting output of the desired cell(s), allowing the output(s) to be committed.
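The effect of the tag and the metadata flag can be sketched on a notebook loaded as a plain dictionary (for example with `json.load`). This is a simplified illustration, not the `nbstripout` implementation: the `strip_outputs` function is hypothetical, and it only clears outputs, leaving execution counts, kernel information and Colab user information untouched.

```python
import copy


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of the notebook with code-cell outputs cleared,
    except for cells marked with keep_output (tag or metadata flag)."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells have no outputs
        metadata = cell.get("metadata", {})
        keep = (
            metadata.get("keep_output") is True
            or "keep_output" in metadata.get("tags", [])
        )
        if not keep:
            cell["outputs"] = []
    return cleaned
```

Because the function works on a deep copy, the notebook passed in is left unchanged.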
Currently (March 2020) there is no way to add tags and/or metadata to Google Colab
notebooks.
If you want to keep some outputs, it is strongly suggested that you download the Colab
notebook as a `.ipynb` file, and edit tags and/or metadata using Jupyter before
committing the code.