Repository: govuk-mongodb-content
Use Docker to run a local MongoDB instance containing the GOV.UK Content Store and inspect it with Mongo Express.
- GitHub: govuk-mongodb-content
- Ownership: #data-products
- Hosting: N/A
- Category: Data science
README
Get Docker
Download and install Docker. If you are unfamiliar with Docker, work through its tutorials before proceeding.
Check your Docker works by running the following code in your terminal:
docker run hello-world
Get help with Docker by running:
docker help
Docker and MongoDB
Prepare data
Download a copy of the production version of the Content Store. It can be found in the `govuk-integration-database-backups` AWS S3 bucket, in the `mongo-api` folder. The file is called:
{DATETIME}-content_store_production.gz
where `{DATETIME}` is a datetime in `YYYY-MM-DDTHH:MM:SS` format.
If you do not have access to AWS, ask a fellow GOV.UK data scientist or developer for assistance.
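If you do have access, the download can be sketched with the AWS CLI as follows. This is a minimal sketch rather than a documented procedure: it assumes the AWS CLI is installed and authenticated, and that the file sits directly under `mongo-api/` in the bucket. The `backup_s3_uri` helper and the example timestamp are made up for illustration.

```shell
# Bucket and folder names come from the instructions above.
BUCKET="govuk-integration-database-backups"
FOLDER="mongo-api"

# Hypothetical helper: build the S3 URI for a backup taken at a given datetime.
# Assumes the file sits directly under the folder, which may not match the real layout.
backup_s3_uri() {
  printf 's3://%s/%s/%s-content_store_production.gz\n' "$BUCKET" "$FOLDER" "$1"
}

# Example usage (commented out because it needs AWS credentials):
#   aws s3 ls "s3://$BUCKET/$FOLDER/"                    # list the available backups
#   aws s3 cp "$(backup_s3_uri 2020-01-31T04:05:06)" .   # hypothetical timestamp
```

Listing the folder first is the safer route, since it shows the real timestamps instead of relying on a guessed one.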
Extract the `content_items.bson` file from the `content_store_production` folder of the downloaded file by entering the following command in your terminal:
tar -xvf PATH/TO/DATETIME-content_store_production.gz content_store_production/content_items.bson
The database file `content_items.bson` is now in `content_store_production/`. Move the `.bson` file into `/tmp/mongodb/` (or whichever location you choose in the `docker run` command in the next section). Note that you may have to create this folder.
Your data should now be in a directory that your container will be able to access.
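The move can be sketched as a small shell helper. The function name `stage_bson` is hypothetical, and the default destination is simply the `/tmp/mongodb` path used in this README.

```shell
# Hypothetical helper: move the extracted .bson file into the directory
# that will be mounted into the container, creating the directory if needed.
stage_bson() {
  src="$1"                        # path to content_items.bson
  dest_dir="${2:-/tmp/mongodb}"   # host directory to mount (README default)
  mkdir -p "$dest_dir" || return 1
  mv "$src" "$dest_dir/" || return 1
  ls -l "$dest_dir"               # show what the container will see
}

# Example usage:
#   stage_bson content_store_production/content_items.bson /tmp/mongodb
```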
Get the Image and run the container
The first time you follow these steps, the image needs to be downloaded:
docker pull mongo
Wait for Docker to download the image. This may take a while.
Next run:
docker run --name govuk -d -v /tmp/mongodb:/data/db -p 27017:27017 mongo
where the arguments are:
--name: Name of the container.
-d: Start the container as a background (daemon) process. Omit this argument to run the container in the foreground.
-v: Mount the host directory /tmp/mongodb at /data/db in the container. The host path can be whatever you like. On systems other than macOS, `/tmp` might be held in RAM rather than on disk, so choose a different path unless you have plenty of RAM.
-p: Map the host port to the container port.
mongo: The last argument is the name or ID of the image. For reproducibility, specify a version by appending a tag after a colon, in the form `mongo:TAG`.
⚠️ If you have previously run a MongoDB container that is also called `govuk`, you will need to remove it first:
docker rm -f govuk
Use `docker ps` to check which containers are running.
You can stop or kill the container at any time and restart it with `docker start govuk`.
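The steps in this section can be combined into one helper. This is a minimal sketch: the function name `start_govuk_mongo`, the example host path and the `4.4` tag are illustrative assumptions, not choices made by this repository.

```shell
# Hypothetical helper: (re)create the govuk container from a pinned image tag.
#   $1: host directory holding content_items.bson (default: /tmp/mongodb)
#   $2: mongo image tag (default: 4.4, an assumption chosen for illustration)
start_govuk_mongo() {
  data_dir="${1:-/tmp/mongodb}"
  tag="${2:-4.4}"
  # Remove any previous container called govuk; ignore the error if there is none.
  docker rm -f govuk >/dev/null 2>&1 || true
  # Same flags as the docker run command above, with an explicit image tag.
  docker run --name govuk -d -v "$data_dir:/data/db" -p 27017:27017 "mongo:$tag"
}

# Example usage:
#   start_govuk_mongo "$HOME/mongodb" 4.4
#   docker ps --filter name=govuk    # confirm the container is running
```

One reason to pin a tag: recent `mongo` images (6.0 onwards) should ship `mongosh` rather than the legacy `mongo` shell used later in this README, so an unpinned image may not behave as described.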
Check for data
Open a bash shell in your recently spun-up govuk container with:
docker exec -it govuk bash
Check that you can see `content_items.bson` in the correct directory (i.e. the container can access the local volume specified above and the files in it). You can do so with normal command line operations such as `ls`.
If you followed the instructions so far, you should see `content_items.bson` listed under:
ls data/db
Restore MongoDB from .bson
From the bash shell of the container run:
mongorestore -d content_store -c content_items data/db/content_items.bson
This should start restoring the `content_items` collection in the `content_store` database on the MongoDB instance.
⚠️ If you have previously restored the database, add `--drop` so that the existing collection is dropped first; otherwise it won't be replaced with your newer version:
mongorestore --drop -d content_store -c content_items data/db/content_items.bson
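The restore can also be run from the host, without opening an interactive shell in the container. A minimal sketch, assuming the container is called `govuk` and the file is mounted at `/data/db` as described above; the function name `restore_content_items` is hypothetical.

```shell
# Hypothetical helper: run the restore from the host via docker exec.
# --drop removes any existing content_items collection before restoring.
restore_content_items() {
  docker exec govuk mongorestore --drop \
    -d content_store -c content_items /data/db/content_items.bson
}

# Example usage:
#   restore_content_items
```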
We can access the MongoDB instance from inside the container with:
mongo --port 27017
Type `help` for help in the mongo shell.
Run some of these commands to test the database is there.
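As one example of such a check, the document count of the restored collection can be read from the host. This is a sketch under assumptions: the container is called `govuk`, the image still ships the legacy `mongo` shell, and the function name `count_content_items` is hypothetical.

```shell
# Hypothetical helper: print the number of documents in content_store.content_items.
# Assumes an image that still ships the legacy mongo shell.
count_content_items() {
  docker exec govuk mongo content_store --quiet --eval 'db.content_items.count()'
}

# Example usage: report the count if the collection is non-empty.
#   n=$(count_content_items)
#   [ "$n" -gt 0 ] && echo "restored $n content items"
```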
Using toy data instead
If you just want to test this out on some toy data you can copy and paste from here and run in the mongo shell to create a collection to practice on.
Interact with MongoDB instance
Open a new terminal so that we can link to the `govuk` container from another container. The following command downloads the `mongo-express` image and runs a container from it. Mongo Express provides a user interface for managing your MongoDB databases.
docker run --link govuk:mongo -p 8081:8081 mongo-express
As a sense check, open the following address in your browser:
http://0.0.0.0:8081/
Leveraging the MongoDB with pymongo
We can connect to this data and use it for metadata extraction and other purposes. In these notebooks we extract the structural network, that is, the links embedded in each page.
In this repo, we provide a notebook that demonstrates how to connect to and extract the text from each piece of content for use in NLP applications.