**Warning This document has not been updated for a while now. It may be out of date.**

# govuk-content-similarity: Document Similarity: SOMs

- Avision Ho
- 20th October 2020

# Introduction

## Aims

- Create rudimentary document vectors using doc2vec for experimenting with SOMs.
- Applies Self-Organising Maps (SOMs) to identify and cluster semantically similar GOV.UK content.
- Visualises these clusters interactively using bokeh.

## What are SOMs?

SOMs is primarily a dimensionality-reduction algorithm that takes an input space and represents it in a lower-dimensional space. It is an artificial neural network that is trained using unsupervised learning.

Unlike other artificial neural networks, they apply competitive learning instead of error-correction learning (such as backpropagation with gradient descent). They use a neighbourhood function to preserve the topological properties of the input space.

The algorithm works as follows:

- A set of document vectors,
`k`

, are created; this is your training data. - A set of
`mxn`

nodes with their weights are initialised.- Note,
`m * n < k`

. - A rule of thumb for determining
`m`

and`n`

is that they should approximately be the ratio of the two largest eigenvalues of the training data’s covariance matrix.

- Note,
- A document vector is chosen randomly from the training data.
- Every node is examined to compute which one's weights are closest to the randomly chosen document vector in the previous step.
- The node that is closest to the randomly chosen document vector, the winning node, is commonly known as the
**Best Matching Unit (BMU)**. - The measure of distance to define
*'closeness'*is typically Euclidean distance though for some implementations of SOMs, this can be configured to take other measures like cosine-similarity or manhattan distance.

- The node that is closest to the randomly chosen document vector, the winning node, is commonly known as the
- The neighbourhood of the BMU is computed. The amount of neighbours decreases over time.
- The weight associated to the BMU becomes more like the chosen document vector; where the closer a node is to the BMU, the more its weights get altered and the further away the neighbour is from the BMU, the less it gets altered.
- Repeat steps 2-5 for
`k`

iterations.

## Sounds like PCA...

Can be considered a non-linear generalisation of PCA, with many advantages over conventional feature extraction methods such as PCA.

## What's the catch?

Drawbacks of SOMs are:

- It does not build a generative model for your data so it does not understand how the data is created. This means it does consider the distribution of your data so cannot tell you how likely a given example is.
- Time preparing the model is slow so hard to train against evolving data.
- Does not work well with categorical data and even less so for mixed types data.

# Let's apply it!

## Data load and preprocessing

`notebooks/soms/01_load_clean_.py`

We download the pre-processed content store of all the pages on GOV.UK from AWS on store it in our data directory.

Is approximately over 530,000 GOV.UK pages.

Then remove non-alphabetic characters, NAs and duplicates from the GOV.UK text.

To remove possible noise further from our text data, we also lemmatise the text and remove stopwords.

## Document vector creation

`notebooks/soms/02_embedding.py`

We use doc2vec to create the document vectors. The decision to use this technique was to quickly get some vectors representing documents set-up so that we can test out SOMs.

There are probably better alternative ways to generate these vectors, especially given there are literature concerns over doc2vec. A few are provided below:

- Universal Sentence Encoder sentence embeddings and then a form of averaging of these. This is actually done in a parallel workstream by @whojammyflip

## Applying SOMs

`notebooks/soms/03_som_visualise_minisom.py`

We use the minisom implementation of SOMs for its ease-of-use, continual maintenance and comprehensive examples.

We apply the SOM via:

- Taking the ~500,000 document vectors with 300 features and normalise them for better SOM performance.
- Initialising a 1 x 839 set of nodes, 2D SOM, with randomly generated weights.
- Training a SOM on this using cosine-similarity as our distance metric to obtain the BMU.

After training the SOM, have mapped these ~500,000 samples into (1*839=)839 nodes. In this way, Swe have mapped from X to the reduced space, Y, where X is of ~500,000 and Y is of 839.

We then compute the coordinates of the winning neuron, BMU, from the set of 1 x 839 nodes, and convert these coordinates into an index which act as our clusters. This gives us a `cluster_index`

for each document vector which we can then column-bind to our each `base_path`

or `text`

so we know what cluster each GOV.UK page belongs to.

[*stackoverflow, ASantosRibeiro 2014*]

Thinking now is that we have identified 100 clusters from our 294 node mapping of our 3462 samples dataset of document vectors.

- Want to then visualise these 100 clusters and possibly have a drill-through (stretch goal) or extract list of each document vector relating to each category.

## Plotting

`notebooks/soms/04_bokeh.ipynb`

`notebooks/soms/04_matplotlib.ipynb`

Due to SOMs being a dimensionality-reduction technique, they can be used to create clusters for identifying semantically-similar content. This can then be visualised by a U-Matrix of the SOM, which illustrates the Euclidean distance between weight vectors of neighbouring cells.

In terms of interpreting a U-Matrix, a good explanation is provided here. In particular, the colours on the U-Matrix represents how close the neurons in the Y map are, with lighter colours representing smaller distances between neurons and darker colours representing larger distances between neurons. In this way, the light shades of a U-Matrix signify clusters of similar neurons whereas darker shades are cluster boundaries. These are clusters of the SOM nodes.

In the context of the plot we created in these notebooks, black dots represent sets of contents within each hexagon.

The content in the purple hexagons that's displayed when hovering over a interactive plot in `04_bokeh.ipynb`

are identified as being semantically similar to the content within other purple hexagons in its neighbourhood.

The content in the yellow hexagons are identified as being semantically dis-similar to content within other yellow and purple hexagons in its neighbourhood.

Therefore, the content in the purple hexagons represent semantically-similar GOV.UK content pages. These can be used in a product with a UI that allows the user to filter for subsets of pages and then seeing the hexagons. They can then explore further by clicking-through on the page link, `base_path`

. In this way, content designers can use their expertise to understand if the pages being identified as semantically-similar have overlapping intents and thereby they can make a decision on whether to start the process for requesting one of these pages to be taken offline, so as to not confuse users.