content-data-api: Track attribute changes through time for basepaths
The Data Warehouse tracks changes to Content Items through time so, for example, if we change the organisation attached to a Content Item, we should know WHEN that change happened and WHAT changed.
Tracking changes not only applies to other entities associated to the Dimension
Items, but also to core attributes like the
rendering-app, or any other attribute. This is the foundation of our Data
Warehouse: track changes and track performance at the same time.
In GOV.UK we can identify a Content Item on a unique way by their
It is the combination of a
content_id and a
locale what makes a Content
Item unique. So on our first iteration of the DW, when we decided to use the
Content Item as the
grain, hoping to be able to track all our changes by
Shortly after we noticed that most of our metrics are not related to a Content
Item but the
base_path of the item; the is due to some Content Items like
Guides or Travel Advice, has multiple
base_paths and we need to track metrics
at that level (sub item). This was a business requirement driven by user
So we decided to reduce the scope of the
grain to the
base_path of the
Content Item, which simplified and fulfilled our user needs. It was accepted by
the business that if the
base_path changes, then we won't be able to track
the series backwards. We decided to postpone dealing with this issue when we
had more information about it.
As of today, we have learned more about our real needs for querying metrics.
Aggregating metrics through time for a Content Item with a base_path, need to
be done via its
content_id, and there some edge cases to handle to be
accurate. Unfortunately, to address this issues, we need to make our queries
more complex, and the codebase is way more complicated to maintain and reason
The underlying problem is that we don't have a way to track our grain
(base_path) through time because all the attributes can be changed, and
overcoming this limitation using
locales it is just not
worth the effort.
In summary, we would need to provide an efficient way to:
base_path in their latest version, select all the metrics through
time, regardless of the changes to the item.
At the very end, what we are asking for is for a real unique identifier for the
item, because we actually don't have it, because the
content_id is not
playing that role.
Associate a unique identifier to each item that is tracked in the dimensions item.
base_path, which represents the grain of Data Warehouse, will have a
unique identifier that will never change. This ID will allow us to perform
queries through time in an efficient way, and also to be accurate with our
On the same go, all the Contnet Items that share a locale will also be uniquely referenced in the Data Warehouse.
We would not need to add this unique identifier if
The sub-items for Content with multiple paths have a unique UID for each
base_path. Unfortunately, to implement this feature we would need to modify each publishing application; which would impact our immediate delivery, so it needs a wider conversation.
All the content items that have different locales but same ID have a different
It has a low impact in our codebase; we only need to generate it once (on creation), then it will be cloned with each version, so at the very end, a few lines of code with an important impact in our delivery.