Understanding Data Mutability: The Dual Nature of Event and Entity Data

25.10.23

There are only two types of models in a data warehouse: event data and entity data. We will dive into this crucial distinction, its implications for data change management, and the best practices we employ at Tasman.

Delineating Between Event and Entity Data

Broadly speaking, the vast ocean of data within a data warehouse can be categorised into two distinct types:

1. Event Data

This form of data charts an event or occurrence pegged to a specific point in time. Classic examples encompass user interactions, transactions, and system logs. Owing to their very nature, such events are immutable. Once recorded, they stand unaltered, offering a reliable historical record.

2. Entity Data

Diverging from event data, entity data pertains to distinct items or subjects. Customers, accounts, and orders all fall under this umbrella. These are intrinsically mutable, capable of changing unpredictably over time.

In simple terms, if you consider an action like account_created, the account embodies the entity, while created signifies the event. And if we’re venturing into the realm of dimensions and facts: dimensions echo the characteristics of entity data, whereas facts could either be event data (like individual transactions) or entity data (such as aggregated statistics over a particular span).

Mutability: A Core Concept in Data Modelling

It’s important to treat mutability as a conceptual property of the data, independent of how that data happens to be physically stored:

Events chronicle occurrences frozen at a precise moment in time. Once they have occurred, they remain as-is, and the only progression lies in recording subsequent events. Such event data can be processed once and left undisturbed. In essence, events are conceptually immutable.
Entities, however, are fluid. They embody objects or subjects whose attributes can be modified. For instance, a customer might update their contact information, or an order might be revised. This malleability of entities gives rise to what we term ‘slowly changing dimensions’.
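The contrast can be sketched in a few lines of Python. This is only an illustration of the concept, not a representation of any particular warehouse: the Event and Customer classes, their fields, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """An occurrence at a point in time; frozen, so it cannot be altered."""
    name: str
    entity_id: str
    occurred_at: datetime


@dataclass
class Customer:
    """An entity; its attributes may change at any time."""
    customer_id: str
    email: str


# Events are only ever appended to the log, never edited.
event_log = [Event("account_created", "c1", datetime(2023, 1, 1))]

# Entities change in place; the old value is lost unless the change is tracked.
customer = Customer("c1", "old@example.com")
customer.email = "new@example.com"
event_log.append(Event("email_updated", "c1", datetime(2023, 6, 1)))

# Attempting to rewrite a recorded event fails.
try:
    event_log[0].name = "account_deleted"
    event_was_mutated = True
except FrozenInstanceError:
    event_was_mutated = False
```

Note that the only way the customer’s previous email survives here is if something records the change, which is exactly the problem slowly changing dimensions address.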

Why Is It Crucial To Manage Data Change?

Without an effective strategy for managing these mutable dimensions, the repercussions are manifold:

Reporting becomes a minefield of inaccuracies.
Valuable historical data and its associated context dissipate.
The breadth and depth of data processing get restricted.

Illustrative Case Studies

1. RevenueCat’s Schema Conundrum

RevenueCat, a frontrunner in subscription management, dispatches extracts of subscription details. The initial extract showcases a complete history, with subsequent extracts revealing new and updated subscriptions. However, prior to their commendable V4 schema update, the absence of an update timestamp made consistent data accuracy nearly impossible. This pitfall was evident when even RevenueCat’s own documented queries faltered.

2. Client Backend Data Woes

Collaborating with a client revealed the complications of untracked backend data modifications. The lack of a mechanism to monitor row updates necessitated replicating the entire backend for any data extraction. Given the volume and associated costs, such operations could only be executed weekly, a sub-optimal arrangement.
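To illustrate what a row-level update timestamp would make possible, here is a minimal sketch of incremental extraction against a stored high-water mark. It does not describe the client’s actual backend: the updated_at column, the extract_incremental function, and the sample rows are assumptions made purely for illustration.

```python
from datetime import datetime


def extract_incremental(rows, last_watermark):
    """Return the rows changed after last_watermark, plus the new watermark.

    Assumes every row carries a trustworthy `updated_at` value (hypothetical).
    """
    changed = [r for r in rows if r["updated_at"] > last_watermark]
    # If nothing changed, keep the previous watermark.
    new_watermark = max((r["updated_at"] for r in changed), default=last_watermark)
    return changed, new_watermark


backend_rows = [
    {"id": 1, "status": "active", "updated_at": datetime(2023, 10, 1)},
    {"id": 2, "status": "churned", "updated_at": datetime(2023, 10, 20)},
    {"id": 3, "status": "active", "updated_at": datetime(2023, 10, 22)},
]

# Only rows touched since the last run need to be copied.
changed, watermark = extract_incremental(backend_rows, datetime(2023, 10, 15))
```

With a watermark like this, each run copies only what has changed rather than the whole backend. One caveat of the strict comparison: a row written later with exactly the same timestamp as the watermark would be skipped, so a real pipeline may prefer an overlapping window plus deduplication.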

Tasman’s Approach to Mutable Dimensions

At Tasman, our modus operandi leans heavily towards type 2 slowly changing dimensions. This allows us to maintain a comprehensive historical record of entity changes. Armed with an entity key paired with an updated timestamp, it’s straightforward to craft valid time intervals for any entity. Such intervals not only facilitate testing but also ensure accuracy in data joins and simplify the task of retrieving the latest data.
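As a rough sketch of how valid time intervals follow from an entity key and an updated timestamp, the Python below turns a list of entity versions into type 2 rows and looks up the version valid at a given moment. It is not Tasman’s implementation: the field names (entity_key, updated_at, valid_from, valid_to), the helper functions, and the sample data are hypothetical, and in practice this logic would more likely live in SQL inside the warehouse.

```python
from datetime import datetime
from itertools import groupby
from operator import itemgetter


def build_valid_intervals(versions):
    """Attach valid_from / valid_to to each version of each entity.

    valid_to is the next version's updated_at, or None for the latest version.
    """
    ordered = sorted(versions, key=itemgetter("entity_key", "updated_at"))
    rows = []
    for _, group in groupby(ordered, key=itemgetter("entity_key")):
        history = list(group)
        for current, following in zip(history, history[1:] + [None]):
            rows.append({
                **current,
                "valid_from": current["updated_at"],
                "valid_to": following["updated_at"] if following else None,
            })
    return rows


def as_of(rows, entity_key, ts):
    """Return the version of an entity valid at ts (half-open interval)."""
    for row in rows:
        if (row["entity_key"] == entity_key
                and row["valid_from"] <= ts
                and (row["valid_to"] is None or ts < row["valid_to"])):
            return row
    return None


versions = [
    {"entity_key": "c1", "plan": "free", "updated_at": datetime(2023, 1, 1)},
    {"entity_key": "c1", "plan": "pro", "updated_at": datetime(2023, 3, 1)},
    {"entity_key": "c2", "plan": "free", "updated_at": datetime(2023, 2, 1)},
]

scd2 = build_valid_intervals(versions)
plan_in_february = as_of(scd2, "c1", datetime(2023, 2, 1))["plan"]
latest = [row for row in scd2 if row["valid_to"] is None]
```

Because the intervals are half-open (valid_from inclusive, valid_to exclusive), each moment maps to at most one version of an entity. That makes the intervals easy to test for gaps and overlaps, keeps point-in-time joins unambiguous, and reduces retrieving the latest data to filtering on an empty valid_to.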

Final Thoughts

Understanding the dichotomy of event and entity data is the cornerstone of effective data management. As providers and handlers of data, recognising the impact of change tracking and mutability on data modelling is non-negotiable. In our ever-shifting data landscape, embracing these foundational concepts equips us to harness the true potential of our data reservoirs.