When we founded Data Culpa, one of the first claims we made about the data engineering market was that data monitoring would tie into the testing of data. We believed then — and we still believe — that companies would treat data as a product in the same way they treat code as a product.
Recently, I went looking through a variety of product management books for a good definition of what a product is, but nothing was satisfactory. The distinction between a product and a human-powered service (consulting) is fairly obvious: a service is one-off, while a product is something that can be manufactured repeatedly.
So when we talk about data products, we mean, among other things, that the data can be generated in a repeatable way and that data fits a specification of some sort.
Making data fit a specification is fundamental to computing: file systems do it, the contents of files do it, and hardware architecture has to do it, too. I keep thinking about the Application Binary Interface (ABI) layer when I think about data alignment: of course nothing works when things don’t line up, just as we wouldn’t expect software to work if the ABI changed underneath it.
Data Contracts and Beyond
“Everyone” is talking about data contracts, by which I mean a vocal group on LinkedIn is talking about data contracts. There are two main properties of a data contract:
- The data contract is a specification of the schema.
- The producer and the consumer agree on the contract.
The problem most customers deal with is that data is consumed by data teams without the producer’s knowledge. For example, a customer-facing application generates some “exhaust” that the BI team pulls into its analytics. The app team hasn’t necessarily signed off on that exhaust being used, yet the BI team is acutely aware whenever the app team makes changes. The high-level human problem is that the app team and the BI team haven’t agreed on anything, and doing so would require meetings, time, and effort that both sides may view as a path of diminishing returns.
So people are discussing implementing data contracts by various means, such as agreeing upon a schema specification that is documented somewhere, maybe in a GitHub repository.
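As a minimal sketch of that idea, a checked-in data contract can be as simple as a schema mapping field names to expected types, plus a check that records honor it. The field names and types below are invented for illustration, not taken from any real contract:

```python
# A hypothetical data contract expressed as a schema spec that could live
# in a GitHub repository. All field names here are illustrative.
CONTRACT = {
    "customer_id": int,
    "signup_date": str,
    "plan": str,
}

def validate(record: dict, contract: dict = CONTRACT) -> list:
    """Return a list of violations; an empty list means the record
    honors the contract."""
    violations = []
    for field, expected_type in contract.items():
        if field not in record:
            violations.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            violations.append(
                f"wrong type for {field}: {type(record[field]).__name__}"
            )
    return violations
```

The producer runs this check before publishing; the consumer can run the same check on ingest, which is what makes the spec an "agreement" rather than documentation.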
But if it were that easy, everyone would have done it already.
Which brings me to the concept of data context.
Data Context Explained
In Data Culpa Validator, our data monitoring platform, we have a concept of a data context. The idea was inspired by social media platforms that let you view your data from another user’s perspective: for example, what does my old college roommate or my employer see when they look at my profile? With data contexts, we want to be able to answer the question, “What does the BI team care about when they pull in this data?”
The key thing is that most data models we have seen in the field have around 200 columns. I have many hypotheses about why that is, but that’s for another blog post. However, not all consumers of a data model care about every field; we often have customers tell us they don’t want alarms on certain fields.
So we invented data contexts to capture who cares about which fields from an alerting perspective. During monitoring, if something goes wonky with a set of fields that matters to a specific group, Slack alerts for those fields can be routed differently than general issues, for example.
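The routing idea can be sketched in a few lines. This is not Data Culpa’s implementation; the team names, field sets, and channel names are all invented to show the shape of the logic:

```python
# Hypothetical data contexts: each consumer group declares the fields it
# cares about and where its alerts should go. All names are illustrative.
CONTEXTS = {
    "bi-team": {"fields": {"revenue", "region"}, "channel": "#bi-alerts"},
    "ml-team": {"fields": {"features_v2"}, "channel": "#ml-alerts"},
}

def route_alert(anomalous_fields: set) -> dict:
    """Map each group whose fields are affected to its alert channel.
    Groups with no overlap are not notified."""
    routes = {}
    for team, ctx in CONTEXTS.items():
        if ctx["fields"] & anomalous_fields:
            routes[team] = ctx["channel"]
    return routes
```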
Naturally, we don’t want to make people update this information all the time by hand. So we added an API that enables consumers picking up a data feed to log the access with Data Culpa. This means we can track usage statistics per field and per client application. At a glance you can see who is using which fields and thus know the impact if you change or drop a field.
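To make the access-logging idea concrete, here is a sketch of the consumer-side bookkeeping under the hood. This is not the real Data Culpa API; the class and method names are invented for illustration:

```python
from collections import defaultdict

# Hypothetical usage tracker: consumers log which fields they read, and
# the monitor aggregates usage per field per client application.
class UsageTracker:
    def __init__(self):
        # field name -> set of client application names that read it
        self.usage = defaultdict(set)

    def log_access(self, client: str, fields: list):
        """Called by a consumer when it picks up a data feed."""
        for field in fields:
            self.usage[field].add(client)

    def consumers_of(self, field: str) -> set:
        """Who would be impacted if this field changed or were dropped?"""
        return self.usage.get(field, set())
```

The payoff is the last method: before changing or dropping a field, the producer can ask who is actually reading it.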
Getting Data Contracts Right with Self-Adapting Contracts
What we’re talking about is self-adapting data contracts. The big objection to data contracts is that they will slow things down. If your implementation has a way to automatically update, you can keep working at high velocity with high confidence. Why? Because the monitoring system becomes the source of truth for determining which consumers are using which fields from which data sets and at which points in time.
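One way to picture "the monitoring system as source of truth" is to derive the load-bearing fields from observed read events over a time window, rather than from a hand-maintained document. This is a sketch under assumed names, not the product’s implementation:

```python
from datetime import datetime, timedelta

# Hypothetical event log: (timestamp, consumer, field) read events as
# a monitoring system might record them.
def active_consumers(events, field, now, window_days=30):
    """Consumers that read `field` within the window. If the result is
    empty, the field can likely be changed without breaking anyone --
    the contract has "adapted" to actual usage."""
    cutoff = now - timedelta(days=window_days)
    return {c for (ts, c, f) in events if f == field and ts >= cutoff}
```

A stale consumer that stopped reading a field months ago simply ages out of the effective contract, which is what keeps the handshake from becoming another meeting.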
None of this requires that you use a specific data warehouse or database product. We offer this functionality even for an S3 share of Parquet or CSV files. All these data sources, like any others, are asynchronously connected to and tracked by Data Culpa Validator.
I believe a system like this reduces the overhead of a data contract down to a data handshake. And unlike a metadata-only approach, this system gives you monitoring with the handshake.
We are looking for beta customers to try out our implementation of this handshaking. If you’re interested in a self-hosted Docker container for this, send us a note at firstname.lastname@example.org