ABOUT THIS BLOG

In this blog, we’ll discuss all things data quality and data observability. We’ll present interviews with leading data scientists, and we’ll discuss our own work building data observability solutions for data engineers and data scientists working in a broad range of industries.

Where Most Data Observability Solutions Fall Short

Sep 29, 2021 | Data Culpa

Observability is the analysis of a system based on its outputs. By analyzing the outputs of a system at various points, it should be possible to infer the internal state of the system and to diagnose problems the system is experiencing.

This sounds useful for data science and engineering, and it is. Data systems working with large volumes of complex data are obviously difficult to monitor and troubleshoot. A growing number of new software products promise to deliver “data observability” for pipelines and other data systems, so that data engineers and data scientists can discover and troubleshoot problems more quickly. Understandably, there’s a lot of interest in these products.

But there are two shortcomings with this model of data observability: lack of support for external data, and lack of support for historical comparisons.

External Data Sources

The first problem with data observability tools is that today’s data pipelines increasingly rely on external data sources. If external data sources change in unexpected ways, they can corrupt a data pipeline’s results. Measuring outputs of the pipeline at various points will indicate there’s a problem, but it won’t pinpoint the source.

This isn’t a trivial point. Monitoring external data is different from monitoring internal data. Chances are, someone in your organization is in full control of the schemas for your internal data. A business partner or other third party may have provided a description of their data schema, but chances are you’re not in as close contact with that organization’s data team as you are with your own. And chances are, you don’t have the time or budget to write countless unit tests to check for changes in data values and data types in every field of every schema from every external data source.

So you need an automated way of detecting changes in external data — a way that analyzes that data’s schema and values over time and automatically builds a model for that data you can rely on.
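To make that concrete, here is a minimal sketch of what such a check might look like, assuming the external feed arrives as parsed JSON records (Python dicts). The function names and sample records are invented for illustration and aren’t part of any particular product.

```python
from collections import defaultdict

def build_schema_profile(records):
    """Infer the set of Python types observed for each field."""
    profile = defaultdict(set)
    for record in records:
        for field, value in record.items():
            profile[field].add(type(value).__name__)
    return profile

def diff_profiles(baseline, current):
    """Report fields that appeared, disappeared, or changed type."""
    changes = []
    for field in baseline.keys() - current.keys():
        changes.append(f"field dropped: {field}")
    for field in current.keys() - baseline.keys():
        changes.append(f"new field: {field}")
    for field in baseline.keys() & current.keys():
        if baseline[field] != current[field]:
            changes.append(f"type change in {field}: {baseline[field]} -> {current[field]}")
    return changes

# Example: the partner feed silently turns "price" into a string and adds a field.
last_week = [{"sku": "A1", "price": 9.99}, {"sku": "B2", "price": 4.50}]
this_week = [{"sku": "A1", "price": "9.99", "currency": "USD"}]

for change in diff_profiles(build_schema_profile(last_week),
                            build_schema_profile(this_week)):
    print(change)
```

A real system would extend this baseline beyond field types to value ranges, null rates, and distributions, and would update it continuously as new data arrives.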

This type of modeling lies outside the scope of traditional observability, which is focused solely on outputs from internal processes designed and controlled by the organization itself.

Comparisons Over Time

The second shortcoming is that there are lots of reasons why data outputs might change, especially in the world of Big Data. Sales data, for example, might change because of seasonality, sales promotions, breaking news stories, compliance deadlines, or even weather events that disrupt supply chains or trigger changes in consumer behavior.

To analyze these changes, you need to understand context. Context requires understanding changes over time. Is a surge in data volumes the last week of March a cause for alarm? Or is it tied to economic activity at the end of the quarter? To answer that question, it’s useful to compare that week’s data to data at the end of March a year ago. Or data at the end of December. Or even data at the end of February.

In other words, observability needs to take into account a system’s history. It needs to support historical comparisons, so that, for example, a data team can easily compare one day’s data to another, or one hour’s data to another.
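As a rough illustration of what a history-aware check might look like, the sketch below compares one calendar week’s record count against the same week in earlier years before raising an alarm. The weekly counts are made up for the example; a real pipeline would pull them from ingestion logs or metadata rather than a hard-coded dict.

```python
from statistics import mean

# Hypothetical record counts for the last week of March across several years.
weekly_counts = {
    ("2019", "W13"): 118_000,
    ("2020", "W13"): 131_000,
    ("2021", "W13"): 296_000,  # the week we are evaluating
}

history = [count for (year, week), count in weekly_counts.items()
           if week == "W13" and year != "2021"]
current = weekly_counts[("2021", "W13")]

baseline = mean(history)
change = (current - baseline) / baseline

# A naive threshold; with context (quarter-end happens every year),
# this surge may turn out to be perfectly normal.
if abs(change) > 0.5:
    print(f"volume is {change:+.0%} vs. the same week in prior years -- investigate")
else:
    print("volume is within the historical range for this week")
```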

This support for historical comparison needs to be built into the observability solution from the ground up. You can’t just tack something like this on and hope to have any effective way of comparing one massive data set to another. You’ll end up with hugely inefficient tools and processes. You need to think about this problem from the beginning.

Data Culpa Validator Gets Observability Right

That’s what we’ve done at Data Culpa. We’ve designed our data observability platform, Data Culpa Validator, to take into account the concept of time. We make it easy for data teams to compare one batch of data to another, one data stream to that stream from an earlier time period, and one data set to a data set selected by the data team as the “gold standard” for the application.

You need to understand what’s happening with your data. That means understanding that data wherever it comes from — an external data feed or internal data source. It also means understanding how much it varies from what you’re expecting, given the full context of your data application or use case.
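To illustrate the kind of comparison we mean, here is a toy sketch that measures how far a new batch’s value distribution drifts from a “gold standard” reference. This is not the Validator API, just a plain-Python illustration of the idea, and all of the values in it are invented.

```python
from collections import Counter

def histogram(values, bins=5, lo=0.0, hi=100.0):
    """Bucket values into equal-width bins and return normalized frequencies."""
    width = (hi - lo) / bins
    counts = Counter(min(int((v - lo) // width), bins - 1) for v in values)
    total = len(values)
    return [counts.get(b, 0) / total for b in range(bins)]

def total_variation(p, q):
    """Total variation distance between two distributions (0 = identical, 1 = disjoint)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

gold_standard = [12.0, 15.5, 14.2, 16.8, 13.9, 15.1]  # made-up reference values
new_batch     = [41.0, 44.5, 39.8, 42.3, 40.7, 43.2]  # made-up incoming values

distance = total_variation(histogram(gold_standard), histogram(new_batch))
print(f"distribution shift vs. gold standard: {distance:.2f}")
```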

We make those comparisons quick, easy, and powerful with Validator.

Try Validator today, and see for yourself.