Recently I was talking to a customer who had used a competitor’s product for data monitoring. It sounded like the product had the users specify parameters about the data; e.g., “this column should never be null.” This is all well and good, except that the product then generated thousands of errors—so many errors that the data team had no way to prioritize them.
The product was reporting variations from an absolute standard. There were just two big problems with that approach. First, there were a ton of variations from this absolute standard in practice, resulting in a ton of error messages. And second, it wasn’t clear that any of these variations mattered. After all, they had likely been occurring before the customer started using the data monitoring product. Was the data team somehow overlooking data performance that spelled catastrophe? Or had their business actually shifted to data performance that didn’t coincide with their original specs?
Why Relative Baselines Are Better Than Absolute Baselines
At Data Culpa, we build a relative baseline on your data over time using our data monitoring platform, Data Culpa Validator. We determine how your data is performing every day, and we alert you if that changes. To start with, we assume the following:
- Your everyday data performance is basically OK.
- You’re delivering reasonably accurate reports, and your financial transactions are closing without irregularities.
- You would know if your data pipelines were already egregiously corrupt.
- What you’re really looking for in data monitoring, at least most of the time, is consistency. Is today’s data consistent with yesterday’s data or some other reasonable benchmark? If so, good. If not, we’ll let you know.
If your data changes or your standards for data change, Data Culpa Validator adjusts its monitoring accordingly. For example, if you aspire to fix the data so that nulls do not appear, over time Data Culpa will adjust the allowed tolerance from where you are today (say, 25% nulls) to where you are in two weeks (15%) and then in four weeks (5%). If you begin to regress from 5% back to 10%, Data Culpa will let you know automatically.
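To make the null-rate example above concrete, here is a minimal sketch of a relative baseline in Python. It is an illustration only, not how Data Culpa Validator is implemented: the class name `RelativeNullBaseline`, the rolling-mean baseline, and the `window` and `tolerance` parameters are all assumptions made for this example.

```python
from collections import deque
from statistics import mean


class RelativeNullBaseline:
    """Hypothetical sketch: compare each day's null rate to the recent norm."""

    def __init__(self, window=7, tolerance=0.03):
        # Keep only the most recent `window` observations, so the baseline
        # follows the data as it improves (or degrades) over time.
        self.history = deque(maxlen=window)
        self.tolerance = tolerance

    def baseline(self):
        """Mean null rate over the recent window, or None before any data."""
        return mean(self.history) if self.history else None

    def observe(self, null_rate):
        """Record today's null rate; return True if it regressed."""
        base = self.baseline()
        # Alert only when today is worse than the recent norm plus tolerance.
        # Improvements (fewer nulls) never alert; they just lower the baseline.
        alert = base is not None and null_rate > base + self.tolerance
        self.history.append(null_rate)
        return alert
```

With `window=3` and `tolerance=0.03`, feeding in null rates that fall from 0.25 to 0.05 produces no alerts, while a later jump back to 0.10 does. An absolute rule such as "this column should never be null" would have alerted on every one of those days.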
We do this for everything, not just nulls: we learn the data flow volume by a few features (day of week, etc.), and we build models for the expected schema and distributions. All you need to do is configure how tolerant you want to be for the different types of conditions and let Data Culpa handle the rest.
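Learning volume by day of week can be sketched the same way. The example below is a simplified illustration under stated assumptions, not the product's actual model: `WeekdayVolumeBaseline`, the per-weekday mean, and the fractional `tolerance` are hypothetical names and choices.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


class WeekdayVolumeBaseline:
    """Hypothetical sketch: learn the expected row count per day of week."""

    def __init__(self, tolerance=0.5):
        # Allowed deviation, as a fraction of the expected volume.
        self.tolerance = tolerance
        # Maps weekday number (Monday == 0) to the row counts seen so far.
        self.counts = defaultdict(list)

    def is_unusual(self, day, row_count):
        """Record a day's volume; return True if it departs from that weekday's norm."""
        past = self.counts[day.weekday()]
        unusual = False
        if past:
            expected = mean(past)
            unusual = abs(row_count - expected) > self.tolerance * expected
        past.append(row_count)
        return unusual


monitor = WeekdayVolumeBaseline(tolerance=0.5)
monitor.is_unusual(date(2024, 1, 1), 1000)  # Monday: first observation
monitor.is_unusual(date(2024, 1, 6), 100)   # Saturday: first observation
monitor.is_unusual(date(2024, 1, 8), 1050)  # Monday: close to the Monday norm
monitor.is_unusual(date(2024, 1, 13), 110)  # Saturday: close to the Saturday norm
```

In this sketch, a volume of about 100 rows is normal on a Saturday but would be flagged on a Monday, because each weekday is compared only against its own history.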
Of course, with our visualizations, you can see your aspirational shortcomings, if you must: we’ll reveal where the nulls, the type changes, and the distribution wanderings are, and you can track them as you work to improve your data products.
But because alerting refers to the relative baseline, you start at a baseline of no errors, and we go from there. That’s a much easier start to working with a new data monitoring system.