Data engineering teams need to be able to find and fix data quality problems quickly. In part 2 of our interview with Michal Klos, Michal discusses the challenges of discovering and troubleshooting data quality problems in pipelines. Michal is Sr. Director of Engineering at Indigo Ag, where he is helping the fight against climate change. Michal has been leading data engineering and platform teams for nearly two decades. Before joining Indigo Ag, he built and scaled massive data platforms at multiple companies, including Localytics, Compete, and WB Games. You’ll find Part 1 of our interview with Michal here.
Tracking Down the Source of Data Quality Problems
Data Culpa: When you have a data quality problem – such as the problem you described earlier with a foreign currency data feed having stale currency exchange rate data – what’s your process for discovering the problem and fixing it?
Michal Klos: With a lot of these problems, they’re ultimately caught by users who understand what the data should be. So, whether it’s someone in finance or someone who’s an expert on the data from the data science team, you need a domain expert to say, “Hey, that doesn’t look right.”
For example, with that exchange rate problem, it was someone on the finance team who happened to know the current exchange rate for, let’s say, Colombia, and who spoke up and said, “Hey, this financial statement doesn’t look right for those users.”
That’s the hard part. You’re relying on downstream users to find the problem. The data engineers are often responsible for getting the data into a pipeline, but they don’t have the domain expertise to really know what the data should be. The problem has to go downstream to domain experts, and even those users may or may not notice the problem right away because these are really difficult problems to identify.
“That’s the hard part. You’re relying on downstream users to find the problem.”
Implementing a Machine Learning Model without Domain Expertise
You have to adopt a practice of looking very closely at the data for exceptions when you’re doing that data exploration. For example, let’s say there’s a data engineering team, and a data science team that has built a machine learning model. A lot of times, the machine learning model is implemented by data science engineers on a machine learning engineering team. And in many cases, the machine learning team doesn’t necessarily have domain expertise in the data.
They’re making that model operational, and once the model’s running, they’re considering it a solved problem. But they may not notice that the model’s outcomes are changing slowly over time due to data quality issues. What you really need is for the data science team with the domain expertise to have a reason to go and look at that data again, but as often as not, they may have moved on to something else.
Let’s say you’re trying to think about data quality problems proactively. Assuming you have a framework in place for monitoring the actual values of the data, someone needs to provide the ranges of what to look for. That turns out to be a whole other ordeal. What are the thresholds on which you’re alerting? It’s a hard problem. If you configure alerts based on statistical changes – standard deviations, for example – you can do it without any user interaction, and it’s relatively easy to set up, but it only catches one class of problems.
“If you configure alerts based on statistical changes . . . it’s relatively easy to set up, but it only catches one class of problems.”
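To make the standard-deviation approach concrete, here is a minimal sketch in Python of the kind of check Michal describes; the function and metric names are illustrative, not taken from any particular tool.

```python
# Minimal sketch: flag a metric that lands far outside its historical
# distribution. Names and thresholds are illustrative.
import statistics

def deviates_from_history(history: list[float], latest: float,
                          threshold: float = 3.0) -> bool:
    """Return True if `latest` is more than `threshold` standard
    deviations away from the historical mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # A historically constant metric: any change at all is suspicious.
        return latest != mean
    return abs(latest - mean) / stdev > threshold

# Example: daily row counts for a feed; today's load looks abnormally small.
daily_row_counts = [10120.0, 9980.0, 10340.0, 10055.0, 10210.0]
if deviates_from_history(daily_row_counts, latest=2400.0):
    print("ALERT: row count deviates sharply from its history")
```

As Michal notes, a check like this needs essentially no user interaction to set up, but it only catches the class of problems that show up as statistical outliers.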
Data Culpa: Would a plausible solution be to start off by saying, “Hey, we’re going to do standard deviations to check changes?” That gives us a baseline, and then we can sit down with the data science team and confirm that the approach is OK.
Michal Klos: Yes, and then make adjustments as needed. I agree that would be a good first line of defense. If there were a framework that you could just turn on, you could have it check for any changes, and then over time you could learn what’s okay and what’s not okay.
Data Culpa: And when you have that kind of problem with values changing, I imagine the time to troubleshoot can vary, just like with any software problem. Is this something that can sometimes take days? How hard is it to track down these problems in your experience?
Michal Klos: I think once someone notices that there is a problem, getting to the root cause normally doesn’t take long. I think the hard part is actually the recognition that there is something wrong. If someone says, “There’s something wrong,” you can trace it through and it could take from an hour to maybe two days. But the hard part is just recognizing that there is a problem in the first place.
Why Alerting Is Critical
Data Culpa: It sounds like alerting is what’s really important here.
Michal Klos: Yes, alerting is critical. When it comes to dashboards or alerting, alerting is just so much more valuable, because you might have someone checking a dashboard once in a while, but I think a lot of times when people have dashboards, it’s like a vanity dashboard. When you see the pictures of company offices and they have dashboards all over the place, that’s not for the engineers; that’s for executives walking by. The engineers are relying on alerts.
“When it comes to dashboards or alerting, alerting is just so much more valuable, because you might have someone checking a dashboard once in a while, but a lot of times . . . it’s like a vanity dashboard.”
Data Culpa: So, what you would like is an alert that says, “This data feed from this vendor has—seems to have a problem, either the schema’s changed—”
Michal Klos: Yeah, let’s just get a notification saying, “This value has drifted, is this okay?” Or even, “Hey, this value has actually not changed at all, and historically it has.” If a numerical value has never changed, you kind of wonder whether it should be changing – for example, in the case of the foreign exchange table, where the data should have been changing. The question then is, “Is this acceptable behavior for this data or not?”
That’s where we can call upon the domain expert, who might say, “You know, actually in this case, it is acceptable behavior.” And then the system can learn, and the next time it can basically evolve.
I think something like that would be good. And to be honest, in my career, I’ve never seen anything that sophisticated. The data quality solutions that we’ve cooked up in my roles have been rather crude.
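The “value has stopped changing” case Michal mentions can be sketched the same way: compare a recent window against history and alert when a value that used to vary has frozen. This is a hypothetical illustration, not an existing tool.

```python
# Sketch of a staleness check: a value that historically varied but has
# not changed in the recent window is worth an "is this okay?" alert.
def historically_varied(history: list[float]) -> bool:
    return len(set(history)) > 1

def looks_stale(history: list[float], recent: list[float]) -> bool:
    return historically_varied(history) and len(set(recent)) == 1

# Example: a daily exchange rate that has been frozen for the last five loads.
past_rates = [3890.4, 3901.2, 3912.8, 3897.5, 3905.1]
recent_rates = [3905.1, 3905.1, 3905.1, 3905.1, 3905.1]
if looks_stale(past_rates, recent_rates):
    print("ALERT: exchange rate has not changed in 5 loads - is this feed stale?")
```

The answer to that alert – “yes, this is acceptable” or “no, it isn’t” – is exactly the domain-expert feedback Michal says a more sophisticated system could learn from.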
Building Data Quality Tools into a Data Pipeline
Data Culpa: And when you build those tools in-house, what are you building them in, typically?
Michal Klos: It depends. Normally we start within the data pipeline code itself. That’s the first line of defense, and what I’ve done there is publish metrics on the data or check the state of the data. Publishing metrics on the data to a third-party monitoring tool is a solid start. What’s nice about those monitoring tools is that you can actually configure alerts on a time series of the metrics. Because what we’re talking about here often is not a simple question like, “Hey, is this one individual value OK?” Instead, we’re asking, “What is the trend, or what is the shape of the data over time?” That’s why you need to publish metrics and then have an alert, or have a tool that can alert on time-series data.
“Publishing metrics to a third-party monitoring tool is a solid start.”
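As one way to picture this first line of defense, the sketch below publishes a few per-load metrics from pipeline code to a StatsD-compatible monitoring tool, assuming the Python `statsd` client; the metric and column names are made up for the example, and the monitoring tool would own the time-series alerting.

```python
# Sketch: publish per-load data-quality metrics so a monitoring tool can
# alert on their time series. Metric and column names are illustrative.
import pandas as pd
from statsd import StatsClient

metrics = StatsClient(host="localhost", port=8125, prefix="pipeline.fx_rates")

def publish_quality_metrics(df: pd.DataFrame) -> None:
    metrics.gauge("row_count", len(df))
    metrics.gauge("rate.null_fraction", float(df["rate"].isna().mean()))
    metrics.gauge("rate.mean", float(df["rate"].mean()))
```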
The second line of defense is once the data has landed in your core data repository, whether it’s a data warehouse or a data lake. Then you can have a scheduled job that runs every day and just checks on the state of the data, doing huge batch queries that calculate statistics on the data or check for outliers. Those are the two main ways to look for data quality problems.
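A rough sketch of that second line of defense might look like the following: a daily job that profiles yesterday’s load in the warehouse. The table, the columns, and the `run_query` helper are hypothetical placeholders for whatever warehouse client you use.

```python
# Sketch of a scheduled batch check against the warehouse. The SQL, table,
# and the run_query callable are placeholders, not a specific product's API.
PROFILE_SQL = """
SELECT
    COUNT(*)                                      AS row_count,
    AVG(rate)                                     AS mean_rate,
    STDDEV(rate)                                  AS stddev_rate,
    SUM(CASE WHEN rate IS NULL THEN 1 ELSE 0 END) AS null_count
FROM fx_rates
WHERE load_date = CURRENT_DATE - 1
"""

def daily_quality_check(run_query) -> None:
    stats = run_query(PROFILE_SQL)  # assume it returns a dict of column -> value
    if stats["row_count"] == 0:
        raise RuntimeError("No rows loaded yesterday")
    if stats["null_count"] > 0:
        print(f"WARNING: {stats['null_count']} null exchange rates in yesterday's load")
```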
For schema problems, that checking is mostly done at some point in your pipeline. You probably want to force data into a schema if you can, depending on what kind of data you’re working with. In most places I’ve worked, at some point there’s a structured schema you want to enforce. If there’s a violation of the schema, you error right there – that would be the best-case scenario. But I know there are so many cases with semi-structured data where you may never force the data into a schema, in which case you won’t run into the problem until it’s time for “schema-on-read,” when you’re trying to run analysis or trying to use the data. This is bad because you’ve learned about the problem at a very late stage.
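Enforcing a schema at ingest and erroring right there – the best-case scenario Michal describes – could look like the sketch below, using the `jsonschema` package; the record schema itself is illustrative.

```python
# Sketch: validate each incoming record against a required schema and fail
# fast on violations. The schema and field names are illustrative.
from jsonschema import validate, ValidationError

FX_RECORD_SCHEMA = {
    "type": "object",
    "required": ["currency", "rate", "as_of"],
    "properties": {
        "currency": {"type": "string"},
        "rate": {"type": "number"},
        "as_of": {"type": "string"},
    },
}

def ingest(record: dict) -> None:
    try:
        validate(instance=record, schema=FX_RECORD_SCHEMA)
    except ValidationError as err:
        # Failing here means the problem is caught at ingest time rather
        # than much later, at "schema-on-read" time.
        raise RuntimeError(f"Schema violation at ingest: {err.message}") from err
    # ... hand the validated record to the rest of the pipeline ...
```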
Data Quality Challenges with Semi-Structured Data
Data Culpa: Does semi-structured data make this harder?
Michal Klos: Yes, because the idea of semi-structured data is really that you can change the schema at will. But in reality, there are probably certain elements that are required in the schema. For example, let’s just say you have a schema that’s tolerant of a value not even being there. Let’s say there’s a field in your semi-structured data that occurs in 5% of the data. And then over time, that field is now occurring 25% of the time. Is that what should be happening? Why is that happening? Data engineers need to be able to answer these questions.
I can imagine a problem like this being just as difficult as the data shape problem. It’s a variation of the data drift problem – let’s call it the semi-structured schema drift problem. So, let’s just say that field was occurring 5% of the time, but now it drops to 0%. Will you notice? You may, or you may not if your code is too fault tolerant.
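One way to watch for this kind of semi-structured schema drift is to track how often each optional field actually appears per batch and alert when that rate shifts; the sketch below is a hypothetical illustration, and the names and thresholds are made up.

```python
# Sketch: detect "semi-structured schema drift" by tracking how often an
# optional field appears in each batch. Names and thresholds are illustrative.
def field_occurrence(records: list[dict], field: str) -> float:
    """Fraction of records in this batch that contain `field`."""
    if not records:
        return 0.0
    return sum(field in r for r in records) / len(records)

def occurrence_drifted(historical_rate: float, current_rate: float,
                       tolerance: float = 0.04) -> bool:
    return abs(current_rate - historical_rate) > tolerance

# Example: a field that used to appear in ~5% of records now never appears.
if occurrence_drifted(historical_rate=0.05, current_rate=0.0):
    print("ALERT: optional field's occurrence rate has drifted - schema drift?")
```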
As semi-structured data becomes more common, it’s going to be important to be able to address data quality problems like schema drift.
“As semi-structured data becomes more common, it’s going to be important to address data quality problems like schema drift.”