Data quality is an ongoing challenge for data scientists and data engineers. In this interview, Michal Klos shares the insights he’s gleaned from a long career building data pipelines and ensuring they deliver high-quality results. Michal is Sr. Director of Engineering at Indigo Ag, where he is helping the fight against climate change. Michal has been leading data engineering and platform teams for nearly two decades. Before joining Indigo Ag, he built and scaled massive data platforms at multiple companies, including Localytics, Compete, and WB Games.
Data Culpa: Tell us about your background and experience in data science.
Michal Klos: I started my career doing retail analytics. I was never officially a data scientist. I called myself a “proto-data scientist.” We had a ton of customer data and transactional data for major retailers like Toys”R”Us, Babies”R”Us, PetSmart, Lord & Taylor, and many other national retail brands. I spent the first five years of my career hacking at that data for analytics. That was really the basis of a lot of my experience as far as data analytics and data science goes.
Then as my career progressed, I was much more attracted to the data engineering and the platform work, so I moved more towards working on the backend. At this point in my career, I wasn’t necessarily doing the analytics or the analysis itself. I was working to create platforms that allow other people to do that work.
“As my career progressed, I was much more attracted to the data engineering and the platform work.”
From there, I worked at Compete.com, which was a massive internet research company. We had a panel of two million people who clicked. We stored every single click they made, and we basically constructed a model of the internet based on that data.
After that, I worked at Warner Bros. Games, where I built a platform for doing game analytics. After that I went to Hadapt, which was a database startup that was acquired by Teradata and is now spiritually a new startup, Starburst Data.
And after that, I went to Localytics, which was really my experience of really big data, and really fast data. We had terabytes of data coming each day from 3.5 billion mobile devices worldwide. We had petabytes and petabytes of data, and we were serving real-time analytics to our customers. That was a really great experience.
And then here at Indigo, we have a pretty large data lake that is supporting machine learning and back-office analytics.
Data Engineering vs. Data Analytics
Data Culpa: What is it that you like so much about data engineering as opposed to data analytics?
Michal Klos: One thing to understand about data science and analytics is that actually a lot of the work is manual. When you’re working on discovery and trying to figure out how to solve a problem, the work is really interesting, and you’re kind of hacking around, you’re trying different things, you’re trying to find how to solve this particular problem, how to come up with a meaningful analysis.
“One thing to understand about data science and analytics is that actually a lot of the work is manual.”
But then there’s the second phase of the project where you’re trying to put it into production. And this is often where the engineering team is responsible for making sure that that machine learning model, those analytics are repeatable and they are reliable, and that the quality can be maintained. So, at some point, in a lot of these projects there’s a handoff. And I really like the idea of scaling the analysis that I’ve done.
So, for example, at my first company, Harte Hanks, I did a whole bunch of analytics for our primary client. But then I kept thinking, “Well, how do we get the same analytics for the rest of our clients?” Because, from one retailer to another, you’re still interested in average basket size, average transaction size, churn, how are their marketing campaigns working, what’s the ROI, what’s the lift?
All those analytics are transferable, but you have to get into a second problem which is, how do you make a generic platform that’s reliable and that can take in this different type of data and still produce a good result?
I really became attracted to that problem. That’s why I kind of started moving more towards the backend.
Data Quality Challenges
Data Culpa: In that work of making something repeatable and scalable, obviously data quality comes up. Were you seeing data quality problems even back in those days?
Michal Klos: Absolutely. The only constant in the world is change, and code is changing all the time. I read a study that showed that 99% of production issues were due to code changes. But here’s the thing: even without any code changes on your part, if you’re taking in data from somewhere outside of your system, there are going to be changes by others, and you can’t control all of the variables that can lead to problems.
“The only constant in the world is change, and code is changing all the time.”
Even if you’re not making any code changes in your platform, errors can still be introduced, and that can be really hard to track down and fix. There are also external changes outside of your control that are happening. And the combination of those two factors means that the data quality is always in danger.
Data Culpa: How are you addressing data quality in your current job?
Michal Klos: Like many companies, we have some home-grown tooling for monitoring data quality. I believe that’s how most people are solving the problem. They’re solving the problem through introducing checks and validations in their code, or through metric monitoring tools. So, there are many ways to solve for data quality, but either people are not doing it, or they’ve stitched together their own way to do it.
Data Culpa: Do you find any downsides to stitching together your own data quality solution? Do you find that it takes a lot of time and pulls you away from the other data engineering stuff you want to do? Is data quality a pain, and if it is a pain, how big a pain is it?
Michal Klos: Well, especially for someone like myself, I’ve been working on data platforms that produce analytics for my entire career. So having to solve the problem at every single place that I’ve been is really kind of annoying.
And the thing is even if you were working on one platform for one decade or two decades, at some point you need to refactor the code, you need to refactor the platform, and you need to make it scale. And that may mean that how you’re doing data quality also needs to change.
I think there’s a lot of value in the idea that there can be a third-party provider that can solve this problem continuously. One of the things I look at is, in the space of application monitoring, not only are there many mature vendors, but there are mature open source frameworks as well. In my experience I really like the ease of just plugging in and playing with a vendor as long as it’s not cost-prohibitive.
“I think there’s a lot of value in the idea that there can be a third-party provider that can solve this problem continuously.”
There’s a large class of problems every business has to solve, but you really don’t want to take the time to try to solve them yourself, because it would distract you from your core objectives. These are critical problems. They’re also ugly, time-consuming problems. And data quality is one of these problems.
Data Quality Case Studies
Data Culpa: Do you have any data quality horror stories you’d like to share?
Michal Klos: I think the worst problems are the ones that you don’t realize are happening, because there’s any number of issues that are easy to alert on. I mean, a lot of platform teams are trying to check for schema violations, because somewhere along the way in your pipeline, you’re probably expecting certain data elements to be there or not be there. I think with the prevalence of semi-structured data those violations are becoming harder to detect, though, because schemas are now wide open.
But the really, really hard problems are when the data has changed, but the data still fits the schema. And sometimes the change might seem really small yet be critical.
Here’s an example I can pull from my past. I remember one case where we had a foreign exchange feed, and we were multiplying quantities by the rates from this feed for our billing. It turned out that the feed broke in a sneaky way, and we were unaware. It was no longer updating, but everything seemed to be fine. The billing was still coming through; it’s just that the exchange rates were really stale at that point. It’s one of these nefarious things, because the exchange rates really don’t fluctuate that dramatically most of the time, but over time they do drift and next thing you know, you may have underbilled or overbilled by a couple million dollars.
“It turned out the feed broke in a sneaky way . . . and the next thing you know, you may have underbilled or overbilled by a couple million dollars.”
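A stale feed like the one in this story still passes schema validation, so one way to catch it is to check freshness directly. The following is only a minimal sketch, not the tooling discussed in the interview. It assumes the feed exposes a timezone-aware timestamp for its most recent update and that recent rates are available as a list; the function names `is_feed_stale` and `unchanged_run_length` and the 24-hour limit are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_feed_stale(last_updated, max_age, now=None):
    """Return True if the feed's latest update is older than max_age.

    Assumes last_updated is a timezone-aware datetime (hypothetical feed field).
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_updated > max_age


def unchanged_run_length(rates):
    """Count how many of the most recent observations equal the latest value.

    A long run of identical exchange rates can hint that a feed has stopped
    updating even when a timestamp is missing or unreliable.
    """
    if not rates:
        return 0
    latest = rates[-1]
    run = 0
    for value in reversed(rates):
        if value != latest:
            break
        run += 1
    return run


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    last_update = datetime.now(timezone.utc) - timedelta(days=3)
    if is_feed_stale(last_update, max_age=timedelta(hours=24)):
        print("Exchange rate feed has not updated in over 24 hours")
```

Either check is cheap to run before each billing job. The right threshold depends on how often the feed is expected to change, which is an assumption that has to come from whoever owns the feed.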
Another problem I’ve been bitten by multiple times is the shape of the data changing, and it’s because of an issue that you’re unaware of or an issue with one of your data providers.
In these situations, it’s really hard to know if the data is legitimately drifting, or if it’s changing in really subtle ways in error – either way this can add up to be a major problem. If the data stops flowing at all or the schema changes, those problems are traditionally much easier to detect with validations. But problems within the data itself are really difficult to detect, especially if that data is coming from an external data feed or some other third party outside of your control.
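To illustrate the difference between a schema check and a check on the data itself, here is a rough sketch that compares the mean of a new batch against a trusted baseline. It is an assumption-laden example, not a method described in the interview: the function names and the threshold of 3 standard errors are hypothetical, and a flag only signals that a person should look, since it cannot tell legitimate drift from an error.

```python
import statistics


def mean_shift_zscore(baseline, batch):
    """Return how many standard errors the batch mean sits from the baseline mean."""
    if not batch:
        raise ValueError("batch is empty")
    base_mean = statistics.fmean(baseline)
    base_std = statistics.stdev(baseline)  # sample standard deviation
    if base_std == 0:
        raise ValueError("baseline has no variance")
    standard_error = base_std / len(batch) ** 0.5
    return (statistics.fmean(batch) - base_mean) / standard_error


def looks_shifted(baseline, batch, threshold=3.0):
    """Flag a batch whose mean is more than `threshold` standard errors away."""
    return abs(mean_shift_zscore(baseline, batch)) > threshold


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    history = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10]
    todays_batch = [15, 16, 14, 15]
    if looks_shifted(history, todays_batch):
        print("Batch mean differs sharply from baseline; review the feed")
```

A comparison of means is the simplest possible choice. It will miss changes that leave the average intact, such as a shift in variance or in the mix of categories, so real monitoring usually tracks several statistics per field.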
Data Quality and Third-Party Data Sources
Some companies don’t use third-party data providers at all – they’re in full control of the data in their pipelines. But using outside data sources is very common. For example, a lot of companies pull in data from Salesforce or HubSpot. In my current job, we’re integrating over 100 data sources. A lot of our data is provided by third parties.
A lot of companies are now trying to focus more on machine learning, and for machine learning you need to feed it data, whether it’s, let’s just say, retail location data, weather data, or financial market data, and typically that data comes from third-party sources. So, from my experience, most companies today are ingesting a lot of data from external sources. And that creates new risks for data quality errors.
The bottom line is, data quality is a problem that we all need to address if we’re going to make our platforms scale and our investments in data science pay off.
End of Part 1. Read more of our interview with Michal here.