In this blog, we’ll discuss all things data quality and data observability. We’ll present interviews with leading data scientists, and we’ll discuss our own work building data observability solutions for data engineers and data scientists working in a broad range of industries.

Data Coverage and Other Pillars of Data Science: An Interview with Gordon Wong, Part 2

by | Sep 28, 2020 | Data Culpa, Interview

Gordon Wong has been working with data since the early 1990s. His accomplishments include building Fitbit’s data analytics platform from the ground up and serving as HubSpot’s VP of Business Intelligence. He has also worked for or consulted with leading brands such as ezCater, edX, and Zipcar. In all his engagements, Gordon focuses on repeatable processes, continuous insights, and team building. In Part 2 of our interview, we spoke to Gordon about data coverage and other pillars of data science, and the importance of testing for data quality. You can read Part 1 of our interview here.

Data Culpa: Can you tell us a little more about your approach to setting up a new data science project?

Gordon Wong:  Yes. I have five pillars in mind for any data analytics platform. And they are, in order, security, data quality, reliability, user experience, and finally data coverage.

The Five Pillars of Data Science are Security, Data Quality, Reliability, User Experience, and Data Coverage.

I put security first, because you’ve got to protect your customers and yourself. But right after that is data quality. If your data is not reliable and safe, then the results are not reliable and safe. Not only will no one use your system, no one should use your system. We all take data quality for granted, but that’s incredibly dangerous.

I won’t name any names, but I know of a very large company that runs very large Google SEM campaigns, that is, search engine marketing campaigns. They enabled a Google auto-bidder, and in beta it worked great, and they were making more money and better returns on their ad spend and their campaigns. But in production, they started introducing bad data to the algorithm by accident, because they didn’t have the right tests. It went so badly they missed their quarter. And this was not a small company. This is a top 20 buyer of ad spend. Bad data quality matters.

I equate data quality tests to smoke detectors. What’s the value of a smoke detector? It’s a hard sell sometimes when people say, “I don’t have a smoke detector, and nothing’s happened, so I don’t really need one.” They might think that until it’s too late.

“I equate data quality tests to smoke detectors.”

Next is system reliability. Is the warehouse loaded on time? Is there a lag in the data? Reliability means having good answers for questions like these.

User experience, because I think that analytics is an inherently creative endeavor, and we talked about that question of evolution a little bit earlier in our discussion.

When an artist or someone is engaged in a creative process, if they struggle with their tools, it tamps down the creativity. The manual focus isn’t working on your camera. You don’t have the proper lenses. Your brushes are all dried out. As a cook, your knives are dull. It affects everything. So, I believe that if I build a better user experience for users, they will spin the analytics flywheel a little faster, get more output, and get better question evolution. Over time, you should get compounding effects. If you can answer questions faster and your questions are getting better, that should give you some really good returns.

Being Smart about Data Coverage

Data Culpa:  And last pillar was—

Gordon Wong:  Data coverage. What data is in the system? So, funny enough, I put that last. And sometimes when I tell users that, people say, “Data coverage is the last thing? Why?” I’m like, yeah, because if we are thinking with agility, we want to actually have the minimum data necessary to answer your question, not the maximum amount. And the reason I say that is because there’s so much data.

“I put data coverage last . . . because if we are thinking with agility, we want to actually have the minimum data necessary to answer your question, not the maximum amount.”

I’ll use Fitbit as an example. We had a petabyte of data, and I could say to the users, “Here, give me a year to take care of this. Let me load all of this, and then I’ll start answering questions.” And if they’re smart, they should say, “How about a sample? Because we don’t want to wait a year.”

Data Culpa:  You’re asking, “What are we trying to solve?”, and then saying, “Let’s figure out how much data we need to solve that.”

Gordon Wong:  Right. How about 1%, or even 1/10 of 1%, as a cohort for last week? Then I could say, “We can get that in a couple of days.” As opposed to saying, “Let’s wait a year and let me invent some new technology to handle this much data.”

Data Quality: Testing and Monitoring

Data Culpa:  So far, you’ve talked about why data quality is important, and you’ve talked about the idea of testing and taking a very focused approach to inquiries. I assume that when you’re working on a project, you’re building tests for quality, along with tests for reliability and all the other pillars in your approach. Can you talk about what works and what doesn’t work in data tests, given the current state of technology?

Gordon Wong:  Sure. So, it turns out that the solution I build the most in analytics is still data warehousing. It seems shocking to me that 30 years later, we’re still doing data warehousing, but so far, it’s still a practical solution. The trick to data warehousing has always been about being a reductionist and about doing the basics well. Because the scale will kill you otherwise, if you don’t do the basics well.

And so, when it comes to data quality, I divide the work into two areas: testing and monitoring. And my distinction is, testing asks, did I get the right outcome? That’s my final quality check. And if I only have enough time and resources to do one thing, I test the outcome, the very end.

It doesn’t tell you where something went wrong, but it at least tells you whether or not the product is usable for an end user. And you have to think like a user.

Monitoring is more situational. It’s about state, right? I’m monitoring a pipeline, so now let me check the different phases along the way.

There are two different audiences for this. Testing is for your product manager and your end user. The product manager feels comfortable giving this to end users, and end users feel comfortable using the outcome.

Monitoring is for your developers. It lets you develop more quickly, helps you find a problem, helps you fix a problem. It also helps you learn about your data. So, two different things. And so, for testing, I actually have a very simple system that is intentionally limited, because it forces you to think in pass-fail terms.

“Testing is for your product manager and your end user. . . . Monitoring is for your developers.”

I do it all in SQL, so it’s all in a database, and I basically write tests that are pass-fail. Is this table unique? Does this table have parents for all the children? Is the sum of this column above a certain number or a certain percentage, based on something else? I make all of these true-or-false tests.

And the nice thing about that approach is that it forces you to think a little bit about what’s passing and failing, which is important. As a team leader, I know that 90% of the time my engineers don’t have enough context, and I want them to develop more context. They’ll say, “Well, I can’t write a true-or-false test for this.” And I say, “Well, then you don’t know the data well enough. Why don’t you go talk to the user and try to see if we can come up with what’s a pass-fail for this?” And then it causes them to think.
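As a minimal sketch of the pass-fail SQL tests Gordon describes, here is what uniqueness, parent-child, and threshold checks can look like. This assumes a SQLite database with hypothetical `orders` and `customers` tables; the table names, column names, and threshold are illustrative, not from the interview.

```python
import sqlite3

# Each test is a SQL query that returns exactly one value: 1 (pass) or 0 (fail).
# Forcing every check into a true-or-false shape is the point of the system.
TESTS = {
    # Uniqueness: no duplicate primary keys in orders.
    "orders_unique":
        "SELECT COUNT(*) = COUNT(DISTINCT order_id) FROM orders",
    # Parent-child integrity: every order belongs to an existing customer.
    "orders_have_parents": """
        SELECT NOT EXISTS (
            SELECT 1 FROM orders o
            LEFT JOIN customers c ON o.customer_id = c.customer_id
            WHERE c.customer_id IS NULL
        )""",
    # Threshold: total order volume is above an expected floor.
    "amount_floor":
        "SELECT COALESCE(SUM(amount), 0) > 100 FROM orders",
}

def run_tests(conn: sqlite3.Connection) -> dict:
    """Run every test and return {test_name: True/False}."""
    return {
        name: bool(conn.execute(sql).fetchone()[0])
        for name, sql in TESTS.items()
    }
```

Because every test collapses to a single boolean, scheduling, storing, and alerting on the results stays trivial.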

The Power of Simplicity

But the other great thing about this is, it’s just so simple. It is very, very easy to schedule a bunch of pass-fail tests. It’s very, very easy to store the results to a table, and have a fact table that has–here’s the test ID, here’s the runtime, here’s the results; and then now have an asynchronous store of all my test results. And it’s very easy to take any BI tool in the world–Tableau, Looker, whatever–and build a dashboard off my test table.
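The results fact table Gordon outlines (test ID, run time, result) can be sketched roughly as follows, again assuming SQLite and pass-fail SQL tests like the ones he describes; all table and column names here are hypothetical.

```python
import sqlite3
import time

def ensure_results_table(conn: sqlite3.Connection) -> None:
    # Fact table: one row per test execution, append-only.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS test_results (
            test_id  TEXT NOT NULL,   -- which test ran
            run_time REAL NOT NULL,   -- Unix timestamp of the run
            passed   INTEGER NOT NULL -- 1 = pass, 0 = fail
        )""")

def run_and_record(conn: sqlite3.Connection, tests: dict) -> None:
    """Run each pass-fail SQL test and append its result to the fact table."""
    ensure_results_table(conn)
    now = time.time()
    for test_id, sql in tests.items():
        passed = bool(conn.execute(sql).fetchone()[0])
        conn.execute(
            "INSERT INTO test_results (test_id, run_time, passed) VALUES (?, ?, ?)",
            (test_id, now, int(passed)),
        )
    conn.commit()
```

Any BI tool can then point a dashboard straight at `test_results`, and a scheduler handles the rest.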

So as analytics people, we’re eating our own dog food. We’re generating analytics data marts for testing. Of course, that runs into limitations at some point. But I guarantee you in the beginning, that will get you somewhere.

“As analytics people, we’re eating our own dog food. We’re generating analytics data marts for testing.”

And it allows you to do things like oh, I know that this one source occasionally produces bad data. Let me write a test for that. And just toss it into the system and let it be scheduled. And then six months later, the test fails. And you get an alert and you’re going, what was this test again? And you go, oh, I’m a genius because I wrote this thing six months ago that I would never have caught otherwise, right?

Data Culpa:  Right.

Gordon Wong: None of this is AI. None of that is being particularly clever. It’s just all being very, very reductionist. Make it bulletproof. But that doesn’t help you with monitoring.

Data Culpa:  Are you running these tests as a batch process? Because it seems like monitoring, you’re probably running at a different interval than you’re running your end-result quality tests.

Gordon Wong:  Yeah. I’ve had to rebuild the same system multiple times now, with a slightly different technology each time, because the commercial products have not been very good. So, I’ll say to my team, okay, again, if we have one table, is it unique? And they’ll say, I don’t know. Let me go write a subquery. And they write the right query and they’re like, yes, it is unique. All right, could you take that query and put it into a test? And then, okay, they’ll do that.

And then, of course, they’re manually running that query every once in a while. I’m like, could you find a way to schedule that test, please? And, of course, they’ll use cron at first or something like that. And at some point, someone will say, oh, we need something that can run SQL. I’m like, that’s a good idea. Why don’t you either go find something or write something to run SQL. You know, arbitrary SQL. And let’s store those results to a table. So inevitably, you end up building a little bit of software.

And then you use something like Airflow or Jenkins to start scheduling and doing orchestration. And you get into questions like time-based initiation versus event-based initiation for the tests. And so, it feels a little bit repetitive for me, because I’m having the same conversation each time with teams when I go to a new place.

But I’m always taking software engineers and trying to turn them into data engineers, and there’s a big, big difference. There’s a huge difference.

Software Engineers vs Data Engineers

Data Culpa:  Yeah. But what to you is that difference? How would you describe the difference between software engineers and data engineers?

Gordon Wong:  I think it goes back to putting the technology in focus versus putting just the problem in focus. I mean, software engineers tend to be trained to solve a particular kind of business problem that can be solved with procedural code. But in the database world, in the analytics world, the focus is really the data set. And what we’re really trying to do is transform datasets into metrics.

“Software engineers tend to be trained to solve a particular kind of business problem that can be solved with procedural code. But in the database world, in the analytics world, the focus is really the data set.”

And so that whole pipeline-method thinking, that whole data-set thinking is not natural for a lot of software engineers. It takes some transition time. And they don’t realize that – more in this field than many other fields – with everything you build, you have to maintain. Again, it’s a continuous process.

So, a furniture maker might build a chair and offer you a warranty, but they may never see that customer again. Whereas a farmer has to continuously deliver product and has to maintain that relationship. And so, we’re much more like farmers or people in that kind of continuous delivery mode, as opposed to a one-and-done kind of situation.

Data Culpa:  Right. The software engineer is more like the furniture maker who says, “You asked me to build some code; I built some code. Here’s the code.” Whereas data engineers will say, “We’re working with this data, and the business side of the house is going to ask different questions all the time. That’s our job.”

Gordon Wong:  That’s right, that’s right. Yeah, and I mentioned I’m from a restaurant family. And I actually use the restaurant industry quite a bit as metaphors when working with teams and building processes. Because I think that restaurants are the single best and most accessible example of good process. No restaurant can be successful without good process, whether it be The French Laundry or McDonald’s. You know, look at McDonald’s. They are not successful on the basis of their recipes. You know, they don’t have the best burgers. They are successful based on their process.

End of Part 2.

Read Part 1 here.

Learn about Data Culpa’s solution for data quality intelligence.
