ABOUT THIS BLOG

In this blog, we’ll discuss all things data quality and data observability. We’ll present interviews with leading data scientists, and we’ll discuss our own work building data observability solutions for data engineers and data scientists working in a broad range of industries.

Building Data Science Systems: An Interview with Lauren S. Moores, PhD, Part 1

Aug 21, 2020 | Blog, Data Culpa, Interview

Lauren S. Moores, a.k.a. “the Data Queen,” is Head of Data Sciences Innovation for FL65 and Invaio Sciences, two companies launched by Flagship Pioneering, a venture capital and private equity firm dedicated to creating breakthroughs in human health and sustainability and to building life sciences companies. She is also Chair of the Data Advisory Board for USA for UNHCR, a non-profit organization supporting the UN Refugee Agency, and a Teaching Fellow at the Harvard Business Analytics Program. We spoke to Lauren about building data science systems for companies of all sizes and ensuring those platforms maintain high data quality.

Data Culpa: In a recent Drift podcast, you described yourself as someone who chases data. What got you into data science? When you were first starting out, what led you down this path to chasing data?

Lauren Moores: I’ve always liked the idea of taking information and making sure people understand how to do things. The way to do that has changed dramatically over my lifetime. The introduction of the internet changed everything having to do with the ability to access people, but also the ability to access data that you never had before.

One of my first jobs was at one of the banks in New York City. I was an intern, and one of my colleagues took me under his wing and taught me all the stuff that was happening in their databases, which happened to be SAS databases at that time, and I just started playing with that and with those types of data.

Also, I was an economist. I studied economics and French. And economics is all about numbers and understanding patterns and behaviors.

Building Data Science Systems: From Data Collection to Products and Insights

Data Culpa:  And today, when a company brings you in to set up data systems for them, what’s your approach? How do you go about doing what you do?

Lauren Moores:  That’s a good question, because things have changed over time. I used to be brought in to run data operations and then I’d end up owning analytics and data science. More recently, I’m brought in to run data science from the start, but one way or another, it’s the full sciences piece that I do, which means the full breadth of data collection all the way through to data products or insights.

“Data science . . . means the full breadth of data collection all the way through to data products or insights.”

In order to deliver data science, you need to have the right system in place. A lot of people don’t realize that, because they’re focused entirely on data. Most of data science is, “What’s the data you have?” Eighty percent is figuring out what is the data, how do you find it, how do you label it, and how do you actually create some sort of insight or product from it.

Usually I end up coming in at various stages of a startup or a corporate setting, where you start to try to deliver a solution, either internal or external, and you realize that you don’t have the system flow that you need or you don’t have the data coming in that you need. When I come in I have to figure out, alright, what is it that we’re going to need that we should either build or buy off the shelf or work with somebody else in order to deliver the ultimate end solution. And that could be a data-driven solution, or it could be a product that requires the right data to understand where we are with that product.

Building Data Science Systems with Data Quality

Data Culpa:  When you’re putting together those systems, where does data quality fit in? How do you account for it?

Lauren Moores:  That is an excellent question, because as you know, dirty data in, dirty data out. Essentially, at each piece of your funnel, you need to look at where you might fall down and where you might have inaccurate information flowing through to the next piece. So if, say, you’re licensing data that comes in as a raw feed, whether you’ve built an API or you’re using their API to bring it in, you need to make sure, first of all, that the setup is working correctly, and that you have some sort of analysis on whether you’re getting the daily files you expect, or whether the columns are updating if you’re using columnar data.
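To make this concrete, here is a minimal sketch of the kind of ingest check being described: did today’s file from the feed arrive, and do its columns still match what we expect? The directory path, file naming scheme, and expected column names below are hypothetical, not anything specified in the interview.

```python
from datetime import date
from pathlib import Path

import pandas as pd

# Hypothetical setup: the third-party feed drops one CSV per day, named like
# feed_2020-08-21.csv, and these are the columns we expect to see in it.
FEED_DIR = Path("/data/feeds/licensed_source")
EXPECTED_COLUMNS = {"record_id", "timestamp", "region", "value"}


def check_todays_feed(feed_dir: Path = FEED_DIR) -> list[str]:
    """Return a list of problems with today's feed file (an empty list means OK)."""
    todays_file = feed_dir / f"feed_{date.today().isoformat()}.csv"

    # 1. Did the daily file arrive at all?
    if not todays_file.exists():
        return [f"missing daily file: {todays_file.name}"]

    df = pd.read_csv(todays_file)
    problems = []

    # 2. Did the columns change out from under us?
    missing = EXPECTED_COLUMNS - set(df.columns)
    extra = set(df.columns) - EXPECTED_COLUMNS
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected columns: {sorted(extra)}")

    # 3. Is the file suspiciously empty?
    if len(df) == 0:
        problems.append("file arrived but contains zero rows")

    return problems
```

In practice the returned problems would feed whatever alerting the team already uses; the point is simply that the check runs before anything downstream consumes the data.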

“We have to figure out, is today’s data the way we think it’s going to be.”

We have to figure out, is today’s data the way that you think it’s going to be, and there are different ways to do that. You can either look at the packet, or you can look at the way it’s rendering in whatever visualization you have, or it could be that all of a sudden you’ve got a KPI that just took a dive and has nothing to do with actual behavior, and it’s actually because the data didn’t come in the way you expected it to. So, depending on what you built, you have to make sure that you’re thinking about, you know, is the API working correctly? Is all the data that I expect to get from that party coming in correctly? It doesn’t matter whether it’s third party or internal. Are my KPIs trending in the right direction?

There are other checks that you can put in place, either internally in the system or externally when you’ve got somebody looking at the analytics. I mean, hopefully, days don’t go by before you realize, oh my God, there’s a major change in something that we’re doing, and we just made all our decisions off of something incorrect.

Data Culpa:  When you’re building that data system and you’re checking the APIs to catch those types of changes, are you checking in multiple places?

Lauren Moores:  I think you need to check in different places. I don’t think everybody does. A lot of times, it’s a reactive thing where you say, “Well, this doesn’t make sense. What just happened here?” I mean, I can tell you, in the past, I’ve been in a system where we didn’t have the right controls in place, and it took us six months to find something that affected our predictions. That’s a bad thing, right? A very bad thing. Whereas other times we had already set up a system where we were getting 15 different feeds from a third party, and we were able to check sizes and compare them against our expectations. And if something was off, we could tell that we didn’t get a day’s worth of information. So then the decision would be made: “Okay, are we going to be able to get it and put it back in and rerun models, or do we have to create a proxy for that, or do we need to smooth over that data?” You have to have different actions depending on what you can do with it.
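A rough sketch of the kind of size check described above, assuming a baseline of typical daily row counts has already been built from history; the feed names, counts, and threshold are made up for illustration.

```python
# Hypothetical baseline: typical daily row counts per feed, built from history.
EXPECTED_DAILY_ROWS = {
    "feed_clicks": 1_200_000,
    "feed_orders": 85_000,
    "feed_returns": 4_000,
}

# If a feed delivers less than this fraction of its usual volume, assume we
# are missing part (or all) of a day's worth of data.
MIN_FRACTION = 0.5


def flag_short_feeds(todays_rows: dict[str, int]) -> dict[str, str]:
    """Compare today's row counts against expectations and flag shortfalls."""
    flags = {}
    for feed, expected in EXPECTED_DAILY_ROWS.items():
        actual = todays_rows.get(feed, 0)
        if actual < expected * MIN_FRACTION:
            flags[feed] = (
                f"got {actual} rows, expected around {expected}; "
                "decide whether to backfill and rerun, proxy, or smooth"
            )
    return flags
```

The flag deliberately stops short of an automatic fix, mirroring the point above: whether you backfill and rerun the models, build a proxy, or smooth over the gap depends on what you can actually do with the data.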

Another example is looking at the end outcomes and results. So, if your KPIs take a dive or all of a sudden look really off, like they’ve changed by 30% and you know that you haven’t done anything internally or externally to cause that to happen, then you’d better go look to see what’s driving that change.
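A check like that can be as simple as comparing today’s KPI values to a recent baseline and flagging anything that swings past a threshold. The sketch below just echoes the 30% figure from the interview; the metric name and values in the example are hypothetical.

```python
def kpi_swing_alerts(
    today: dict[str, float],
    baseline: dict[str, float],
    threshold: float = 0.30,  # flag changes larger than 30%
) -> list[str]:
    """Flag KPIs whose value moved more than `threshold` versus the baseline."""
    alerts = []
    for name, base_value in baseline.items():
        if base_value == 0:
            continue  # avoid dividing by zero; handle zero baselines separately
        change = (today.get(name, 0.0) - base_value) / abs(base_value)
        if abs(change) > threshold:
            alerts.append(f"{name} moved {change:+.0%} vs. baseline; check the upstream data")
    return alerts


# Example: a conversion rate dropping from 4% to 2.5% is roughly a -38% swing.
print(kpi_swing_alerts({"conversion_rate": 0.025}, {"conversion_rate": 0.04}))
```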

I have a good friend who was in charge of the data science at a company, and their model started outperforming to a level that was unexpected. And she just said, “You know, I know I’m good, but I’m not that great, so let me go look into the underlying data.” And they were able to figure out that the web data that they were using had completely changed, and that was a result of bot activity and the bot systems that were starting to be created in order to fake behavior.

There’s always something that’s going to create some sort of data signal that changes. And it’s just paying attention to that and realizing that, hey, that doesn’t pass the sniff test. But that sniff test is a data test.

Testing for Data Quality

Data Culpa:  For these tests, are you looking at things like data volume and data values? Or are there things you look at besides saying, “Well, we were expecting this to produce five gigabytes of this type of data with values basically in this range, and if it goes outside of that then we’re going to investigate?”

Lauren Moores: I consider those stats to be the backend data engineering piece; those are stats that we should be paying attention to. But then there are the direct layers, the intermediate information that’s being produced either in business intelligence layers or by the data scientists. And you need data quality controls there, too. Any time a data scientist or an analyst is looking at the data, they should understand the patterns completely. If the patterns have changed, then they’d better question whether or not there’s a reason for that. We need to know why that’s changed, because that could be a very important change in consumer behavior that is going to change the business, or it’s a trend that we want to pick up on and change some feature in the product. Or it could be just a complete anomaly, couldn’t it?

“Any time a data scientist or an analyst is looking at the data, they should understand the patterns completely.”
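One way to put “understanding the patterns” into code is to profile a column’s distribution and compare new data against that profile. The sketch below uses simple summary statistics; the column handling and tolerances (two standard deviations for a mean shift, a five-point jump in null rate) are arbitrary choices, not anything prescribed in the interview.

```python
import pandas as pd


def profile_column(series: pd.Series) -> dict[str, float]:
    """Capture a simple numeric profile of a column to compare against later."""
    return {
        "mean": series.mean(),
        "std": series.std(),
        "p05": series.quantile(0.05),
        "p95": series.quantile(0.95),
        "null_rate": series.isna().mean(),
    }


def drift_warnings(new: pd.Series, baseline: dict[str, float]) -> list[str]:
    """Very rough drift check: has the distribution shifted from its baseline?"""
    warnings = []
    current = profile_column(new)
    # Flag a mean shift bigger than two baseline standard deviations.
    if baseline["std"] > 0 and abs(current["mean"] - baseline["mean"]) > 2 * baseline["std"]:
        warnings.append("mean has shifted well outside the historical range")
    # Flag a jump in missing values.
    if current["null_rate"] > baseline["null_rate"] + 0.05:
        warnings.append("null rate is noticeably higher than usual")
    return warnings
```

Whether a flagged shift is a data problem, a genuine change in consumer behavior, or a one-off anomaly is exactly the judgment call described above; the check only makes sure somebody looks.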

I used to show an old Adobe advertisement in one of my classes where all of a sudden the guy at headquarters says, “Oh my God, we’re getting all these likes. Increase our supply chain! Increase our volume of product!” And then it shows an infant sitting with an iPad, hitting click, click, click, click, click.

There are two things going on there. One, it’s bad data. Two, you’re probably using the wrong KPIs to run your business. But these are the things that you need to think about in terms of data quality.

End of Part 1.

Read Part 2 of our interview here and Part 3 here.

