Lauren S. Moores, a.k.a. “the Data Queen,” is Head of Data Sciences Innovation for FL65 and Invaio Sciences, two companies launched by Flagship Pioneering, a venture capital and private equity firm dedicated to creating breakthroughs in human health and sustainability and building life sciences companies. She is also Chair of the Data Advisory Board for USA for UNHCR, a non-profit organization supporting the UN Refugee Agency, and a Teaching Fellow at the Harvard Business Analytics Program. We spoke to Lauren about building data science systems for companies of all sizes, ensuring those platforms maintain high data quality, and different roles in data science for engineers and subject matter experts. (Read Part 1 of our interview here.)
Data Culpa: One of the things we’ve talked about with Michal Klos is that there are data scientists who have domain expertise who say, “The data has changed, and we think it’s because of this,” where “this” is something having to do with their domain. And then there’s the data engineering team that’s actually building the pipelines, using Jupyter, Pandas, and tools like that. These two teams have different types of expertise. The data engineering team might have been given instructions on what to build, and once the pipeline is up and running they say, “We built it. Mission accomplished.” But they don’t necessarily have the expertise to be able to say, “That thing we built is now delivering skewed results.”
Lauren Moores: Correct.
Data Culpa: Is that a challenge you’ve seen at a lot of places?
Lauren Moores: Absolutely. And that’s a case where you need a direct integration with the data engineering team to understand how you might be able to scale that gray area between the two types of expertise. In former companies, we’ve had a layer of business intelligence experts who played a role in between data engineering and end users. But that’s not always how it’s set up. It could be a data analytics function. You can have that analytics team responsible for that middle layer, and it’s their job to understand what’s being used in the different business applications, whether it’s straight from Snowflake or AWS, or it’s going through one of the BI tools and then being used by somebody else for modeling. They are not necessarily subject matter experts, but they can trace a problem and fix it and work with the engineers to ensure that those types of errors don’t happen again.
Different Roles in Data Science for Engineers and Subject Matter Experts
You really do have to rely on the subject matter experts to ultimately solve these problems, though. Otherwise you’d have to build a team that was just way too big. So, if you have the relationship with the data science team where they’re reporting back anything that doesn’t make sense, and you’re talking about it together, and you’re bringing that back into all the people who are building and managing the data system, then hopefully we’re able to prevent bad things from happening or quickly fix things when they do occur.
At a former company, we relied on the subject matter experts to define what KPIs should look like or what outcomes should look like, because then they would know if they were working or not. Then, if things didn’t make sense, those experts would come back to us in the data team to look at the data to see if something had changed or was incorrect, or if the true behavior had changed in a way they hadn’t expected and now needed to be taken into account in revising the KPIs.
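Editor's illustration (not from the interview): the workflow described above, where subject matter experts define what a KPI should look like and the data team investigates anything that falls outside it, could be sketched roughly as follows. The KPI names, ranges, and the `find_unexpected` helper are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KpiExpectation:
    """A range that subject matter experts consider normal for one KPI (hypothetical)."""
    name: str
    low: float
    high: float


def find_unexpected(expectations, observed):
    """Return the names of KPIs that are missing or fall outside their expected range."""
    flagged = []
    for exp in expectations:
        value = observed.get(exp.name)
        if value is None or not (exp.low <= value <= exp.high):
            flagged.append(exp.name)
    return flagged


# Illustrative values only; real ranges would come from the domain experts.
expectations = [
    KpiExpectation("conversion_rate", 0.02, 0.05),
    KpiExpectation("daily_active_users", 8000, 12000),
]
observed = {"conversion_rate": 0.031, "daily_active_users": 15250}

flagged = find_unexpected(expectations, observed)
print(flagged)  # ['daily_active_users']
```

A flagged KPI does not say which side is wrong. As the interview notes, either the data changed or the true behavior changed and the expectation itself needs revising.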
Data Culpa: And by true behavior, do you mean, there’s a legitimate reason for the variation that we’re seeing, and so we’re going to adjust our KPI and our model—
Lauren Moores: Yeah, exactly.
Getting Teams to Work Together
Data Culpa: And when you talk about different teams working together, is that just people exchanging Slack messages and comparing notes on data? How does that coordination actually take place today?
Lauren Moores: I think it depends on the organization. I do believe there should be that one person who people go to who knows where all the data is. And it’s a big job. But if they have the right setup, they can be that person who says, “You know, you just asked about this particular question. I’m going to go conduit that to so-and-so.”
“There should be that one person who people go to who knows where all the data is.”
Knowing what the issues are will help you understand if there’s something really much bigger going on. There could be a global problem that’s happening that nobody will recognize, because ten different people answered the question and fixed it, but didn’t realize that there was an underlying issue happening. So, it’s really about creating the atmosphere and adopting as a foundation that data matters. We have a team that is the liaison between all the different groups. Or sometimes you have a data advisory or a data governance board where you’re able to talk about some of the things that are happening across the company, particularly when it’s a much bigger company.
In cases like that, there’s something like six of you. It’s much easier to just figure it out. In parallel with that, what I found is that not only do you have to understand the data, but you have to understand what’s the data being used for and why do you have it. And put it in context to the overall strategy and products that you’re building because, otherwise, you don’t have any context. You don’t necessarily know if something is working the right way, because you don’t know what you’re trying to do with it.
Data Culpa: Right. And part of that could be the result of people just saying like, “Hey, we need to have all this data” or “Look, we’re collecting all this data.” But it’s more than just having data. It’s having a use case or a goal.
Lauren Moores: Use cases matter. And use cases are actually something that evolves over time. You might start off and you have a very open structure, and anybody has access to the data, because usually it’s just the tech side or the data science side that’s interested in the data. And then that changes, and you start opening up to end users, but you really need to think about what do they have access to. Do they really need all access to all that data? What are they really trying to do? What are the decisions that you need to make? I totally believe in having different permission sets for data.
“Use cases matter. . . . They are something that evolves over time.”
Data Access Permissions for Different Roles
You know, Michal and I would talk about shutting everything except for his team and this other team, because end users should not be accessing the raw data. They can get it through a visualization platform. And that allows us to know exactly how people are using things.
Otherwise, what usually happens, particularly on the business side (and this is a huge data quality issue), is that you’ve got people accessing things that haven’t been standardized or defined, or they’re dumping things into an Excel spreadsheet and they’re making changes, and then they’re passing it around and everybody uses that for six months even though the underlying data has changed. That’s a huge problem. It can cause lots of angst and chaos, because you’re not accessing the true foundation anymore, right?
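Editor's illustration (not from the interview): one simple way to notice that a passed-around extract no longer matches its source is to record a content fingerprint when the snapshot is taken and compare it against the source later. This is a minimal sketch under assumed conditions (rows small enough to hash in memory, JSON-serializable values); the `fingerprint` and `is_stale` helpers and the sample rows are hypothetical.

```python
import hashlib
import json


def fingerprint(rows):
    """SHA-256 over a list of dict rows, independent of row order and key order."""
    canonical = sorted(json.dumps(row, sort_keys=True) for row in rows)
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()


def is_stale(snapshot_fingerprint, current_rows):
    """True when the source no longer matches the fingerprint stored with the snapshot."""
    return snapshot_fingerprint != fingerprint(current_rows)


# Illustrative data: the state of the source when someone exported a spreadsheet.
source_at_export = [
    {"region": "east", "revenue": 120},
    {"region": "west", "revenue": 95},
]
snapshot_fp = fingerprint(source_at_export)

# Later, the underlying data is corrected.
source_now = [
    {"region": "east", "revenue": 120},
    {"region": "west", "revenue": 101},
]

print(is_stale(snapshot_fp, source_at_export))  # False
print(is_stale(snapshot_fp, source_now))        # True
```

A check like this only detects that the source has moved on. It does not address the permissions question of who should be exporting raw data in the first place.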
Data Culpa: Right. Somebody has taken some snapshot. For example, somebody has a friend in IT. They go pull data, they put it in a spreadsheet, and they start building models off that. Meanwhile, the data engineering team has actually come up with a more accurate model of what the truth is.
Lauren Moores: Or the data has changed. Particularly on the business side, the KPIs are changing, and then you realize that the marketing material that is being created is based on something that is three months old. You haven’t set up the systems, or you have people who are just used to saying, “Hey, I’m going to put this in an Excel spreadsheet. Oh, and I might even put it up on the platform that nobody knows that I have.” And then if people are using it, it’s like, “Where did that come from?” That creates so much angst if you think about it, because if you’ve got business people making decisions off of something that’s incorrect, while you’ve got other people actually doing the work, whether it’s the sales side or it’s the data side, and decisions are being made on something that’s not right, that’s very frustrating. But it happens all the time.
“You have people who are just used to saying, ‘Hey, I’m going to put this in an Excel spreadsheet. Oh, and I might even put it up on the platform that nobody knows that I have.’”
Data Culpa: This is like rogue IT for data.
Lauren Moores: Yeah, or it’s a permissions thing or it’s not understanding why people are using it. In my last company, we would spend a lot of time asking, “Well, what are you trying to do?” Because there are three ways to do it, and this is the best possible way. And, you know, yes, you can do it the way you’re doing; however, it’s going to get stale or it’s going to be out of date, or it’s siloed from everything else that everybody else is doing in the company.
Understanding Bias and Its Effects on Data Science Results
Data Culpa: So, we have a hunch that eventually data quality leads into another big issue in data: AI-like bias in data models and its relationship to data quality. Is that something that has come up in your work?
Lauren Moores: Yeah, I think it’s more than bias. I mean, every dataset is bad, right? Every model has its bias. There’s no such thing as ground truth. There is a ground truth, but even with the ground truth, you have to understand the nuances of that particular ground truth and what exactly it is confirming or not confirming.
“Every model has its bias.”
You know, in a former company, we used many different datasets in order to understand the different patterns. And what we did is, we evaluated each dataset and understood its biases, which actually helps with data quality. Because you know exactly how it’s going to behave, and if all of a sudden it starts behaving differently, then you have to understand why. But you can take that into consideration when you’re modeling. You can triangulate the data, or you can handle that in a way that you’re handling changes within a particular source to get to the truth.
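Editor's illustration (not from the interview): the idea of knowing how each dataset normally behaves, and investigating when it suddenly behaves differently, can be sketched as a baseline profile plus a drift check. The mean-shift rule and the threshold of 3 standard deviations are assumptions made for this example, not the method used at the company described; real pipelines would likely profile many more properties per source.

```python
from statistics import mean, stdev


def profile(values):
    """Summarize how a source normally behaves (here, just mean and sample stdev)."""
    return {"mean": mean(values), "stdev": stdev(values)}


def has_drifted(baseline, new_values, threshold=3.0):
    """True when the new batch mean is more than `threshold` baseline stdevs from the baseline mean."""
    if baseline["stdev"] == 0:
        return mean(new_values) != baseline["mean"]
    z = abs(mean(new_values) - baseline["mean"]) / baseline["stdev"]
    return z > threshold


# Illustrative numbers only: a historical sample from one source, then two new batches.
baseline = profile([100, 102, 98, 101, 99, 100, 103, 97])

usual_batch = [101, 99, 100, 102]
odd_batch = [130, 128, 135, 131]

print(has_drifted(baseline, usual_batch))  # False
print(has_drifted(baseline, odd_batch))    # True
```

A drift flag is only the starting point. Someone with domain knowledge still has to decide whether the source broke or the underlying behavior genuinely changed.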
And so, yes, absolutely: any data you’re using is going to create a biased model. You have to understand what it is that’s being biased. You could select and create your features in a way that is as unbiased as possible. But you also have to be aware of the fact that there’s no model that’s 100% going to mimic true behavior.
The work becomes understanding what that bias is, and then translating the outcomes to the known inputs and the known feature selections, and making sure that people understand all that.
“The work becomes understanding what that bias is . . . and making sure that people understand all that.”
Then you can say here’s where this model is working. Here’s how it’s generalized really well. Here’s where we might have some bias, but we don’t need to address that right now, because it’s not going to change the outcome. Or you could say, we know in this particular case, it’s biased in a certain way because of the data that we have. And we need to make it more accurate, and we need to get this other type of data to add into that or we need to transform the data, or we need to create a smaller usage, or we know that we’ll use this particular model to inform on the next model.
End of Part 2.