Lauren S. Moores, a.k.a. “the Data Queen,” is Head of Data Sciences Innovation for FL65 and Invaio Sciences, two companies launched by Flagship Pioneering, a venture capital and private equity firm dedicated to creating breakthroughs in human health and sustainability and building life sciences companies. She is also Chair of the Data Advisory Board for USA for UNHCR, a non-profit organization supporting the UN Refugee Agency, and a Teaching Fellow at the Harvard Business Analytics Program. We spoke to Lauren about building data science systems for companies of all sizes, ensuring those platforms maintain high data quality, and accounting for bias in data analytics. (Read Part 1 of our interview here and Part 2 of our interview here.)
Data Culpa: If you know that a certain data source is biased in a certain way, is there a way to track that bias so you can account for it? Would you build a macro-level system that’s accounting for the bias in specific data sources and ensuring that bias remains within certain bounds? Is that part of data modeling now?
Lauren Moores: Yes. I mean, there are two ways to do it. You could handle it on the raw data side, and as things change, if they go above certain thresholds, then you’re going to smooth that out before it gets to the model itself. With machine learning, if you’re learning off of data that’s changing, then your model is changing, too, because it’s assuming that the behavior is changing, but if that’s the wrong – if that data is incorrect – then you’ve basically just introduced a huge amount of bias into that. So, again, it’s understanding the outcome, because your outcomes are going to start changing, so then you’re going to go back and be like, “Well, is that a true outcome because of behavior? Or has the underlying data changed?”
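The raw-data approach described above, where a threshold is checked and the data is smoothed before it reaches the model, could be sketched roughly as follows. This is only an illustration, not the method used at any company mentioned in the interview. The function names, the 20% shift threshold, and the choice of clipping to two standard deviations around the baseline are all assumptions made for the example.

```python
from statistics import mean, pstdev


def relative_shift(baseline, batch):
    """Relative change of the batch mean versus the baseline mean."""
    base = mean(baseline)
    if base == 0:
        raise ValueError("baseline mean is zero; relative shift is undefined")
    return abs(mean(batch) - base) / abs(base)


def smooth_batch(baseline, batch, max_shift=0.2, k=2.0):
    """Return (values, flagged).

    If the batch mean stays within max_shift of the baseline mean, the batch
    passes through untouched. Otherwise each value is clipped to the range
    baseline mean +/- k standard deviations, and the batch is flagged so a
    person can check whether the behavior or the data source changed.
    Both max_shift and k are hypothetical defaults.
    """
    shift = relative_shift(baseline, batch)
    if shift <= max_shift:
        return list(batch), False
    center, spread = mean(baseline), pstdev(baseline)
    lo, hi = center - k * spread, center + k * spread
    return [min(max(x, lo), hi) for x in batch], True


if __name__ == "__main__":
    history = [9, 10, 11, 10]
    print(smooth_batch(history, [10, 11, 9]))   # within bounds, not flagged
    print(smooth_batch(history, [15, 16, 14]))  # shifted, clipped and flagged
```

The flag matters as much as the smoothing: clipping keeps a suspect batch from dragging the model, while the flag prompts the follow-up question of whether the outcome is real or the underlying data has changed.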
Accounting for Bias Ahead of Time
But if you already know a priori that something is biased, then you can make adjustments ahead of time. For instance, I’ve worked with all different data in my career. In one company, we were doing, like, first appointments, so we were handling company information. I as a salesperson want to understand that company: what their organization is, where their office is, who I need to talk to in terms of contacts. And we had one source that had really good addresses, but they had horrible phone numbers. And we had one company that had really good governance data – you know, the org structure. But they didn’t have as good addresses.
So, it’s understanding those differences and building out your model accordingly. We built it out in the data side where we created a way to take the best of each source and then feed that into our models. Or you can model it such that you’re paying attention to how things are changing and using only certain pieces as it’s coming in. So, you can do either/or, but you need to pay attention to both sides.
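As a rough illustration of taking the best of each source on the data side, a field-level priority map can decide which source wins for each attribute. This is a minimal sketch under stated assumptions: the source names, field names, and priority order below are hypothetical, and the interview does not describe how the actual system was built.

```python
# Hypothetical per-field ranking: which source to trust first for each field.
# "source_a" stands in for the source with good addresses but poor phone
# numbers; "source_b" for the one with good org-structure data.
FIELD_PRIORITY = {
    "address": ["source_a", "source_b"],
    "phone": ["source_b", "source_a"],
    "org_structure": ["source_b", "source_a"],
}


def merge_best(records, priority=FIELD_PRIORITY):
    """Build one record by taking each field from its most trusted source.

    records maps a source name to that source's dict of fields. For every
    field, sources are tried in priority order and the first non-empty value
    wins. The winning source is kept alongside the value so the choice can
    be audited later.
    """
    merged = {}
    for field, sources in priority.items():
        for source in sources:
            value = records.get(source, {}).get(field)
            if value:
                merged[field] = {"value": value, "source": source}
                break
    return merged


if __name__ == "__main__":
    records = {
        "source_a": {"address": "1 Main St, Suite 200", "phone": "000"},
        "source_b": {
            "address": "Main St",
            "phone": "555-0100",
            "org_structure": "CEO > VP Sales",
        },
    }
    for field, choice in merge_best(records).items():
        print(field, choice)
```

Recording the source next to each value keeps the known bias of each feed visible downstream, so the priority order can be revisited if a source improves or degrades.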
The Importance of Data Quality
Data Culpa: So, for you, having worked at all these different companies and with all these teams, when someone says to you, “What are the most important things about data quality,” are there any other guidelines or best practices that you can think of, besides what we’ve already talked about, for people working on these projects?
Lauren Moores: Well, data is never what you think it is. It’s always difficult to get and when you do get it, you better question the hell out of it, right? I mean, you feel that if you have the domain expertise of a particular dataset, then you’re going to know whether or not you’ve got what you need. If it’s something brand new, you don’t know whether it’s going to provide you the value or the signal that you need until you start playing with it.
You know, a lot of people think that in this age of “big data,” that we have everything solved. If anything, that’s not true, because we’ve got good data, but we’ve got data from all different types of technologies or media and the ability to marry all that data is much more difficult. The ability to find the true signal is that much more difficult.
You don’t just want data. You want smart data. You want data that you can turn into actionable insights or products, create value from, and then capture that value. And it’s not just saying, “Hey, I just want everything that somebody typed into a particular form,” or “I want everything that people are thinking,” because that’s not value. A lot of times people create these huge systems, have all this information, but they’re using just 2% of it. And all the rest is crap. You don’t need it, because you’re not using it.
So a better approach is, define what you need, make sure you understand exactly what data you’re using to do it, and track what’s going in and what’s coming out, so that you continue to build a really good solution. You know, when it comes to data quality, if you don’t have the right data, then you’re going to have a lousy product. No matter what. Or if you have it ill-defined or you’re tracking the wrong thing, then you’re not going to be able to make the right decisions. I live and breathe data. It’s so innate, like I constantly work with data, I see the patterns, I see how you might be able to use it or why it might not be useful, but a lot of times people think you can just take something and be like, “Oh, that’s great. Now, oh, we have a product.” And my response is, “Well, do you? Do you really have a product?”
End of Part 3.