Why You Need a Data Quality Strategy for External Data


A company’s data is its most valuable asset — that’s a commonplace observation in boardrooms and IT labs today. Every organization recognizes the importance of its data. Somewhere in the mission statement for most digital transformation projects, there’s probably a remark about becoming “a more data-driven organization” or “more fully leveraging the data that we have.”

Data is obviously part of what makes a company unique and therefore valuable. No two companies have the same data. (If they did, the less capable of those two companies wouldn’t be around very long.)

But even if your organization has a wealth of proprietary data, your organization’s success probably depends on enhancing that data with external data and broadening the scope of your data applications.

Data scientists recognize this. In a recent Deloitte survey, 92% of data analytics professionals said their companies should increase their use of external data.

A 2018 MIT Sloan Management Review report on data and data analytics found that:

The most analytically mature organizations use more data sources, including data from customers, vendors, regulators, and competitors. “Analytical innovators,” or companies that incorporate analytics into most aspects of decision-making, are four times more likely than less mature organizations to use all four data sources, and are more likely to use a variety of data types, including mobile, social, and public data.

There’s a wealth of external data out there to tap into, everything from transaction data from partners to financial market data to weather data to data feeds from IoT devices — the list goes on.

Ingesting that data and combining it in innovative ways can yield powerful analytical insights. It can also lead to the development of valuable new products and services.

Data Quality Challenges with External Data

But external data brings challenges, too, particularly in the area of data quality.

Your company already has data quality controls in place for internal data, especially structured data such as records in customer databases and financial applications. Your company controls the data objects in those repositories and applications. You can define the data schemas and put data management controls in place to ensure that a customer record in your CRM, for example, matches the format of a customer record in your billing application. If something goes wrong with internal data, you can assign a data steward to investigate the discrepancy and correct it.

You don’t have that kind of visibility and control with external data. Schemas can change unexpectedly. Columns might be added, removed, or reshuffled. Column names might change without notice. And data values might drift slowly or lurch suddenly, breaking analytical results further down a pipeline.
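To make the problem concrete, here is a minimal sketch of how a data team might detect schema changes between feed deliveries. The column names are hypothetical, not taken from any particular feed:

```python
# A minimal sketch of detecting schema changes in an external data feed.
# Column names below are hypothetical; substitute your own feed's schema.

def diff_schema(expected: list, actual: list) -> dict:
    """Compare the columns we expect against what the feed actually delivered."""
    expected_set, actual_set = set(expected), set(actual)
    return {
        "added": sorted(actual_set - expected_set),
        "removed": sorted(expected_set - actual_set),
        # Reordering only matters when the column sets are otherwise identical.
        "reordered": expected_set == actual_set and expected != actual,
    }

# Yesterday's delivery vs. today's delivery (hypothetical):
baseline = ["customer_id", "order_date", "amount"]
today = ["customer_id", "amount", "order_date", "currency"]

print(diff_schema(baseline, today))
# flags "currency" as an added column; nothing removed
```

A check like this catches added, removed, and renamed columns (a rename shows up as one removal plus one addition), but it says nothing about values drifting inside columns that are still present — that requires profiling the data itself.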
Ideally, the company or organization providing the external data should warn you about these changes, but the data teams we’ve talked to tell us they hardly ever do.

So data engineers and data scientists discover these changes the hard way — after the changes have already broken a data pipeline’s functioning or skewed its results. In some cases, undetected changes in external data could corrupt business transactions or render a business intelligence report meaningless.

To catch these problems, a data team can sit down and write unit tests, defining expectations for all these columns in all their external data feeds. But that’s a mind-boggling amount of work. There are simply too many fields with too many expectations in too many data feeds to manually write unit tests for all of them. And companies are adding new data feeds and data use cases all the time.
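To illustrate the scale of that effort, here is what hand-written expectations for just one column might look like. The column name and bounds are hypothetical; multiply this by every field in every feed to see why manual coverage breaks down:

```python
# A sketch of hand-written expectations for a single hypothetical column.
# Every field in every external feed would need its own version of this.

def check_amount_column(values: list) -> list:
    """Return expectation failures for a hypothetical 'amount' column."""
    failures = []
    if any(v is None for v in values):
        failures.append("amount: null values present")
    if any(v < 0 for v in values if v is not None):
        failures.append("amount: negative values present")
    non_null = [v for v in values if v is not None]
    if non_null and max(non_null) > 1_000_000:
        failures.append("amount: value above expected maximum")
    return failures

print(check_amount_column([19.99, 250.0, -4.0]))
# reports one failure: negative values present
```

Even this toy check encodes three assumptions about one column, and each assumption has to be written, reviewed, and maintained by hand as feeds evolve.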

Unit tests or expectations have a role to play. But they’ll never provide the comprehensive coverage that data teams need for discovering problems in external data feeds before those problems jeopardize business results.

Automating Data Quality Analysis with Data Culpa

That’s why at Data Culpa, we’re building a new approach to data quality intelligence, automatically detecting changes in data patterns and alerting data engineers and data scientists in time for them to inspect the changes and, if necessary, take corrective action.

Interested in trying the Data Culpa solution yourself? Contact us at freetrial@dataculpa.com.

Accounting for Bias in Data Analytics: An Interview with Lauren S. Moores, PhD, Part 3

Lauren S. Moores, a.k.a. “the Data Queen,” is Head of Data Sciences Innovation for FL65 and Invaio Sciences, two companies launched by Flagship Pioneering, a venture capital and private equity firm dedicated to creating breakthroughs in human health and sustainability and building life sciences companies. She is also Chair of the Data Advisory Board for USA for UNHCR, a non-profit organization supporting the UN Refugee Agency, and a Teaching Fellow at the Harvard Business Analytics Program. We spoke to Lauren about building data science systems for companies of all sizes, ensuring those platforms maintain high data quality, and accounting for bias in data analytics. (Read Part 1 of our interview here and Part 2 of our interview here.)

Continue reading “Accounting for Bias in Data Analytics: An Interview with Lauren S. Moores, PhD, Part 3”

Different Roles in Data Science: An Interview with Lauren S. Moores, PhD, Part 2

Lauren S. Moores, a.k.a. “the Data Queen,” is Head of Data Sciences Innovation for FL65 and Invaio Sciences, two companies launched by Flagship Pioneering, a venture capital and private equity firm dedicated to creating breakthroughs in human health and sustainability and building life sciences companies. She is also Chair of the Data Advisory Board for USA for UNHCR, a non-profit organization supporting the UN Refugee Agency, and a Teaching Fellow at the Harvard Business Analytics Program. We spoke to Lauren about building data science systems for companies of all sizes, ensuring those platforms maintain high data quality, and different roles in data science for engineers and subject matter experts. (Read Part 1 of our interview here.)

Continue reading “Different Roles in Data Science: An Interview with Lauren S. Moores, PhD, Part 2”

Building Data Science Systems: An Interview with Lauren S. Moores, PhD, Part 1

Lauren S. Moores, a.k.a. “the Data Queen,” is Head of Data Sciences Innovation for FL65 and Invaio Sciences, two companies launched by Flagship Pioneering, a venture capital and private equity firm dedicated to creating breakthroughs in human health and sustainability and building life sciences companies. She is also Chair of the Data Advisory Board for USA for UNHCR, a non-profit organization supporting the UN Refugee Agency, and a Teaching Fellow at the Harvard Business Analytics Program. We spoke to Lauren about building data science systems for companies of all sizes and ensuring those platforms maintain high data quality.

Continue reading “Building Data Science Systems: An Interview with Lauren S. Moores, PhD, Part 1”

Data Quality Problems and How to Fix Them: An Interview with Michal Klos, Part 2

Data engineering teams need to be able to find and fix data quality problems quickly. In part 2 of our interview with Michal Klos, Michal discusses the challenges of discovering and troubleshooting data quality problems in pipelines. Michal is Sr. Director of Engineering at Indigo Ag, where he is helping the fight against climate change. Michal has been leading data engineering and platform teams for nearly two decades. Before joining Indigo Ag, he built and scaled massive data platforms at multiple companies, including Localytics, Compete, and WB Games. You’ll find Part 1 of our interview with Michal here.

Continue reading “Data Quality Problems and How to Fix Them: An Interview with Michal Klos, Part 2”

Data Quality Challenges for Data Scientists: An Interview with Michal Klos, Part 1

Data quality is an ongoing challenge for data scientists and data engineers. In this interview, Michal Klos shares the insights he’s gleaned from a long career building data pipelines and ensuring they deliver high-quality results. Michal is Sr. Director of Engineering at Indigo Ag, where he is helping the fight against climate change. Michal has been leading data engineering and platform teams for nearly two decades. Before joining Indigo Ag, he built and scaled massive data platforms at multiple companies, including Localytics, Compete, and WB Games.

Continue reading “Data Quality Challenges for Data Scientists: An Interview with Michal Klos, Part 1”