Why You Need a Data Quality Strategy for External Data


A company’s data is its most valuable asset: that’s a commonplace observation in boardrooms and IT departments today. Every organization recognizes the importance of its data. Somewhere in the mission statement of most digital transformation projects, there’s probably a remark about becoming “a more data-driven organization” or “more fully leveraging the data we have.”

Data is obviously part of what makes a company unique and therefore valuable. No two companies have the same data. (If they did, the less capable of those two companies wouldn’t be around very long.)

But even if your organization has a wealth of proprietary data, your organization’s success probably depends on enhancing that data with external data and broadening the scope of your data applications.

Data scientists recognize this. In a recent Deloitte survey, 92% of data analytics professionals said their companies should increase their use of external data.

A 2018 MIT Sloan Management Review report on data and data analytics found that:

The most analytically mature organizations use more data sources, including data from customers, vendors, regulators, and competitors. “Analytical innovators,” or companies that incorporate analytics into most aspects of decision-making, are four times more likely than less mature organizations to use all four data sources, and are more likely to use a variety of data types, including mobile, social, and public data.

There’s a wealth of external data out there to tap into, everything from transaction data from partners to financial market data to weather data to data feeds from IoT devices — the list goes on.

Ingesting that data and combining it in innovative ways can yield powerful analytical insights. It can also lead to the development of valuable new products and services.

Data Quality Challenges with External Data

But external data brings challenges, too, particularly in the area of data quality.

Your company already has data quality controls in place for internal data, especially structured data such as records in customer databases and financial applications. Your company controls the data objects in those repositories and applications. You can define the data schemas and put data management controls in place to ensure that a customer record in your CRM, for example, matches the format of a customer record in your billing application. If something goes wrong with internal data, you can assign a data steward to investigate the discrepancy and correct it.

You don’t have that kind of visibility and control with external data. Schemas can change unexpectedly. Columns might be added, removed, or reshuffled. Column names might change without notice. And data values might drift slowly or lurch suddenly, breaking analytical results further down a pipeline.

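To make the problem concrete, here is a minimal sketch (in Python with pandas; the feed snapshots and column names are hypothetical) of the kind of schema comparison a data team might run to notice these changes between two deliveries of the same feed:

```python
import pandas as pd

# Hypothetical: yesterday's snapshot of an external feed vs. today's.
yesterday = pd.DataFrame(columns=["customer_id", "region", "amount"])
today = pd.DataFrame(columns=["customer_id", "territory", "amount", "currency"])

def schema_diff(old: pd.DataFrame, new: pd.DataFrame) -> dict:
    """Report columns added to or removed from an external feed."""
    old_cols, new_cols = set(old.columns), set(new.columns)
    return {
        "added": sorted(new_cols - old_cols),
        "removed": sorted(old_cols - new_cols),
    }

print(schema_diff(yesterday, today))
# A renamed column ("region" -> "territory") surfaces only as one
# removal plus one addition; the provider sends no notice.
```

Note that a comparison like this only catches structural changes; drifting or lurching values within an unchanged schema need separate monitoring.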

Ideally, the company or organization providing the external data should warn you about these changes, but the data teams we’ve talked to tell us that providers hardly ever do.

So data engineers and data scientists discover these changes the hard way — after the changes have already broken a data pipeline’s functioning or skewed its results. In some cases, undetected changes in external data could corrupt business transactions or render a business intelligence report meaningless.

To catch these problems, a data team can sit down and write unit tests, defining expectations for all these columns in all their external data feeds. But that’s a mind-boggling amount of work. There are simply too many fields with too many expectations in too many data feeds to manually write unit tests for all of them. And companies are adding new data feeds and data use cases all the time.
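For a sense of the scale involved, here is what a single hand-written expectation might look like (a plain-Python sketch; the column name and bounds are made up). Multiply this by every column in every feed:

```python
def check_amount_column(values):
    """One expectation for one column of one feed: values must be
    non-null, non-negative, and below an assumed sanity ceiling."""
    failures = []
    for i, v in enumerate(values):
        if v is None:
            failures.append((i, "null value"))
        elif v < 0:
            failures.append((i, "negative amount"))
        elif v > 1_000_000:  # assumed ceiling for this feed
            failures.append((i, "implausibly large amount"))
    return failures

# Three rules for one column of one feed. A real pipeline has
# hundreds of columns, each needing its own set of rules.
print(check_amount_column([12.5, None, -3.0, 42.0]))
```

Each rule also has to be revisited whenever the feed changes, which is exactly when the provider is least likely to tell you.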

Unit tests or expectations have a role to play. But they’ll never provide the comprehensive coverage that data teams need for discovering problems in external data feeds before those problems jeopardize business results.

Automating Data Quality Analysis with Data Culpa

That’s why at Data Culpa, we’re building a new approach to data quality intelligence: automatically detecting changes in data patterns and alerting data engineers and data scientists in time for them to inspect the changes and, if necessary, take corrective action.
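As a generic illustration of the underlying idea (a sketch, not Data Culpa’s implementation), automated monitoring can learn a baseline from a column’s recent history and flag new values that fall outside it, with no hand-written rules per column:

```python
import statistics

def build_baseline(history):
    """Learn a simple baseline (mean, stdev) from recent values."""
    return statistics.mean(history), statistics.stdev(history)

def flag_drift(value, baseline, n_sigmas=3.0):
    """Flag a new value that falls outside the learned range."""
    mean, stdev = baseline
    return abs(value - mean) > n_sigmas * stdev

# Hypothetical daily row counts from an external feed:
history = [1020, 995, 1010, 1003, 998, 1012, 1007]
baseline = build_baseline(history)

print(flag_drift(1005, baseline))  # a typical day's count
print(flag_drift(120, baseline))   # a sudden drop worth alerting on
```

A real system would track many statistics per column (null rates, distinct counts, distributions) and update its baselines continuously, but the principle is the same: learn what normal looks like, then alert on departures from it.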

Interested in trying the Data Culpa solution yourself? Contact us at freetrial@dataculpa.com.