Introducing Data Culpa Validator

What’s your data doing? Can you tell?

Data teams tell us they need better visibility into data pipelines, integrations, repositories, and data lakes.

You can hand-code a bunch of unit tests to check for known boundary cases. But coverage will always be limited.
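A hand-coded check of this kind might look like the following sketch. The feed, field names, and bounds are hypothetical, purely for illustration:

```python
def check_orders_batch(rows: list[dict]) -> list[str]:
    """Return rule violations for one batch of a (hypothetical) orders feed."""
    problems = []
    required = {"order_id", "order_total", "quantity"}
    for i, row in enumerate(rows):
        # Required fields must be present before value checks can run.
        missing = required - row.keys()
        if missing:
            problems.append(f"row {i}: missing fields {sorted(missing)}")
            continue
        # Known boundary case: order totals must be non-negative.
        if row["order_total"] < 0:
            problems.append(f"row {i}: negative order_total")
        # Known boundary case: quantities should fall in a plausible range.
        if not 1 <= row["quantity"] <= 10_000:
            problems.append(f"row {i}: quantity outside 1..10000")
    return problems
```

Each new rule is another line of code to write and maintain, which is why coverage stays limited: the checks only catch the cases someone thought to encode.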

And unit tests don’t give you visibility across time.

How does today’s data ingest compare to yesterday’s or last week’s? Are schemas changing? Data values drifting? Is this week’s spike an anomaly or just a typical end-of-quarter surge?

Is your external data feed about to corrupt your results?

Observe Data Changes over Time

At Data Culpa, we’ve built a SaaS service that gives you a bird’s-eye view of your data.

Using our service, you can:

  • See what’s happening with your data at a glance.
  • Compare the data you’re getting to the data you’re expecting. Compare activity across time.
  • Receive alerts, and dig into what’s happening with your data before it jeopardizes business results.

We call our service Validator, because it helps you validate the data your company depends on.

Our early access program is open. Interested? Sign up today.

Why You Need a Data Quality Strategy for External Data


A company’s data is its most valuable asset — that’s a commonplace observation in board rooms and IT labs today. Every organization recognizes the importance of its data. Somewhere in the mission statement for most digital transformation projects, there’s probably a remark about becoming “a more data-driven organization” or “more fully leveraging the data that we have.”

Data is obviously part of what makes a company unique and therefore valuable. No two companies have the same data. (If they did, the less capable of those two companies wouldn’t be around very long.)

But even if your organization has a wealth of proprietary data, your organization’s success probably depends on enhancing that data with external data and broadening the scope of your data applications.

Data scientists recognize this. In a recent Deloitte survey, 92% of data analytics professionals said their companies should increase their use of external data.

A 2018 MIT Sloan Management Review report on data and data analytics found that:

The most analytically mature organizations use more data sources, including data from customers, vendors, regulators, and competitors. “Analytical innovators,” or companies that incorporate analytics into most aspects of decision-making, are four times more likely than less mature organizations to use all four data sources, and are more likely to use a variety of data types, including mobile, social, and public data.

There’s a wealth of external data out there to tap into, everything from transaction data from partners to financial market data to weather data to data feeds from IoT devices — the list goes on.

Ingesting that data and combining it in innovative ways can yield powerful analytical insights. It can also lead to the development of valuable new products and services.

Data Quality Challenges with External Data

But external data brings challenges, too, particularly in the area of data quality.

Your company already has data quality controls in place for internal data, especially structured data such as records in customer databases and financial applications. Your company controls the data objects in those repositories and applications. You can define the data schemas and put data management controls in place to ensure that a customer record in your CRM, for example, matches the format of a customer record in your billing application. If something goes wrong with internal data, you can assign a data steward to investigate the difference and correct it.

You don’t have that kind of visibility and control with external data. Schemas can change unexpectedly. Columns might be added, removed, or reshuffled. Column names might change without notice. And data values might drift slowly or lurch suddenly, breaking analytical results further down a pipeline.
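One way to catch such changes early is to compare each incoming batch’s schema against the last known-good one before loading. A minimal sketch (the column names are illustrative, not from any real feed):

```python
def diff_schema(expected: list[str], observed: list[str]) -> dict:
    """Compare an observed column list against the last known-good schema."""
    exp, obs = set(expected), set(observed)
    return {
        "added": sorted(obs - exp),    # columns that appeared without notice
        "removed": sorted(exp - obs),  # columns that were dropped
        # Same columns in a different order can still break position-based loads.
        "reordered": exp == obs and expected != observed,
    }

# Example: a feed renames 'zip' to 'postal_code' and shuffles column order.
changes = diff_schema(
    expected=["id", "name", "zip"],
    observed=["name", "id", "postal_code"],
)
```

A check like this catches structural breaks, but not the slower problem of value drift within columns that still have the right names and types.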


Ideally, the company or organization providing the external data should warn you about these changes, but the data teams we’ve talked to tell us they hardly ever do.

So data engineers and data scientists discover these changes the hard way — after the changes have already broken a data pipeline’s functioning or skewed its results. In some cases, undetected changes in external data could corrupt business transactions or render a business intelligence report meaningless.

To catch these problems, a data team can sit down and write unit tests, defining expectations for all these columns in all their external data feeds. But that’s a mind-boggling amount of work. There are simply too many fields with too many expectations in too many data feeds to manually write unit tests for all of them. And companies are adding new data feeds and data use cases all the time.

Unit tests or expectations have a role to play. But they’ll never provide the comprehensive coverage that data teams need for discovering problems in external data feeds before those problems jeopardize business results.

Automating Data Quality Analysis with Data Culpa

That’s why at Data Culpa, we’re building a new approach to data quality intelligence, automatically detecting changes in data patterns and alerting data engineers and data scientists in time for them to inspect the changes and, if necessary, take corrective action.
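As a rough illustration of the idea (this is not Data Culpa’s actual algorithm), drift in a numeric field can be flagged by comparing each day’s summary statistic against a trailing window of previous days:

```python
from statistics import mean, stdev

def drift_alert(history: list[float], today: float, threshold: float = 3.0) -> bool:
    """Flag today's value if it sits more than `threshold` standard
    deviations from the trailing window of daily values."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu  # a flat history makes any change notable
    return abs(today - mu) / sigma > threshold

# Daily mean order totals for the past week, then one new day to evaluate.
week = [102.0, 99.5, 101.2, 100.8, 98.9, 100.1, 101.6]
drift_alert(week, 100.4)  # within normal variation
drift_alert(week, 180.0)  # a sudden jump worth alerting on
```

A real system needs to distinguish genuine anomalies from recurring patterns such as end-of-quarter surges, which is exactly where per-field thresholds like this fall short and automated pattern learning earns its keep.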

Interested in trying the Data Culpa solution yourself? Contact us at freetrial@dataculpa.com.

Data Quality in Healthcare: An Interview with Shawn Stapleton, PhD, Part 2

As a leader in healthcare innovation at Philips Healthcare, Shawn Stapleton manages a portfolio of technology solutions that have immediate impact for hospitals around the world. Shawn works with a network of clinical partners to identify key needs and to develop novel solutions that improve the quality and efficiency of clinical care delivery. Shawn sits on several steering committees within Philips to develop and implement innovation strategies across multiple businesses, as well as with multiple granting agencies to advise on national research initiatives in healthcare. As Director of Innovation, Shawn strives to put research into practice, bring ideas to market, and work with worldwide leaders to deploy the next generation of healthcare innovations.

Shawn has a PhD in Medical Biophysics from the University of Toronto and a B.S. with honors in Physics and Computer Science from the University of Victoria. 

Shawn is an investor in Data Culpa, Inc. and a member of the company’s Board of Advisors.

Read Part 1 of our interview here.

Data Quality and the Lack of Standardized Data Tools in Healthcare

Data Culpa:  At Data Culpa, our big theme is data quality. I’m wondering if a data team working with an ad hoc assortment of tools as you’ve described creates any data quality challenges. Does a lack of standardization pose any risks for the consistency or quality of data?

Shawn Stapleton:  There are many layers of complexity in this question. I’ll take you through how data is generated and flows from source to data scientist in healthcare to try and shed light on data quality challenges in healthcare.

Healthcare data is largely produced to drive the business of healthcare delivery. We tend to think that healthcare data primarily consists of clinical information: lab results, genomic information, radiology images and reports, pathology reports, and so on.


However, there’s an entire operational machinery to drive a patient examination and the generation of a clinical result. Ordering, exam scheduling, staff scheduling, examination workflow, and billing are all examples of operational healthcare data.

Today this data is stored in a collection of hospital information systems. The main system we tend to discuss is the electronic medical record (EMR) system. In reality, information is also stored in lab information systems, radiology information systems, billing systems, scheduling systems, and more. Some of these systems are department-dependent or are duplicated across departments. Rarely are these systems static. Systems are being upgraded, added to, and migrated over time. The integration, support, and maintenance of these systems result in data quality issues.

Nowadays healthcare data is largely digital. But this wasn’t always the case. Back in the day, hospitals, clinics, and private practices managed their business using paper records. The switch to digital is actually quite recent, starting in the ’80s and only becoming widely adopted after the passage of the HITECH Act, part of the American Recovery and Reinvestment Act (ARRA) of 2009, which introduced the concept of meaningful use. The switch to digital meant that paper-based records needed to be digitized to maintain continuity of patient care. This has been an ongoing and challenging process, fraught with data quality issues.


So what we have in healthcare today is a collection of operational and clinical data stored across multiple systems, with varying data quality. This has profound implications for data science and machine learning. Getting access to data means working closely with hospital IT to query multiple systems. When you run into data-linking issues, data quality issues, or missing data, you need to go back to hospital IT to understand the problems and/or get new data. Hospital IT is not in the business of machine learning; they’re in the business of supporting hospital operations. So, you can imagine this process is time-consuming. Sometimes you settle for the data you get and do your best to remove data quality issues, with little understanding of whether the data you removed was actually important and representative.

So what typically happens for data science projects in healthcare is the collection of heterogeneous data with limited metadata providing context about how the data was collected and how the data is changing over time. The data was likely integrated from multiple healthcare systems, each with varying data quality challenges. The evolution of healthcare systems and devices results in data drift and increased potential for new data quality issues to arise. Without that context, data scientists rely on subject matter experts who have been around long enough to know the data and the context in which it was generated.


Data Culpa:  Do you see a movement underway to standardize tools and processes to make these changes more likely? For example, let’s say a hospital group decides they want to improve radiology outcomes for their patients, and they think that applying AI might be able to do that. By now, they probably recognize that they need to address data quality problems like data drift, and they need to standardize tools. Do you see any sort of top-down initiative looking at addressing this as an issue of enterprise architecture?

Shawn Stapleton:  Absolutely. I mean, it’s top-down and bottom-up and sideways; it’s coming from every side. There’s a lot of work being done here, but I don’t believe there are any solid production tools to accomplish this. The real challenge in my mind is understanding data quality across an intertwined collection of systems that evolve over time. There’s substantial value in providing insights into data quality issues that are connected across data sources used in healthcare and processing pipelines used for machine learning.

TO BE CONTINUED.

End of Part 2. Go back and read Part 1.

Accounting for Bias in Data Analytics: An Interview with Lauren S. Moores, PhD, Part 3

Lauren S. Moores, a.k.a. “the Data Queen,” is Head of Data Sciences Innovation for FL65 and Invaio Sciences, two companies launched by Flagship Pioneering, a venture capital and private equity firm dedicated to creating breakthroughs in human health and sustainability and to building life sciences companies. She is also Chair of the Data Advisory Board for USA for UNHCR, a non-profit organization supporting the UN Refugee Agency, and a Teaching Fellow at the Harvard Business Analytics Program. We spoke to Lauren about building data science systems for companies of all sizes, ensuring those platforms maintain high data quality, and accounting for bias in data analytics. (Read Part 1 of our interview here and Part 2 of our interview here.)

Continue reading “Accounting for Bias in Data Analytics: An Interview with Lauren S. Moores, PhD, Part 3”

Different Roles in Data Science: An Interview with Lauren S. Moores, PhD, Part 2

Lauren S. Moores, a.k.a. “the Data Queen,” is Head of Data Sciences Innovation for FL65 and Invaio Sciences, two companies launched by Flagship Pioneering, a venture capital and private equity firm dedicated to creating breakthroughs in human health and sustainability and to building life sciences companies. She is also Chair of the Data Advisory Board for USA for UNHCR, a non-profit organization supporting the UN Refugee Agency, and a Teaching Fellow at the Harvard Business Analytics Program. We spoke to Lauren about building data science systems for companies of all sizes, ensuring those platforms maintain high data quality, and different roles in data science for engineers and subject matter experts. (Read Part 1 of our interview here.)

Continue reading “Different Roles in Data Science: An Interview with Lauren S. Moores, PhD, Part 2”

Building Data Science Systems: An Interview with Lauren S. Moores, PhD, Part 1

Lauren S. Moores, a.k.a. “the Data Queen,” is Head of Data Sciences Innovation for FL65 and Invaio Sciences, two companies launched by Flagship Pioneering, a venture capital and private equity firm dedicated to creating breakthroughs in human health and sustainability and to building life sciences companies. She is also Chair of the Data Advisory Board for USA for UNHCR, a non-profit organization supporting the UN Refugee Agency, and a Teaching Fellow at the Harvard Business Analytics Program. We spoke to Lauren about building data science systems for companies of all sizes and ensuring those platforms maintain high data quality.

Continue reading “Building Data Science Systems: An Interview with Lauren S. Moores, PhD, Part 1”

Data Quality Problems and How to Fix Them: An Interview with Michal Klos, Part 2

Data engineering teams need to be able to find and fix data quality problems quickly. In part 2 of our interview with Michal Klos, Michal discusses the challenges of discovering and troubleshooting data quality problems in pipelines. Michal is Sr. Director of Engineering at Indigo Ag, where he is helping the fight against climate change. Michal has been leading data engineering and platform teams for nearly two decades. Before joining Indigo Ag, he built and scaled massive data platforms at multiple companies, including Localytics, Compete, and WB Games. You’ll find Part 1 of our interview with Michal here.

Continue reading “Data Quality Problems and How to Fix Them: An Interview with Michal Klos, Part 2”