In this blog, we’ll discuss all things data quality and data observability. We’ll present interviews with leading data scientists, and we’ll discuss our own work building data observability solutions for data engineers and data scientists working in a broad range of industries.

Data Quality in Healthcare: An Interview with Shawn Stapleton, PhD, Part 2

Mar 15, 2021 | Blog, Data Culpa, Interview

As a leader in healthcare innovation at Philips Healthcare, Shawn Stapleton manages a portfolio of technology solutions that have immediate impact for hospitals around the world. Shawn works with a network of clinical partners to identify key needs and to develop novel solutions that improve the quality and efficiency of clinical care delivery. Shawn sits on several steering committees within Philips to develop and implement innovation strategies across multiple businesses, as well as with multiple granting agencies to advise on national research initiatives in healthcare. As Director of Innovation, Shawn strives to put research into practice, bringing ideas to market and working with worldwide leaders to deploy the next generation of healthcare innovations.

Shawn has a PhD in Medical Biophysics from the University of Toronto and a B.S. with honors in Physics and Computer Science from the University of Victoria. 

Shawn is an investor in Data Culpa, Inc. and a member of the company’s Board of Advisors.

Read Part 1 of our interview here.

Data Quality and the Lack of Standardized Data Tools in Healthcare

Data Culpa:  At Data Culpa, our big theme is data quality. I’m wondering if a data team working with an ad hoc assortment of tools as you’ve described creates any data quality challenges. Does a lack of standardization pose any risks for the consistency or quality of data?

Shawn Stapleton:  There are many layers of complexity in this question. I’ll take you through how data is generated and flows from source to data scientist in healthcare to try and shed light on data quality challenges in healthcare.

Healthcare data is largely produced to drive the business of healthcare delivery. We tend to think that healthcare data primarily consists of clinical information: lab results, genomic information, radiology images and reports, pathology reports, and so on.

Healthcare data is largely produced to drive the business of healthcare delivery.

However, there’s an entire operational machinery to drive a patient examination and the generation of a clinical result. Ordering, exam scheduling, staff scheduling, examination workflow, and billing are all examples of operational healthcare data.

Today this data is stored in a collection of hospital information systems. The main system we tend to discuss is the electronic medical record (EMR) system. In reality, information is also stored in lab information systems, radiology information systems, billing systems, scheduling systems, and more. Some of these systems are department-dependent or are duplicated across departments. Rarely are these systems static. Systems are being upgraded, added to, and migrated over time. The integration, support, and maintenance of these systems results in data quality issues.

Nowadays healthcare data is largely digital. But this wasn’t always the case. Back in the day, hospitals, clinics, and private practices managed their business using paper records. The switch to digital is actually quite recent, starting in the 1980s and only becoming widely adopted after the passing of the HITECH Act, part of the American Recovery and Reinvestment Act (ARRA) of 2009, which introduced the concept of meaningful use. Going digital meant that paper-based records needed to be digitized to maintain continuity of patient care. This has been an ongoing and challenging process, fraught with data quality issues.

Nowadays healthcare data is largely digital.

So what we have in healthcare today is a collection of operational and clinical data stored across multiple systems, with varying data quality. This has profound implications for data science and machine learning. Getting access to data means working closely with hospital IT to query multiple systems. When you run into data linking issues, data quality issues, or missing data, you need to go back to hospital IT to understand the problem and/or get new data. Hospital IT is not in the business of machine learning; they’re in the business of supporting hospital operations. So, you can imagine this process is time consuming. Sometimes you settle for the data you get and do your best to remove data quality issues, with little understanding of whether the data you removed was actually important and representative.

So what typically happens in data science projects in healthcare is the collection of heterogeneous data with limited metadata providing context about how the data was collected and how it is changing over time. The data has likely been integrated from multiple healthcare systems, each with its own data quality challenges. The evolution of healthcare systems and devices results in data drift and an increased potential for new data quality issues to arise. Without that context, data scientists rely on subject matter experts who have been around long enough to know the data and the context in which it was generated.

What typically happens in data science projects in healthcare is the collection of heterogeneous data with limited metadata providing context about how the data was collected and how it is changing over time.
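To make the "data drift" Shawn describes concrete, here is a minimal sketch of the kind of check a data team might run when comparing two extracts from a hospital system, such as before and after a system upgrade. The field names and records are hypothetical, and this is not Data Culpa's implementation; it only illustrates two common drift symptoms: fields appearing or disappearing, and a changing rate of missing values.

```python
def schema_drift(old_records, new_records):
    """Report fields that appeared or disappeared between two extracts."""
    old_fields = {k for r in old_records for k in r}
    new_fields = {k for r in new_records for k in r}
    return {"added": sorted(new_fields - old_fields),
            "removed": sorted(old_fields - new_fields)}

def null_rate(records, field):
    """Fraction of records where a field is missing or empty."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if r.get(field) in (None, ""))
    return missing / len(records)

# Hypothetical lab-results feed: a system upgrade renames a field
# and starts emitting empty values for it.
before = [{"patient_id": 1, "hgb": 13.5}, {"patient_id": 2, "hgb": 12.1}]
after = [{"patient_id": 3, "hemoglobin": 13.0},
         {"patient_id": 4, "hemoglobin": None}]

print(schema_drift(before, after))  # {'added': ['hemoglobin'], 'removed': ['hgb']}
print(null_rate(after, "hemoglobin") - null_rate(before, "hgb"))  # 0.5
```

In practice these checks would run continuously against each source system, since, as Shawn notes, the systems themselves keep evolving.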

Data Culpa:  Do you see a movement underway to standardize tools and processes to make these changes more likely? For example, let’s say a hospital group decides they want to improve radiology outcomes for their patients, and they think that applying AI might be able to do that. By now, they probably recognize that they need to address data quality problems like data drift, and they need to standardize tools. Do you see any sort of top-down initiative looking at addressing this as an issue of enterprise architecture?

Shawn Stapleton:  Absolutely. I mean, it’s top-down and bottom-up, and sideways, and it’s coming from every side. There’s a lot of work being done here, but I don’t believe there are any solid production tools to accomplish this yet. The real challenge in my mind is understanding data quality across an intertwined collection of systems that evolve over time. There’s substantial value in providing insights into data quality issues that are connected across the data sources used in healthcare and the processing pipelines used for machine learning.


End of Part 2. Go back and read Part 1.
