Just as Maslow identified a hierarchy of needs for people, data teams have a hierarchy of needs, beginning with data freshness; including volumes, schemas, and values; and culminating with lineage.
In this blog post, published in the Data Science section of KDnuggets.com, the Data Culpa CTO explains why data quality requires its own hierarchy of needs. Here’s an excerpt:
Maslow identified a hierarchy of needs for human beings, beginning with a broad base of physiological needs such as food and shelter, rising through social needs, and capping off with creativity and self-realization.
Data teams have their own hierarchy of needs focused on making sure the data is right. Having worked with data engineers and data scientists in companies large and small, here’s how I would describe a data team’s hierarchy of needs for data quality. It starts at the ground floor with data freshness, where we determine whether we have the data we need in the first place.
Ideally, a team should have tools in place to measure data quality at each of these levels.
Layer 1: Data Freshness
To have good quality data, you have to make sure you have data in the first place.
Did your pipeline run? If so, when? Did the pipeline produce new data, or did it produce the same data as a previous run? Checking for stale data requires care: it’s easy to detect that nothing has updated at all, but a moving sequence number or timestamp doesn’t guarantee that the rest of the data has changed, so a stuck feed can hide behind fresh-looking metadata.
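One way to sketch this idea: hash the full payload of each run rather than trusting the newest timestamp alone, so a feed whose timestamps keep moving but whose content is stuck still gets flagged. This is a minimal illustration, not Data Culpa's implementation; the record format (a list of dicts with an `updated_at` ISO-8601 field) and the function name are assumptions for the example.

```python
import hashlib
import json
from datetime import datetime, timedelta, timezone

def check_freshness(records, last_run_hash, max_age=timedelta(hours=24)):
    """Return (is_fresh, new_hash) for a pipeline run.

    Hypothetical record format: each record is a dict containing an
    'updated_at' ISO-8601 timestamp.
    """
    if not records:
        # The pipeline ran but produced nothing.
        return False, last_run_hash

    # Hash the whole payload, not just the newest timestamp: a moving
    # timestamp can mask data that is otherwise stuck.
    payload = json.dumps(records, sort_keys=True).encode()
    new_hash = hashlib.sha256(payload).hexdigest()
    if new_hash == last_run_hash:
        # Byte-for-byte identical to the previous run: stale.
        return False, new_hash

    newest = max(datetime.fromisoformat(r["updated_at"]) for r in records)
    is_fresh = datetime.now(timezone.utc) - newest <= max_age
    return is_fresh, new_hash
```

Persisting `new_hash` between runs is what lets the check catch a feed that re-delivers yesterday's data under today's timestamp.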
Layer 2: Data Volumes
Once data freshness is established, the next need to address concerns data volumes. We might ask questions such as:
- Are the data volumes in line with what we expected?
- If data volumes vary, are they in line with trends we’ve already observed in our business, such as little activity on weekends or in the middle of the night?
In some businesses, such as retail, holidays and the run-up to holidays may see different volumes. Depending on the application, analyzing data volumes might require robust data models that consider multiple features.
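The simplest version of the trend-aware check above can be sketched by comparing today's row count against history for the same weekday, so that naturally quiet weekends don't trigger false alarms. The function name, the history structure, and the z-score threshold are all assumptions for illustration; a production model would likely account for holidays and longer seasonal patterns as well.

```python
from statistics import mean, stdev

def volume_ok(todays_count, history_by_weekday, weekday, z_threshold=3.0):
    """Hypothetical volume check: compare today's row count to counts
    observed on the same weekday, flagging outliers beyond z_threshold
    standard deviations from the historical mean."""
    past = history_by_weekday[weekday]
    if len(past) < 2:
        # Not enough history to judge; don't raise a false alarm.
        return True
    mu, sigma = mean(past), stdev(past)
    if sigma == 0:
        return todays_count == mu
    return abs(todays_count - mu) / sigma <= z_threshold
```

Keying the history by weekday is a crude but effective way to encode the "little activity on weekends" trend; holidays would need their own calendar-based baseline.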