Using Timeshift in Data Culpa Validator

by | Nov 4, 2021 | Data Culpa

One of the best features we offer customers who are just getting started with Data Culpa is our “timeshift” feature. Timeshift lets us extract a point in time for a row or a document and have Validator evaluate the data contained within that row or document as if it had been processed by Validator on the specified date.

This means we can install Data Culpa on an application that has been running for months or even years and begin building insights about the data the application has already processed and when. Validator displays its insights about that data as if it had been running on that application the entire time.

We support timeshift in our open API as well as for Snowflake, MongoDB, Azure Data Lakes, and BigQuery installations, and we are rolling it out to every platform that we support, including file data.

All that’s required is that we can extract a date from a field. We support a large number of date formats. For MongoDB, we can extract the date from the ObjectId if no custom date field is available.
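For readers curious how a date can come out of an ObjectId: the first four bytes of a MongoDB ObjectId are a big-endian Unix timestamp in seconds. The sketch below decodes that timestamp from the 24-character hex form using only the Python standard library. The function name is our own for illustration, and this is not necessarily how Validator does it internally.

```python
from datetime import datetime, timezone


def objectid_to_datetime(object_id_hex: str) -> datetime:
    """Return the UTC creation time embedded in a 24-character hex ObjectId.

    Hypothetical helper for illustration only.
    """
    if len(object_id_hex) != 24:
        raise ValueError("an ObjectId is 12 bytes, i.e. 24 hex characters")
    # The first 4 bytes (8 hex characters) hold a big-endian Unix
    # timestamp in seconds; the remaining bytes are not time-related.
    seconds = int(object_id_hex[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


if __name__ == "__main__":
    print(objectid_to_datetime("507f1f77bcf86cd799439011").isoformat())
```

If you already use PyMongo, the `generation_time` property on `bson.ObjectId` gives the same value without manual decoding.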

Timeshift also provides a way to re-capture records that undergo ongoing re-enrichment. Some customers use an “append only” log format for their data, and others update records over time. If you have a ‘created time’ and an ‘updated time’, we can use those to slot each record at its creation date and re-slot it at its update date when ingesting changes for profiling.
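To make the slot and re-slot idea concrete, here is a minimal sketch of one way to decide which points in time a record belongs to. The field names `created_time` and `updated_time` and the function `slot_times` are assumptions made for this example; they are not part of the Validator API, and the real logic may differ.

```python
from datetime import datetime


def slot_times(record, created_key="created_time", updated_key="updated_time"):
    """Return the timestamps a record should be profiled under.

    Hypothetical illustration: a record is always slotted at its creation
    time, and slotted again at its update time if it was changed later.
    """
    created = record[created_key]
    updated = record.get(updated_key)
    slots = [created]
    if updated is not None and updated > created:
        slots.append(updated)
    return slots


if __name__ == "__main__":
    enriched = {
        "id": 1,
        "created_time": datetime(2021, 11, 1, 9, 0),
        "updated_time": datetime(2021, 11, 3, 14, 30),
    }
    for ts in slot_times(enriched):
        print(ts.date())
```

An append-only record with no update time simply gets a single slot at its creation time.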

The only restriction on timeshift is that time must be ascending; you can’t load records from last week if you have already loaded records from this week. We do offer ways of addressing that conundrum with our top-level metadata, which we will discuss in another post.
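In practice, the ascending-time rule means a backfill should be sorted by its timestamp before it is sent, and checked against the newest record already loaded. The sketch below shows that client-side preparation. The `created_time` key, the `last_loaded` argument, and the function name are assumptions for illustration, not Validator parameters.

```python
from datetime import datetime
from typing import Optional


def order_for_timeshift(records, last_loaded: Optional[datetime] = None,
                        key: str = "created_time"):
    """Sort records oldest-first and reject a batch that would go back in time.

    Hypothetical helper: `last_loaded` is the newest timestamp already
    loaded, if any.
    """
    ordered = sorted(records, key=lambda r: r[key])
    if ordered and last_loaded is not None and ordered[0][key] < last_loaded:
        raise ValueError(
            f"oldest record ({ordered[0][key]}) precedes the last loaded "
            f"time ({last_loaded}); timeshift requires ascending time"
        )
    return ordered


if __name__ == "__main__":
    batch = [
        {"id": "b", "created_time": datetime(2021, 10, 2)},
        {"id": "a", "created_time": datetime(2021, 10, 1)},
    ]
    print([r["id"] for r in order_for_timeshift(batch)])
```

Failing early like this keeps a late-arriving batch from being partially loaded before the ordering problem is noticed.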

The screenshot below shows a sample data load. We took in 60 days of old data and received an alert for that data, issued in the context of the data loaded. The graphs are also drawn for the days that the data identifies itself as belonging to, not for the day it was loaded. Of course, the load time is tracked as well, so we can unravel what’s going on if we need to.

Interested in trying Validator yourself? Sign up for a free trial.
