There’s a common ingest pattern we see across customers. Data lands in file or object storage such as S3 as JSON, Parquet, CSV, or similar files, and is then ingested into Snowflake, typically into at least two tables: one for the raw load and one or more that transform the data in some manner. Here’s a diagram that shows what this pattern looks like.
I have implemented this pattern myself for various projects at past companies. There’s a reason this pattern is popular; it has many nice properties. The object store can serve as a cache when re-processing data or when directing the data to multiple target environments, for example, and it becomes a simple source of truth for diagnostics.
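To make the pattern concrete, here is a minimal sketch of the classic two-step load using the Snowflake Python connector. The stage, table, and field names (landing_stage, raw_events, events_clean, id/ts/amount) are illustrative assumptions, not a prescription.

```python
# Minimal sketch of the classic "raw load + transform" pattern.
# Stage, table, and field names are illustrative assumptions.
import snowflake.connector

conn = snowflake.connector.connect(
    account="...", user="...", password="...",
    warehouse="...", database="...", schema="...",
)
cur = conn.cursor()

# Step 1: bulk-load every file into a raw table of untyped JSON documents.
cur.execute("CREATE TABLE IF NOT EXISTS raw_events (v VARIANT)")
cur.execute("""
    COPY INTO raw_events
    FROM @landing_stage/events/
    FILE_FORMAT = (TYPE = 'JSON')
""")

# Step 2: transform the raw rows into a typed downstream table.
cur.execute("""
    CREATE TABLE IF NOT EXISTS events_clean AS
    SELECT v:id::STRING            AS event_id,
           v:ts::TIMESTAMP_NTZ     AS event_ts,
           v:amount::NUMBER(12, 2) AS amount
    FROM raw_events
""")
```

Note that the same data now lives in the object store, the raw table, and the transformed table, and it gets moved and scanned at each hop.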
But a lot of data teams are coming to realize this pattern is expensive from a resource perspective. After all:
- You store the data multiple times.
- You move the data multiple times.
- Perhaps a significant amount of the data stored in those JSON files turns out not to be needed at all.
Do you really need to load all that data? Or do you just need some of it?
You’re going to pay for the compute to sort this question out, one way or the other.
Data Meshes Reduce the Need to Move Data
Enter the data mesh: one of its most exciting architectural changes is the idea of not moving the data around so much. With a data mesh architecture, the data can stay put in the object store (in our example here) without big bulk loads, and we can query it directly to build our downstream transform tables. Even Snowflake is talking about the idea.
With a data mesh, our pipeline changes to this:
So great, now we won’t have to ingest anything except the parts we need. We might use external tables, a SQL-over-JSON query layer, or some other technique to let Snowflake pull the data from these file layers into the core data environment when we’re ready to do work.
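One way to do this in Snowflake is an external table over the stage, so queries read only the fields they project. This is a hedged sketch; the stage path and field names are assumptions carried over from the earlier example.

```python
# Sketch: query the landed JSON in place via a Snowflake external table,
# pulling only the fields downstream work actually needs. Stage path and
# field names are assumptions.
import snowflake.connector

conn = snowflake.connector.connect(
    account="...", user="...", password="...",
    warehouse="...", database="...", schema="...",
)
cur = conn.cursor()

# Define a thin, typed view over the files; VALUE is the per-record VARIANT.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS ext_events (
        event_id STRING        AS (VALUE:id::STRING),
        event_ts TIMESTAMP_NTZ AS (VALUE:ts::TIMESTAMP_NTZ)
    )
    LOCATION = @landing_stage/events/
    FILE_FORMAT = (TYPE = JSON)
    AUTO_REFRESH = FALSE
""")

# Downstream transforms read just these columns, leaving the rest of the
# JSON untouched in the object store.
cur.execute("""
    CREATE OR REPLACE TABLE events_clean AS
    SELECT event_id, event_ts FROM ext_events
""")
```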
The glaring problem here is what to do about the external data’s fields and contents. Who is responsible for monitoring the formatting of all these files over time? Does it matter?
Well, it certainly does matter if we need to access it. As you can see in the figure above, our first opportunity to catch an error may sit further downstream than we would like. Delayed loading means delayed error discovery, and increased costs to diagnose problems. If we’re only pulling in parts of the data, at different frequencies, we might be surprised later. At Data Culpa, we have found that broken schemas in customer environments are often a proxy for other things going wrong in data flows.
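A do-it-yourself answer is to sample new files as they land and diff their fields against what downstream code expects. The sketch below assumes JSON-lines files under an illustrative bucket and prefix, and a hand-maintained list of expected fields; it shows the idea, not a production monitor.

```python
# DIY drift check: sample each new object and report expected fields that
# have gone missing. Bucket, prefix, and EXPECTED_FIELDS are assumptions.
import json
import boto3

EXPECTED_FIELDS = {"id", "ts", "amount", "currency"}  # fields downstream relies on
BUCKET, PREFIX = "landing-bucket", "events/2024/"

s3 = boto3.client("s3")

def missing_fields(bucket: str, key: str) -> set:
    """Return expected fields absent from the first record of a JSON-lines file."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    first_record = json.loads(body.splitlines()[0])  # sample only the first line
    return EXPECTED_FIELDS - set(first_record)

for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", []):
    gone = missing_fields(BUCKET, obj["Key"])
    if gone:
        print(f"{obj['Key']}: missing expected fields {sorted(gone)}")
```

Scripts like this work until the field list, the sampling strategy, and the file formats all start to drift, which is where purpose-built monitoring earns its keep.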
Making Data Mesh Monitoring More Precise with Data Culpa
So you want to be able to monitor the data as it lands in the object store, and ideally to monitor only the data you care about. With the Data Culpa API, you can embed knowledge of which fields the downstream code actually uses, and use it to drive alerting and warnings upstream. And if you want to re-ingest data or build new models from historical data later on, you might wonder how much of the data you have already landed will conform to those new schemas if the third party’s format has changed over time.
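As a rough illustration of the idea (not the actual Data Culpa API; the client class and method names below are hypothetical), the downstream transform declares the fields it consumes so that upstream monitoring can scope its alerts to exactly those fields:

```python
# Hypothetical sketch only: PipelineClient, watch_fields, and record_batch are
# invented names to illustrate the pattern, not Data Culpa's real API.
from hypothetical_monitoring_client import PipelineClient

FIELDS_USED_DOWNSTREAM = ["id", "ts", "amount"]  # what the transform actually reads

records = [{"id": "42", "ts": "2024-05-01T12:00:00Z", "amount": 19.99, "extra": "ignored"}]

client = PipelineClient(pipeline="s3-landing-events")
client.watch_fields(FIELDS_USED_DOWNSTREAM)  # alert on changes to these fields only
client.record_batch(records)                 # report a batch of landed records
```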
Data Culpa Validator can help track schema variations and pinpoint when a new field first came into existence. You can start at any point and provide timestamp metadata to backdate records, which means that if you install Data Culpa into a new environment and process a large existing S3 repository, you can see historical information about when new fields arrived and understand the impact on your models. You can query this from the API and embed this intelligence into your model construction, or check it ad hoc in our graphical interface.
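A backfill over an existing repository might look roughly like this, using each object’s original landing time as the backdated timestamp. As above, the monitoring client and its record_batch signature are hypothetical; the S3 calls are standard boto3, and the bucket and prefix names are assumptions.

```python
# Hypothetical backfill sketch: walk an existing S3 prefix and report each
# file with its original landing time so schema history is backdated rather
# than appearing to arrive "now".
import json
import boto3
from hypothetical_monitoring_client import PipelineClient  # hypothetical, as above

s3 = boto3.client("s3")
client = PipelineClient(pipeline="s3-landing-events")

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="landing-bucket", Prefix="events/"):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket="landing-bucket", Key=obj["Key"])["Body"].read()
        records = [json.loads(line) for line in body.splitlines() if line.strip()]
        # Use the object's own LastModified as the record timestamp so that
        # "when did this field first appear?" reflects history, not the backfill run.
        client.record_batch(records, timestamp=obj["LastModified"])
```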
Data Culpa Validator also enables flexible use cases:
- Comparing data streams: You can compare test data streams from ad hoc sources to other sources that exist in databases or in object stores (a rough sketch of the idea follows this list).
- Making pipelines smarter to reduce alerts: You can link together different parts of processing to create advanced alert models that minimize alerting on issues you don’t care about because your data infrastructure already resolves them downstream.
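As a rough, tool-agnostic sketch of the first use case (not Data Culpa’s comparison feature itself), you could diff the field sets and null rates of a test-stream sample against an existing table. The file path, table name, and connection details are assumptions; reading straight from an s3:// path assumes s3fs is installed, and fetch_pandas_all requires pyarrow.

```python
# Generic comparison sketch: diff field sets and null rates between a test
# stream sample and an existing reference table. Names/paths are assumptions.
import pandas as pd
import snowflake.connector

# Sample from the ad hoc test stream (requires s3fs for the s3:// path).
test_df = pd.read_json("s3://landing-bucket/test-stream/sample.jsonl", lines=True)

conn = snowflake.connector.connect(
    account="...", user="...", password="...",
    warehouse="...", database="...", schema="...",
)
cur = conn.cursor()
cur.execute("SELECT * FROM events_clean LIMIT 10000")
ref_df = cur.fetch_pandas_all()              # requires pyarrow
ref_df.columns = ref_df.columns.str.lower()  # Snowflake returns uppercase names

print("fields only in test stream:", set(test_df.columns) - set(ref_df.columns))
print("fields only in reference:  ", set(ref_df.columns) - set(test_df.columns))
print("null-rate deltas (test minus reference):")
print((test_df.isna().mean() - ref_df.isna().mean()).dropna().sort_values())
```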
When it comes to designing the “top of the pipeline,” it’s clear that the data mesh architecture enables significant cost savings beyond the “four principles” of data meshes, and given the price of cloud computing and the current economic climate, we expect more and more teams to move to a mesh-like architecture. We see the big data warehouses enabling these use cases, alongside new start-ups building query engines for the object store (the data lake). We’ve even seen this at “old school” on-premises customers whose data lake is an SMB share.
But no matter the generation of your current design, when you move to distributed storage and loosen the coupling, you will need better, smarter, and more flexible data monitoring to enable governance, whether you approach that governance strictly or more casually. Eventually, somewhere in the chain, people have to know where to find the values they need. And ensuring that kind of uptime is where we can help; send us an email at hello@dataculpa.com.