MongoDB’s Unique Challenges for Data Quality and Application Development

Sep 9, 2022 | Blog, Data Culpa

There’s lots to love about MongoDB. Compared to traditional databases, MongoDB makes developing and standing up new applications fast, thanks to the implicit schema of the collections. Need to add a new field? No big deal, just start using it. Great!

The hierarchical documents and inherent JSON-native capabilities are wonderful to work with for application development: the natural hierarchy lends itself well to mapping into data structures in applications, and most modern programming environments make the data marshalling transparent now. The smart “upsert” capabilities and great client libraries are just two more reasons we love working with Mongo.

Type Changes and Free-floating Schemas

However, the very flexibility that makes Mongo such a breeze to work with can sometimes make it a burden. For example, “wandering” type changes can take significant time to track down. I once wasted an entire afternoon debugging a change where I had inadvertently changed an ObjectId to a string—a tricky mistake to catch because on the screen everything looks like a string.
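To make the "wandering type" problem concrete, here is a minimal sketch that scans a batch of documents and reports any field that holds more than one type. It uses plain Python dicts so it runs without a database; the field names and sample data are hypothetical. With a driver such as PyMongo, documents also arrive as dicts, so the same idea should carry over, and an ObjectId value would be expected to show up under the type name "ObjectId".

```python
from collections import defaultdict

def field_types(documents):
    """Map each top-level field name to the set of type names seen for it."""
    seen = defaultdict(set)
    for doc in documents:
        for key, value in doc.items():
            seen[key].add(type(value).__name__)
    return dict(seen)

def mixed_type_fields(documents):
    """Return only the fields that hold more than one type across the batch."""
    return {
        name: sorted(types)
        for name, types in field_types(documents).items()
        if len(types) > 1
    }

# Hypothetical sample: owner_id drifts from an int to a string.
docs = [
    {"_id": 1, "owner_id": 42},
    {"_id": 2, "owner_id": "42"},
]
drift = mixed_type_fields(docs)
print(drift)
```

Both values of owner_id print the same on screen, which is why this kind of drift is easy to miss by eye and worth checking mechanically.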

Among our customers running production databases on Mongo, we’ve found shocking numbers of schema variations. And we do mean shocking: in some customer production collections, we’ve found over 1,000 schema variations! This lack of predictable schemas creates problems down the line: when your application reads a record, it is hard to predict which fields will be present, and therefore hard to predict how the code will behave.

How is this different from the usual desktop tools? There are many GUI tools to help work with Mongo, but none of them consider schema variations in aggregate. 

Schema Changes in a Customer’s MongoDB Installation

Let’s take a real example from one of our customers. Because this is a customer’s production system, not simply a Data Culpa demo system, we’ve blacked out the field names to protect the innocent (except for those in the last column, shown with permission):

Schema variations in a MongoDB production database, as seen in the Data Culpa Validator UI.

Here we’re looking at a single collection. 

  • The leftmost column shows variations of schemas, meaning unique combinations of keys within the objects. 
  • The second column shows the fields in the root of the documents—we can see children fanning out to the right. 

Some of the fields are scalars, and a few are objects with up to two layers of children under them. 
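The idea of a schema variation as a unique combination of keys can be sketched in a few lines. This is an illustration under our own assumptions, not how Data Culpa Validator is implemented: each document is reduced to a sorted tuple of dotted key paths (including nested children), and identical tuples are counted together. The sample documents are hypothetical.

```python
from collections import Counter

def key_paths(doc, prefix=""):
    """Yield a dotted path for every key in a (possibly nested) document."""
    for key, value in doc.items():
        path = f"{prefix}{key}"
        yield path
        if isinstance(value, dict):
            # Recurse into child objects so nested keys count toward the schema.
            yield from key_paths(value, prefix=f"{path}.")

def schema_signature(doc):
    """A hashable fingerprint of the document's shape: its sorted key paths."""
    return tuple(sorted(key_paths(doc)))

def count_schema_variations(documents):
    """Count how many documents share each schema signature."""
    return Counter(schema_signature(doc) for doc in documents)

# Hypothetical sample: two documents share a shape, one differs.
docs = [
    {"_id": 1, "name": "a", "address": {"city": "x"}},
    {"_id": 2, "name": "b", "address": {"city": "y"}},
    {"_id": 3, "name": "c", "is_deleted": False},
]
variations = count_schema_variations(docs)
for signature, count in variations.most_common():
    print(count, signature)
```

This sketch ignores value types and array contents; a fuller version would have to decide how to treat both.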

In this example, we’re highlighting one of the schema variations, about the fifth one down on the left. You can immediately see which fields are present in that object and which are not. In the rightmost column, where we have not obscured the field names, you can see that is_deleted is not present in this schema variation.

Analyzing MongoDB Schema Changes with Data Culpa Validator

This is just one view of schemas available in our data observability platform, Data Culpa Validator.

Viewing the differences between different schemas found in a single production MongoDB (schema published with permission from the customer). The yellow indicates the baseline for the comparison; the other schemas show the additions or deletions from the yellow baseline.
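A comparison against a baseline schema boils down to two set differences. The sketch below is a simplified, hypothetical illustration of that idea (the function and field names other than is_deleted are ours, not part of the product): given the key paths of a baseline schema and of another variation, it reports which paths were added and which were removed.

```python
def diff_against_baseline(baseline_paths, other_paths):
    """Report key paths added to or removed from a baseline schema."""
    baseline, other = set(baseline_paths), set(other_paths)
    return {
        "added": sorted(other - baseline),
        "removed": sorted(baseline - other),
    }

# Hypothetical schemas: the variant drops is_deleted and gains created_at.
baseline = ["_id", "name", "is_deleted"]
variant = ["_id", "name", "created_at"]
diff = diff_against_baseline(baseline, variant)
print(diff)
```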

In other views, Data Culpa Validator enables you to see the volume of records seen at different points in the past to get a feel for which schema variations are dominant, which have emerged, and which are dwindling. As seen in the animation above, Data Culpa Validator also enables you to compare schema arrangements across large sets of documents in aggregate. Our belief is that understanding both variations over time and variations across documents in aggregate is key to understanding new defects in application behavior and tracking down their root causes.
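As a rough illustration of tracking schema variations over time, the sketch below buckets documents by day and counts how many documents of each shape arrived in each bucket. It assumes every document carries a datetime in a field we have called created_at, which is a hypothetical name, and it looks only at top-level keys to keep the example short.

```python
from collections import Counter, defaultdict
from datetime import datetime

def variation_volume_by_day(documents, timestamp_field="created_at"):
    """Count documents per schema signature, grouped by calendar day."""
    volumes = defaultdict(Counter)
    for doc in documents:
        day = doc[timestamp_field].date().isoformat()
        signature = tuple(sorted(doc))  # top-level keys only
        volumes[day][signature] += 1
    return volumes

# Hypothetical sample: a new shape (with is_deleted) emerges on the second day.
docs = [
    {"_id": 1, "created_at": datetime(2022, 9, 1, 8, 0)},
    {"_id": 2, "created_at": datetime(2022, 9, 1, 9, 30)},
    {"_id": 3, "created_at": datetime(2022, 9, 2, 7, 15), "is_deleted": False},
]
volumes = variation_volume_by_day(docs)
for day in sorted(volumes):
    print(day, dict(volumes[day]))
```

Comparing the counters for consecutive days gives a simple view of which variations are dominant, which are emerging, and which are dwindling.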

That’s why we have worked to optimize Data Culpa Validator for a variety of schema scenarios, including:

  • alerting when new fields are added to an object
  • tracking the frequency of usage for a given path down the hierarchy and the consistency of types within those fields
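The first scenario, alerting on new fields, can be approximated with a small amount of state. The class below is a hypothetical sketch, not Data Culpa Validator's API: it remembers every top-level field it has seen and returns any field that appears for the first time, which a caller could then turn into an alert.

```python
class NewFieldWatcher:
    """Remember the fields seen so far and flag any that appear for the first time."""

    def __init__(self, known_fields=()):
        self.known = set(known_fields)

    def observe(self, doc):
        """Return the fields in doc not seen before, then remember them."""
        fresh = sorted(set(doc) - self.known)
        self.known.update(fresh)
        return fresh

# Hypothetical usage: is_deleted is flagged once, on its first appearance.
watcher = NewFieldWatcher(known_fields={"_id", "name"})
first = watcher.observe({"_id": 1, "name": "a"})
second = watcher.observe({"_id": 2, "name": "b", "is_deleted": True})
third = watcher.observe({"_id": 3, "name": "c", "is_deleted": False})
print(first, second, third)
```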

Monitoring your Mongo collections with Data Culpa Validator means you can get ahead of these errors before they impact your applications and analytics. 

Try Monitoring MongoDB Yourself

If this sounds like something your team can use, we offer a three-day Proof-of-Concept trial that shows you the variation across your historical data, using our open source MongoDB connector with our SaaS. Our pricing is transparent. We also support running on-prem, whether on a laptop with Docker or in your data center. Get in touch by writing to hello@dataculpa.com.
