Stop Profiling Data Manually: How AI Can Improve Data Quality in Databricks

Andrzej Garyel
March 24, 2026
5 min read

The Real Problem Starts After Ingestion

Getting data into Bronze is the easy part. Databricks Auto Loader, COPY INTO, a couple of Spark jobs. You can move terabytes from source to landing zone before lunch. The medallion architecture looks clean on the whiteboard. Everything seems fine, but can we trust this data enough to promote it to Silver?

That's where things get complicated. New sources show up with column names that could mean anything, inconsistent formatting, and quality issues that no schema definition will ever catch. A column called status has three clean values today. Tomorrow there's a fourth, misspelled. A date_of_birth field is YYYY-MM-DD in one partition and DD/MM/YYYY in the next. An amount column has negative values that look like bugs but are refunds. Or maybe they are bugs. Good luck telling the difference from the schema alone.

If you've done data engineering for any length of time, you know the drill. Every new dataset means pulling up a notebook, eyeballing column distributions, checking null percentages, counting distinct values, hunting for outliers, and trying to reverse-engineer what each field means in business terms. Multiply that by hundreds of tables across multiple teams, and you've got a bottleneck that never goes away. The work is repetitive, but it doesn't compound. Every new table is a fresh start.

Where AI Actually Helps (and Where It Doesn't)

Let's be clear about what AI does here. It doesn't replace your judgment. It handles the tedious first pass: scanning a schema, reviewing a sample, and coming back with draft observations and suggested rules. The kind of work that takes an engineer hours but follows predictable enough patterns that a model can do a decent first cut in minutes.

Schema and Sample Analysis

Hand an LLM a table's column names, data types, and a sample of actual values, and it can produce surprisingly useful analysis. It'll flag that customer_email probably contains PII. It'll notice a country_code column with only two-letter values and suggest ISO 3166-1 alpha-2 validation. It'll point out that deleted_at is 94% null, which might be totally normal for a soft-delete pattern, or might mean something is broken upstream. You still need to decide which interpretation is right, but at least the question is on the table.
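Mechanically, "handing an LLM the schema and a sample" is just string assembly. A minimal sketch of building such a prompt (the table name, schema, and prompt wording here are illustrative, not a fixed API):

```python
import json

def build_profile_prompt(table, schema, sample_rows):
    """Assemble an analysis prompt from schema metadata and sampled values."""
    return (
        f"You are reviewing the table `{table}`.\n"
        f"Schema: {json.dumps(schema)}\n"
        f"Sample rows: {json.dumps(sample_rows, default=str)}\n"
        "For each column, state its likely business meaning, whether it may "
        "contain PII, and any quality checks you would suggest. "
        "Answer as a JSON list of per-column findings."
    )

prompt = build_profile_prompt(
    "bronze.customers",
    {"customer_email": "string", "country_code": "string", "deleted_at": "timestamp"},
    [{"customer_email": "a@example.com", "country_code": "PL", "deleted_at": None}],
)
print(prompt)
```

Keeping the prompt template in one function makes it easy to iterate on wording later, which matters once you start feeding review feedback back into the templates.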

Drafting Quality Rules

From that analysis, the model can draft concrete rules: not-null constraints for required fields, allowed-value lists for categorical columns, range boundaries for numeric columns, format checks for emails and phone numbers, even cross-field consistency checks where relationships between columns are obvious enough to infer.

Databricks Labs' DQX framework supports exactly this workflow. Its profiler crunches column-level statistics (min, max, nulls, distinct counts, distributions) and generates rule candidates automatically. Layer AI-assisted generation on top, and you get rules that reflect semantic context, not just statistical summaries. There's a difference between "this column has 3 distinct values" and "this column looks like an order status field that should only contain NEW, PROCESSING, and DONE."

Flagging Sensitive Data

This is one area where LLMs genuinely excel. Column names and sample values are often enough to identify fields containing names, emails, phone numbers, national IDs, or other sensitive information, and to suggest governance tags accordingly.

Databricks' own internal system, LogSentinel, proves this works at scale. They use LLMs to classify columns across the platform's internal logs and databases, hitting 92% precision and 95% recall for PII detection. That cut their manual audit review time from weeks to hours.

A Practical Workflow (Using Tools You Already Have)

Nothing here requires exotic infrastructure. If you're running Databricks with Unity Catalog, DLT, and standard notebooks, you have everything you need.

Step 1: Profile the Data

When a new source lands in Bronze, run a profiling step. Collect schema metadata, compute summary stats (null counts, distinct values, min/max, distributions), and pull a representative sample. DQX Profiler handles this out of the box. What you get back is a structured picture of what the data looks like, which is often quite different from what the schema says it should look like.
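The statistics themselves are nothing exotic. Here's a minimal pure-Python illustration of what a per-column profile contains; in practice you'd compute this in Spark over the full table, and DQX Profiler does it for you:

```python
from collections import Counter

def profile_column(rows, col):
    """Compute basic profile stats for one column over a sample of rows."""
    values = [r.get(col) for r in rows]
    non_null = [v for v in values if v is not None]
    stats = {
        "null_pct": round(100 * (len(values) - len(non_null)) / len(values), 1),
        "distinct": len(set(non_null)),
        "top_values": Counter(non_null).most_common(3),
    }
    # Min/max only make sense when the non-null values are numeric
    if non_null and all(isinstance(v, (int, float)) for v in non_null):
        stats["min"], stats["max"] = min(non_null), max(non_null)
    return stats

sample = [
    {"status": "NEW", "amount": 120.0},
    {"status": "PROCESSING", "amount": -15.5},  # refund? bug? the profile can't tell
    {"status": "DONE", "amount": 89.9},
    {"status": "DONE", "amount": None},
]
print(profile_column(sample, "status"))
print(profile_column(sample, "amount"))
```

Notice that the negative amount shows up in the min, but nothing in the statistics says whether it's a refund or a bug. That's exactly the gap the semantic analysis in Step 2 is meant to narrow.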

Step 2: Let the LLM Take a First Pass

Feed the profile, column names, and sample values to an LLM on a Model Serving endpoint in your workspace. It returns draft quality rules in a structured format, each one specifying the target column, the type of check, and why it thinks the check makes sense. Takes about a minute. The DQX App wraps this up with a frontend where you can describe quality requirements in plain English and get back executable DQX-compatible checks.
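Because the model's output is structured, you can validate its shape before any human looks at it. The response below is simulated and the rule format is an assumed one, not the DQX schema; the point is the validation pattern:

```python
import json

REQUIRED_KEYS = {"column", "check", "rationale"}

def parse_draft_rules(llm_response: str):
    """Parse the model's JSON output and separate well-formed rule drafts."""
    rules = json.loads(llm_response)
    valid, rejected = [], []
    for rule in rules:
        (valid if REQUIRED_KEYS <= rule.keys() else rejected).append(rule)
    return valid, rejected

# Simulated model output; in practice this comes back from a Model Serving
# endpoint call (endpoint name and payload shape depend on your workspace).
response = json.dumps([
    {"column": "customer_email", "check": "matches_email_format",
     "rationale": "Values look like email addresses; likely PII."},
    {"column": "status", "check": "is_in: [NEW, PROCESSING, DONE]"},  # missing rationale
])
valid, rejected = parse_draft_rules(response)
print(f"{len(valid)} valid, {len(rejected)} need attention")
```

Requiring a rationale per rule is deliberate: it gives the reviewer in Step 3 something concrete to accept or reject instead of a bare constraint.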

Step 3: You Review and Fix What the Model Got Wrong

This is the step that matters most. Go through the generated rules. Adjust thresholds. Remove checks that don't apply to your domain. Add rules that require business knowledge the model couldn't possibly have, like knowing that negative amounts in this table are always errors, even though in another system they'd be valid refunds. The model builds the scaffolding. You do the finish work.

Step 4: Ship It into the Pipeline

Approved rules get applied as DLT expectations, integrated into DQX validation workflows, or registered in Data Quality Monitoring for ongoing drift detection. You decide what happens to records that fail: drop them, flag them with a warning column, or quarantine them for separate review. Different rules can have different severity levels.
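In DLT the rules become expectations (`@dlt.expect`, `@dlt.expect_or_drop`); the severity routing itself is straightforward. A framework-free sketch of the drop/warn/quarantine logic, with illustrative rule predicates:

```python
# Each rule: (name, predicate over a record, severity)
RULES = [
    ("status_allowed", lambda r: r["status"] in {"NEW", "PROCESSING", "DONE"}, "quarantine"),
    ("amount_not_null", lambda r: r["amount"] is not None, "warn"),
]

def route(records):
    """Split records into clean and quarantined based on rule severity."""
    clean, quarantined = [], []
    for rec in records:
        failed = [(name, sev) for name, pred, sev in RULES if not pred(rec)]
        if any(sev == "quarantine" for _, sev in failed):
            quarantined.append({**rec, "_failed": [n for n, _ in failed]})
        else:
            if failed:  # warning-level failures travel with the record
                rec = {**rec, "_warnings": [n for n, _ in failed]}
            clean.append(rec)
    return clean, quarantined

records = [
    {"status": "NEW", "amount": 10.0},
    {"status": "PROCESING", "amount": 5.0},  # typo -> quarantined
    {"status": "DONE", "amount": None},      # warn only -> passes with a flag
]
clean, quarantined = route(records)
```

Attaching the failed rule names to quarantined records is what makes the quarantine table reviewable later, rather than just a pile of rejected rows.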

Some Examples That Make This Concrete

Catching values that shouldn't exist. Profile a status column in an orders table. The sample shows exactly three values: NEW, PROCESSING, DONE. Set up an allowed-values rule. Now when someone's integration starts sending PROCESING (typo), CANCELLED (new state nobody communicated), or nulls that shouldn't be there, it gets caught at the Bronze-to-Silver boundary instead of showing up as a weird spike in a Monday morning dashboard.
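A useful variant of that check reports *which* unexpected values appeared and how often, so the alert tells you it's a typo versus a new state. A minimal sketch:

```python
from collections import Counter

ALLOWED_STATUS = {"NEW", "PROCESSING", "DONE"}

def unexpected_values(values, allowed):
    """Return each out-of-set value with its occurrence count."""
    return dict(Counter(v for v in values if v not in allowed))

batch = ["NEW", "DONE", "PROCESING", "CANCELLED", "PROCESING", None]
print(unexpected_values(batch, ALLOWED_STATUS))
# {'PROCESING': 2, 'CANCELLED': 1, None: 1}
```

Seeing `PROCESING: 2` tells you it's a broken integration; seeing `CANCELLED: 40000` tells you the business added a state nobody communicated.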

Spotting PII that's hiding in plain sight. A customer_email column is straightforward, the name gives it away. But what about a column called contact_ref that turns out to contain Social Security numbers? The model catches it from the sample values, not the name. Flag it, tag it in Unity Catalog, suggest masking rules for downstream consumers.
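A name-agnostic check like the one that catches `contact_ref` can be approximated with pattern matching over sample values. The sketch below uses the common AAA-GG-SSSS shape for US Social Security numbers; a real classifier (or an LLM pass) casts a much wider net:

```python
import re

SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")

def looks_like_ssn_column(sample_values, threshold=0.8):
    """Flag a column if most non-null sampled values match the SSN shape."""
    non_null = [v for v in sample_values if v]
    if not non_null:
        return False
    hits = sum(bool(SSN_PATTERN.match(str(v))) for v in non_null)
    return hits / len(non_null) >= threshold

print(looks_like_ssn_column(["123-45-6789", "987-65-4321", None]))  # True
print(looks_like_ssn_column(["REF-001", "REF-002"]))                # False
```

The threshold matters: a column where 2% of values happen to match a pattern is noise, while 90% is a governance incident waiting to be tagged.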

Relationships between columns that don't add up. Some records have a delivery_date filled in but order_status is still NEW. That's logically wrong. You can't have a delivery date before the order has been processed. A cross-field rule (order_status != 'NEW' OR delivery_date IS NULL) catches this. Writing rules like this requires understanding what columns mean relative to each other, exactly the kind of inference LLMs are decent at when they have column names and sample data.
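The cross-field rule is just a per-record predicate. A sketch of the same check outside SQL:

```python
def delivery_consistent(record):
    """A NEW order must not have a delivery date yet."""
    return not (record["order_status"] == "NEW"
                and record["delivery_date"] is not None)

orders = [
    {"order_status": "NEW", "delivery_date": None},
    {"order_status": "NEW", "delivery_date": "2026-03-01"},  # logically wrong
    {"order_status": "DONE", "delivery_date": "2026-03-02"},
]
bad = [o for o in orders if not delivery_consistent(o)]
print(bad)  # the one contradictory record
```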

Source problems that hide in wide tables. A quantity column declared as integer, but a handful of records contain "N/A", "TBD", or empty strings. Buried in a table with 60 columns, you'd probably miss this during manual review. A systematic profiling scan won't.
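Catching the `quantity` problem systematically comes down to scanning for values that fail to parse as the declared type; the placeholder strings below mirror the ones mentioned above:

```python
def non_integer_values(values):
    """Return values that can't be parsed as integers (None is a plain null, not a type error)."""
    bad = []
    for v in values:
        if v is None:
            continue
        try:
            int(v)
        except (TypeError, ValueError):
            bad.append(v)
    return bad

quantities = [3, "7", "N/A", "", "TBD", None, 12]
print(non_integer_values(quantities))  # ['N/A', '', 'TBD']
```

Run once per declared-numeric column across all 60 columns and the "N/A" strings stop hiding.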

Why You Still Can't Skip the Human Review

Here's the part where I push back on the "AI magic" narrative a bit.

Sample data lies. Run profiling on the first 1,000 rows of a partitioned table and you might see 0% nulls in a column. The next partition? 30% null, because something changed in the source system last month. If you blindly ship a not-null constraint based on that sample, your pipeline breaks on the next batch.

Similarly, a model might see a score column with values between 0 and 100 and suggest a range check. Makes sense, except this score is a raw sensor reading with a legitimate range of -40 to 300. The suggestion was reasonable. It was also wrong.

What changes is the kind of work you're doing. Instead of staring at a blank notebook and writing every null check, format validation, and allowed-values constraint by hand, you start with a draft that's mostly right. You spend your time reviewing and correcting instead of authoring from scratch. It's faster, less mind-numbing, and produces more consistent results across teams because everyone starts from the same generated baseline instead of reinventing their own approach.

What Your Organization Gets Out of This

Faster onboarding of new data sources. The gap between "data landed in Bronze" and "data is validated and ready for Silver" shrinks significantly. The profiling and rule generation cycle that used to eat a full day of focused work can be done in under an hour, and most of that hour is review, not writing.

Teams that do things the same way. One of the most annoying problems in large data orgs: Team A uses Great Expectations, Team B writes raw SQL assertions, Team C does manual spot checks, and nobody's rules look alike. Generating rules from a common framework like DQX means standardized formats, consistent naming, and uniform severity levels across the board.

Problems caught before they become incidents. When quality rules go live faster, issues get flagged earlier. A schema drift that would have gone unnoticed until a dashboard broke on Monday morning instead gets caught at the Bronze-to-Silver transition on Saturday night. Data Quality Monitoring adds another layer by detecting anomalies in table freshness and completeness using historical patterns.

Governance that isn't an afterthought. Automated PII detection means fewer columns slip through without proper classification. If you're operating under GDPR, HIPAA, or SOX, catching sensitive fields early is the difference between a smooth audit and a scramble.

Getting Started Without Overthinking It

You don't need a perfect setup. Start small, see what works, and expand.

  • Week one: Run DQX Profiler on your ten most critical Bronze tables. Look at the generated statistics and rule candidates. Don't deploy anything yet, just see what it finds.
  • Week two: Stand up a Model Serving endpoint and test AI-assisted rule generation on a few tables. Compare what it produces against rules your team would write by hand. Where does it get things right? Where does it miss?
  • Week three: Pick the best rules, integrate them into your DLT pipelines as expectations. Set up quarantine tables for records that fail. Track how many issues get caught before reaching Silver.
  • Week four: Expand to more domains. Turn on Data Quality Monitoring for drift detection on your key tables. Take what you've learned about where the model succeeds and fails, and fold that back into your prompt templates to improve output quality over time.
