"SQL is the New Regex": Cleaning Messy Data with Databricks ai_query

Andrzej Gabryel
April 9, 2026

As Data Engineers, we spend a massive portion of our week doing work that nobody notices until it breaks: data cleaning.

We build pipelines on Azure Databricks, orchestrate them with Workflows, and scale the compute. But at the end of the day, our pipelines are only as good as the data flowing through them. And real-world data is messy. People make typos. Source systems change their log formats without warning. Third-party APIs send JSON payloads with inconsistent, deeply nested structures.

For decades, our primary weapon against this chaos has been Regular Expressions (Regex). If you needed to extract a product code from a customer review, you wrote a Regex pattern. If you needed to mask an email address, you wrote a Regex pattern.

You had a problem, so you decided to use Regex. Now you have two problems.

Regex is powerful, but it is incredibly brittle. A single unexpected character in the source data can break your extraction logic. Furthermore, Regex is practically unreadable. A complex pattern looks like a cat walked across your keyboard. Maintaining a pipeline filled with complex regexp_extract functions is a nightmare for the next engineer who takes over your code.
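To make that brittleness concrete, here is a minimal Python sketch (the product name and pattern are invented for illustration): a pattern that works perfectly on one row silently returns nothing when the source text drifts by a single character.

```python
import re

# A pattern that extracts a product code such as "X20" after the brand name.
pattern = re.compile(r"SuperPhone (X\d+)")

match = pattern.search("I bought the SuperPhone X20 last Tuesday")
print(match.group(1))  # X20

# One unexpected character in the source data and the extraction silently fails:
match = pattern.search("I bought the Super-Phone X20 last Tuesday")
print(match)  # None
```

No error is raised in the second case; the bad data just flows downstream as a NULL, which is exactly why these bugs go unnoticed until someone queries the silver table.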

In 2026, the data engineering landscape has shifted. We are no longer limited to rigid string matching.

Today, I want to talk about how we can replace complex, unmaintainable Regex with semantic processing using the ai_query function natively in Databricks. We are turning natural language prompts into our new, highly resilient data transformation logic.

The Problem with Traditional String Parsing

Before we look at the AI solution, let's define why the old way is failing us in modern data lakes.

Imagine you work for a retail company. You receive millions of customer feedback comments in a single text column. The business wants you to extract three things from this text: the product mentioned, the sentiment (Positive/Negative/Neutral), and a short summary of the core issue.

A typical row looks like this:

Example row: "I bought the SuperPhone X20 last Tuesday and the battery drains in 2 hours. Terrible experience."

To extract the product name using traditional PySpark, you would need a massive lookup table or a Regex pattern that tries to guess where the noun sits. What about the sentiment? You cannot "Regex" a sentiment. You would traditionally have to call an external Natural Language Processing (NLP) API, which means sending your data out of Databricks, waiting for the response, and joining it back. This adds massive latency, security complexities, and infrastructure overhead.

You end up with a pipeline that is complex, slow, and requires constant updating.

The Solution: ai_query and Unity Catalog

Databricks recognized this friction. With the introduction of AI functions, the goal is simple: bring the model to the data, not the data to the model.

ai_query is a native Databricks function evaluated directly by the Spark engine. It allows you to query Foundation Models (like Meta Llama 3.1, DBRX, or GPT-4o) hosted on Databricks Model Serving endpoints.

This is not a separate API call you have to manage in Python using the requests library. The models sit inside your secure Azure Databricks boundary, and access is governed strictly by Unity Catalog.

Use Case 1: Structured Extraction (SQL)

Let's solve the customer feedback problem. We want to extract the product name, sentiment, and issue.

Instead of writing complex UDFs (User Defined Functions), we will use Databricks SQL. We will instruct the LLM to read the text and return a strictly formatted JSON string. Notice the use of modelParameters => named_struct(...) — this is the correct, production-ready syntax to pass configurations to Mosaic AI endpoints.

SELECT
  feedback_id,
  raw_text,
  ai_query(
    'databricks-meta-llama-3-1-70b-instruct',
    CONCAT(
      'Analyze the following customer feedback. ',
      'Extract the product name, the sentiment (Positive/Negative/Neutral), ',
      'and a 3-word summary of the issue. ',
      'Return ONLY a valid JSON object with keys: "product", "sentiment", "issue". ',
      'Do not include any markdown formatting. ',
      'Feedback: ', raw_text
    ),
    -- CRITICAL: Ensure deterministic results
    modelParameters => named_struct('temperature', 0.0)
  ) AS ai_response
FROM
  bronze.customer_feedback
WHERE
  processed_flag = FALSE;

Once we have this ai_response column containing the JSON string, we simply use native Spark functions (from_json) to parse it into proper columns. With one SQL statement, we replaced an entire NLP pipeline.
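As a mental model, the parsing step behaves like this plain-Python sketch (a stand-in for Spark's from_json semantics, not actual Spark code): a well-formed response becomes typed fields, and anything else becomes NULL.

```python
import json

def parse_ai_response(raw):
    """Rough stand-in for from_json: return a dict on success, None on malformed input."""
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    # from_json yields NULL for anything that is not the expected object shape
    return parsed if isinstance(parsed, dict) else None

row = parse_ai_response('{"product": "SuperPhone X20", "sentiment": "Negative", "issue": "battery drains fast"}')
print(row["sentiment"])  # Negative

# A chatty model response that is not pure JSON simply becomes None / NULL:
print(parse_ai_response("Sure! Here is the JSON you asked for"))  # None
```

That NULL-on-failure behavior matters: it is what makes the Dead Letter Queue pattern described later possible.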

Use Case 2: Data Standardization (PySpark)

Another massive headache for Data Engineers is standardizing formats.

Let's say you are ingesting legacy CRM data. The "Address" column is a free-text field with wild variations:

Example variations:

  • "123 Main St, Apt 4B, New York, NY 10001"
  • "Los Angeles California 90028, 456 Sunset Blvd"

Writing SPLIT or Regex logic to parse street, city, state, and zip from these variations is practically impossible to do with 100% accuracy. Let's handle it in PySpark.

Note: We use %s inside the prompt template because we are passing it to Spark's format_string function, not Python's native string formatting.

from pyspark.sql.functions import col, expr, from_json

# Use %s for Spark's format_string.
# The template is spliced into a single-quoted SQL string literal below,
# so it must not contain single quotes — use double quotes instead.
prompt_template = """
You are a data standardization bot. Parse the following address into components.
If a component is missing, return "NULL".
Format the output as a JSON object with keys: "street", "city", "state", "zip".
Address: %s
"""

# Apply ai_query via expr(), pinning temperature to 0.0 for deterministic output
df_standardized = df_raw.withColumn(
    "parsed_address_json",
    expr(f"""
        ai_query(
            'databricks-meta-llama-3-1-8b-instruct',
            format_string('{prompt_template}', address),
            modelParameters => named_struct('temperature', 0.0)
        )
    """)
)

# Extract the JSON into a typed Schema
schema = "street STRING, city STRING, state STRING, zip STRING"

df_final = df_standardized.withColumn(
    "address_struct",
    from_json(col("parsed_address_json"), schema)
).select("id", "address", "address_struct.*")

display(df_final)

The LLM is smart enough to understand that "Los Angeles" is the city, even if it comes first in the string.

Production Best Practices: The Senior Engineer's Checklist

⚠️ Warning: If you run ai_query on a billion-row table without a strategy, your Azure bill will be your last act at the company. LLMs are not standard CPU functions — they are expensive and relatively slow.

  • Cost & Incremental Processing
    Never run AI functions on entire historical datasets. Use Delta Live Tables (DLT) or Spark Structured Streaming to ensure you only process new rows. Calculate your token costs before moving from Dev to Prod.
  • Semantic Filtering
    Before calling the LLM, use a cheap SQL LIKE or ILIKE filter. If you are extracting issues from "complaints", first filter the data for keywords like "broken", "return", "worst", or "error". Only send the "interesting" rows to the expensive Foundation Model. Save your compute.
  • The "Dead Letter Queue" (DLQ) Pattern
    LLMs can "hallucinate" or return malformed JSON, even with strict prompting. Always validate the output. Check for Nulls: if from_json fails to parse the AI response, the resulting struct will be NULL. Do not fail the whole pipeline — route failed rows into a quarantine_table for manual review or a retry with a larger model.
  • Determinism is Mandatory
    Always set temperature = 0.0. In Data Engineering, we need idempotent pipelines. The same input must produce the exact same output every time the pipeline runs. If you leave the temperature at default, your tests will be flaky and your data will drift.

Setting Up Governance in Unity Catalog

Implementing this requires proper governance. You cannot just type ai_query and expect it to work.

Just like a Delta table, a Model Serving endpoint is governed by Unity Catalog. You must explicitly grant permission to the Service Principal or group running the data pipeline.

-- Grant the pipeline identity permission to use the model
GRANT EXECUTE ON ENDPOINT `databricks-meta-llama-3-1-70b-instruct` TO svc_prod_pipeline;

This ensures that only authorized production jobs can rack up token charges against your Azure Databricks billing account.

Conclusion: A Paradigm Shift in Data Cleaning

The introduction of ai_query does not mean Data Engineers are becoming Prompt Engineers. We are still responsible for the architecture, the performance, the orchestration, and the data modeling.

However, it does mean we can finally stop fighting with Regular Expressions for complex text extraction tasks. By treating Foundation Models as just another tool in our PySpark utility belt, we make our pipelines more resilient to upstream data changes, and we make our code significantly more readable.

The next time you find yourself about to spend three hours debugging a Regex string you copied from StackOverflow — stop. Write an AI prompt instead. Your future self, and the engineer who inherits your pipeline, will thank you.
