The "Shift Left" Revolution: Why Your PySpark Pipelines Need Unit Tests (and How to Do It)

Michal Milosz
February 16, 2026
9 min read

In the data engineering world, we have a bad habit. We like to test in production.

We write some SQL or PySpark code, we click "Run" in the Databricks notebook, and if we see green checkmarks, we say: "It works!" Then we schedule it, go home, and hope for the best.

But "It runs" does not mean "It works."

Does the logic handle null values correctly? What happens if the input is empty? Did that complex join actually filter the duplicates, or did it just look like it did?

Software Engineers solved this problem 20 years ago. They call it "Shift Left" - moving testing to the earliest possible stage of development. In 2026, it is time for Data Engineers to stop acting like cowboys and start acting like Software Engineers.

Today, I want to talk about Unit Testing in PySpark, how to save cloud costs by testing locally, and how to automate it all with Databricks Asset Bundles (DABs) and Azure DevOps.

The Problem: "Integration Testing" is Expensive

Most data teams rely on Integration Tests. This means running the full pipeline on real (or sampled) data in the cloud.

The problem? It is slow and expensive. To test a simple logic change in a Gold table, you often have to:

  1. Spin up a cluster (5 minutes).
  2. Read data from ADLS.
  3. Run the whole job.
  4. Check the output.

If you have a bug, you fix it and wait another 10 minutes. This feedback loop is too long. And worse, you are paying for Databricks compute just to check if 1 + 1 = 2.

The Solution: Unit Testing PySpark

Unit testing is different. It doesn't test the pipeline. It tests the transformation logic.

To do this effectively, we need to change how we write code. We need to stop writing massive "spaghetti code" notebooks and start writing Pure Functions.

The "Spaghetti" Way (Bad)

# A typical notebook cell
from pyspark.sql.functions import col

df = spark.read.table("sales")
df_clean = df.filter(col("amount") > 0).withColumn("tax", col("amount") * 0.23)
df_clean.write.saveAsTable("clean_sales")

This is impossible to unit test because the logic (filtering/math) is mixed with I/O (reading/writing). You need a live workspace and a real sales table just to execute it.

The "Shift Left" Way (Good)

# logic.py
from pyspark.sql import DataFrame
from pyspark.sql.functions import col


def calculate_tax_and_filter(df: DataFrame) -> DataFrame:
    return df.filter(col("amount") > 0).withColumn("tax", col("amount") * 0.23)

Now, I can test calculate_tax_and_filter without reading any files and without even having a Databricks cluster running.

The Toolset: Pytest, Chispa, and Local Spark

To build a robust testing suite, we need three tools.

1. Pytest

This is the industry standard for Python testing. It finds your tests, runs them, and gives you a pretty report.

2. Chispa

This is a library built specifically for PySpark testing, created by Matthew Powers (MrPowers), a prolific open-source contributor in the Spark community. PySpark DataFrames are complex objects; you can't just say assert df1 == df2. Chispa gives us beautiful assertions:

assert_df_equality(actual_df, expected_df)

If they don't match, it shows you exactly which row and column is different.

3. Local Spark (Mocking the Session)

You do not need a 10-node cluster to test a function. You can install pyspark on your laptop (or CI agent) and run a "Local Spark Session." It runs in memory, costs $0 in cloud credits, and starts in 3 seconds.
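
Spinning one up is a single builder call. Here is a minimal sketch; the config values are just sensible defaults for tests, not requirements:

from pyspark.sql import SparkSession

# Runs entirely in-memory on your laptop or the CI agent - no cluster, no cloud credits
spark = (
    SparkSession.builder
    .master("local[2]")                           # two local threads are plenty for tiny test data
    .appName("unit-tests")
    .config("spark.sql.shuffle.partitions", "2")  # the default of 200 only slows small tests down
    .getOrCreate()
)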

A Real-World Example: Testing a Transformation

Let’s say we have a function that classifies customers. Here is how I would write a test for it using pytest and chispa.

import pytest
from pyspark.sql import SparkSession
from chispa.dataframe_comparer import assert_df_equality
from my_pipeline.transformations import classify_customer


# This fixture creates a tiny Spark session on your laptop
@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local").appName("test").getOrCreate()


def test_classify_customer_logic(spark):
    # 1. Prepare Mock Data (No ADLS needed!)
    data = [("Alice", 100), ("Bob", 0), ("Charlie", -50)]
    schema = ["name", "spent"]
    source_df = spark.createDataFrame(data, schema)

    # 2. Define Expected Output (Charlie's negative spend should be dropped)
    expected_data = [("Alice", 100, "Active"), ("Bob", 0, "Inactive")]
    expected_schema = ["name", "spent", "status"]
    expected_df = spark.createDataFrame(expected_data, expected_schema)

    # 3. Run the Logic
    actual_df = classify_customer(source_df)

    # 4. Assert
    assert_df_equality(actual_df, expected_df, ignore_row_order=True)

Notice what happened here?

  • No connection to Azure.
  • No waiting for cluster start.
  • We tested edge cases (negative values) instantly.
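
For context, here is one way classify_customer could be implemented so that the test above passes. This is a hypothetical sketch of the module under test, not code from a real pipeline:

# my_pipeline/transformations.py (illustrative sketch)
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, when


def classify_customer(df: DataFrame) -> DataFrame:
    # Drop rows with invalid negative spend (this is why Charlie disappears from the output)
    valid = df.filter(col("spent") >= 0)
    # Label anyone who has spent something as Active, everyone else as Inactive
    return valid.withColumn("status", when(col("spent") > 0, "Active").otherwise("Inactive"))

One caveat worth knowing: DataFrames built by hand with createDataFrame mark every column as nullable, while computed columns sometimes are not. If your schemas differ only in nullability, pass ignore_nullable=True to assert_df_equality.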

Integrating with CI/CD: The "Guardrail"

Now that we have tests, we need to enforce them. We don't want to rely on developers remembering to run pytest on their laptops.

This is where Azure DevOps and Databricks Asset Bundles (DABs) come in.

The Workflow

  1. Pull Request: A developer creates a PR in Azure DevOps.
  2. The Pipeline Triggers: Before any code is deployed to Databricks, the Azure DevOps agent installs pyspark + pytest.
  3. Run Tests: It executes pytest tests/.
  4. Block or Pass:
    • If 100% of tests pass -> The PR can be merged.
    • If 1 test fails -> The PR is blocked. You cannot deploy bugs.

This is the definition of "Shift Left". We caught the bug in the Pull Request, not in the Production Job.
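
In practice, the guardrail is a small pipeline definition in the repo, wired up as a build validation in the branch policy of your main branch. A minimal sketch, with illustrative names and paths:

# azure-pipelines.yml - a minimal test gate
trigger: none  # the PR build is triggered via a branch policy on the target branch

pool:
  vmImage: "ubuntu-latest"

steps:
  - task: UsePythonVersion@0
    inputs:
      versionSpec: "3.10"

  - script: |
      pip install pyspark pytest chispa
      pip install -e .  # assumes the repo is packaged via setup.py / pyproject.toml
    displayName: "Install dependencies"

  - script: pytest tests/ --junitxml=test-results.xml
    displayName: "Run unit tests"

  - task: PublishTestResults@2
    condition: always()  # publish results even when tests fail, so the PR shows why
    inputs:
      testResultsFormat: "JUnit"
      testResultsFiles: "test-results.xml"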

Coverage Reports

To impress your manager, add pytest-cov. It generates a report showing exactly what percentage of your code is covered by tests. You can display this directly in the Azure DevOps dashboard. "Hey boss, our pipeline has 94% test coverage." That sounds a lot better than "I think it works."
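
Wiring that in is two small changes to the same pipeline: a coverage flag on the pytest call and a publish step. A sketch, under the same assumptions as above (pytest-cov added to the install step):

  - script: pytest tests/ --cov=my_pipeline --cov-report=xml --junitxml=test-results.xml
    displayName: "Run unit tests with coverage"

  - task: PublishCodeCoverageResults@1
    inputs:
      codeCoverageTool: "Cobertura"  # pytest-cov's XML report is Cobertura-compatible
      summaryFileLocation: "$(System.DefaultWorkingDirectory)/coverage.xml"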

Deployment with Databricks Asset Bundles (DABs)

Once the tests pass, how do we deploy?

In the old days, we copied notebooks manually. Today, we use DABs. DABs allow us to define our jobs, clusters, and libraries in YAML files alongside our Python code.

Because our logic is now in proper Python files (modules) instead of Notebooks, DABs can package them into a .whl (Wheel) file and upload them to the cluster automatically.

Your databricks.yml might look like this:

bundle:
  name: sales_pipeline

resources:
  jobs:
    daily_sales:
      tasks:
        - task_key: main_task
          python_wheel_task:
            package_name: my_pipeline
            entry_point: run_pipeline
          libraries:
            - whl: ./dist/*.whl  # the wheel built from this repo

This treats your data pipeline exactly like a software application. Versioned, tested, and packaged.
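
After the merge, a release stage can drive the whole deployment through the Databricks CLI; the target name here is illustrative:

# Validate the bundle definition, deploy it, and (optionally) trigger the job
databricks bundle validate
databricks bundle deploy -t prod
databricks bundle run daily_sales -t prod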

Summary: Is it worth the effort?

I hear you asking: "Is this over-engineering? I just want to move data."

If you are a solo developer maintaining one table, maybe. But if you are working in a team, building a Data Mesh, or managing critical financial data, this is mandatory.

The "Shift Left" approach:

  1. Saves Money: You catch bugs before they burn cloud compute.
  2. Saves Sanity: You can refactor code without fear of breaking hidden logic.
  3. Builds Trust: The business knows the data is verified before it even lands in the dashboard.

We are Data Engineers. Let’s start engineering like we mean it.
