Designing Scalable Data Pipelines: Batch, Streaming, and Layered Architectures

Michal Milosz
June 13, 2025
16 min read

Modern data-driven organizations face the challenge of processing ever-increasing volumes of information from both historical (batch) and real-time (streaming) sources. Scalable data pipelines are crucial to managing the volume, velocity, and variety of this data while delivering timely and actionable insights.

Broadly, batch processing involves the periodic handling of large datasets - such as generating nightly reports - prioritizing throughput and comprehensive analysis, though with higher latency. In contrast, stream processing handles data in real time, enabling low-latency analytics but often struggling with complex historical queries.

In real-world systems, the strengths of both paradigms are often needed. To address this, architects have proposed hybrid data pipeline architectures - notably the Lambda Architecture, Kappa Architecture, and Medallion Architecture - each offering a unique approach to blending batch and streaming workflows.

This article explores these three architectures in depth, illustrating their data flows, examining their advantages and limitations, and helping you determine when and why to choose each based on your data needs and operational goals.

The Lambda Architecture

The Lambda Architecture (introduced by Nathan Marz) splits data processing into two parallel paths: a batch layer and a speed (real-time) layer, with a serving layer that merges their outputs for querying. Incoming data is ingested—often via a distributed log like Kafka—and simultaneously fed into both the batch and speed pipelines.

The batch layer ingests immutable raw data and periodically computes comprehensive, precomputed “batch views” (e.g., using Hadoop or Spark).

The speed layer processes live data streams (e.g., using Storm or Flink) to produce low-latency “real-time views.”

The serving layer then unifies these results: queries combine the incremental, up-to-date real-time views with the comprehensive but slower batch views, so answers are both fresh and accurate.

Figure: Lambda architecture (unified serving layer) – data flows in parallel to batch and real-time layers before merging into a serving layer.

This design lets Lambda balance low-latency results with complete historical accuracy. The batch layer’s immutability and ability to reprocess the entire dataset ensure correctness and fault tolerance: if the logic changes, everything can be recomputed from the raw data. Meanwhile, the speed layer provides near-real-time updates (often measured in milliseconds). By using different technologies in each layer, Lambda can support diverse query engines and data formats – for example, one could use Spark for batch analytics and a streaming engine for instant updates. Importantly, Lambda was designed to handle massive scale: Marz built it to “handle massive quantities of data” by leveraging both batch and stream processing.
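
To make the merge concrete, here is a minimal, hypothetical sketch of a serving-layer query. The names (`batch_view`, `speed_view`, `query`) are illustrative, not from any specific framework; in practice the batch view would come from a nightly Spark job and the speed view from a stream processor.

```python
# Batch view: complete, accurate counts recomputed over all raw data,
# current only up to the last batch run.
batch_view = {"page_a": 10_400, "page_b": 7_250}

# Speed view: low-latency increments for events that arrived since
# the last batch run, maintained by the streaming layer.
speed_view = {"page_a": 12, "page_c": 3}

def query(page: str) -> int:
    """Merge both views: batch gives completeness, speed gives freshness."""
    return batch_view.get(page, 0) + speed_view.get(page, 0)

print(query("page_a"))  # batch total plus real-time increments: 10412
```

When the next batch run completes, its results absorb the speed-layer increments and the speed view is reset – which is exactly the reconciliation complexity listed among the drawbacks below.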

Advantages of Lambda Architecture:

  • Accuracy + Freshness - Achieves high data quality by combining complete historical views (batch) with real-time views (speed). The batch layer can re-compute analytics on all data to correct errors or accommodate schema changes.
  • Fault Tolerance - Immutable batch data and replayable streams ensure no data loss.
  • Flexible Tooling - Different tools can optimize each layer (e.g., Hadoop/Spark for heavy batch workloads, Flink/Storm for streaming).
  • Scalability - Each layer can be scaled independently (e.g., horizontal scaling of Spark clusters vs. streaming clusters).

Drawbacks of Lambda Architecture:

  • Increased Complexity - Two separate codebases/pipelines must be maintained—one for batch logic and one for streaming logic—often duplicating business logic. This doubles the maintenance burden and integration work.
  • Latency of Reconciliation - There is inherent complexity in merging batch and real-time results. Queries must handle both views consistently.
  • Slower “True” Real-Time - Even though the speed layer provides quick updates, the final “correct” result may only be available after the next batch run—which can be minutes or hours later.
  • Operational Overhead - Running and coordinating multiple pipelines and storage systems increases operational burden.

When to Use Lambda:

This pattern shines when you must combine real-time insights with reliable historical analytics. For example, an adtech platform might use Lambda to serve up-to-the-second counters (via the speed layer) while still periodically recomputing totals for accuracy.
Lambda is well-suited when both low latency and high data completeness are critical - and when the team has the capacity to manage its complexity.
If ultra-low end-to-end latency is paramount, or if the team prioritizes simplicity, an alternative like the Kappa architecture may be a better fit.

The Kappa Architecture

The Kappa Architecture (proposed by Jay Kreps) simplifies Lambda by eliminating the separate batch layer altogether. In Kappa, all incoming data is written to an append-only event log (e.g. Kafka), which serves as the single source of truth. A streaming engine (e.g. Kafka Streams, Flink, Spark Streaming) then continuously processes that log to update views and feed any downstream systems. Re-computation and backfills are handled by replaying the log with updated logic, instead of maintaining a parallel batch pipeline. In effect, “batch” processing is just a special case of stream processing in Kappa.

Figure: Kappa architecture – a single real-time pipeline (blue) streams data into one system, serving both live and batch queries.

Kappa’s core premise is a unified streaming pipeline: use one stack (events + stream processing) for all workloads. This means the code is written once and operates on the same data path whether data is fresh or “old” – eliminating Lambda’s dual-code problem. Because data is always processed through a streaming engine, Kappa naturally delivers very low-latency results and treats every event uniformly. The event log can be queried by both real-time applications and by periodic analytics jobs (just replaying history).
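
The "reprocessing is just replay" idea can be sketched in a few lines. This is a hypothetical illustration: the in-memory list stands in for a durable log such as a Kafka topic, and `process` stands in for a stream job; the names and fields are invented for the example.

```python
# Append-only event log: the single source of truth.
event_log = [
    {"user": "u1", "amount": 30},
    {"user": "u2", "amount": 50},
    {"user": "u1", "amount": 20},
]

def process(events, state=None):
    """One stream job: fold events into a per-user running total."""
    state = state or {}
    for e in events:
        state[e["user"]] = state.get(e["user"], 0) + e["amount"]
    return state

# Live processing and backfills share one code path: a backfill with
# updated logic simply replays the log from offset 0 into fresh state.
live_state = process(event_log)
replayed_state = process(event_log)  # "batch" = replay of the same stream
print(live_state == replayed_state)  # True
```

A real backfill would deploy the new job version, reset its consumer offset to the beginning of the topic, and let it catch up – no second pipeline or second codebase involved.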

Advantages of Kappa architecture:

  • Simplicity and Maintainability: Only one pipeline to develop and debug. Teams write a single processing job rather than duplicating logic in batch and speed layers. This greatly simplifies testing and deployment.
  • Low Latency: Since all processing is stream-based, results can be updated in real-time (millisecond to second latency) for all data.
  • Consistency and Data Quality: When built on a robust event platform (like Kafka with exactly-once semantics), Kappa provides strong guarantees – events are ordered and processed exactly once – which can improve data quality and eliminate mismatches between views.
  • Unified Scaling: Scaling a stream platform (e.g. partitioning topics) automatically scales both real-time and reprocessing workloads.

Drawbacks of Kappa architecture:

  • Reprocessing Overhead: Replaying the entire log for backfills can be time-consuming and requires storing all historical data in the streaming system (which can be costly).
  • Learning Curve: Teams must be proficient with stream-processing paradigms (state management, fault tolerance in streams, etc.).
  • Less Separation of Concerns: Since there is no separate batch view, exceptionally heavy queries on historical data may compete with real-time processing, unless carefully managed.
  • Not Magic: Kappa is not universally better; it shines when streaming is the dominant mode. For workloads that truly need heavy, infrequent batch analytics, a traditional batch system might still be preferable.

When to use Kappa: Kappa is ideal for event-driven, real-time-centric applications. Examples include IoT telemetry, user interaction streams, and applications where fresh insights are critical (fraud detection, recommendation engines, monitoring). It suits scenarios where you can ingest all data into a streaming platform (Kafka/Pulsar) and process it on the fly. Kappa is also attractive if your team wants one codebase and wants to avoid Lambda’s complexity. In practice, many organizations use a Kappa-style approach for streaming analytics and still occasionally run a batch job for specialized reports.

Medallion Architecture (Bronze/Silver/Gold)

The Medallion Architecture (sometimes called multi-hop or Bronze–Silver–Gold) is a layered data design pattern popularized by Lakehouse platforms (e.g. Databricks). Rather than focusing on batch vs. stream, medallion emphasizes data quality and organization through successive refinement stages.

Figure: Medallion architecture (Bronze/Silver/Gold layers) – raw data is ingested into the Bronze layer, progressively cleaned/conformed in Silver, and refined into analytics-ready tables in Gold.
  • In the Bronze layer, data is ingested “as-is” from sources (streaming or batch). All raw events or records are landed here, often with only minimal parsing and metadata (e.g. timestamps). Bronze serves as a historical archive and ensures raw lineage is preserved.
  • The Silver layer applies data cleaning, deduplication, and schema conformance. Data from Bronze is transformed into consistent, canonical forms (e.g. resolving different source formats, filtering out corrupt records). The Silver tables provide an enterprise-wide view of key entities (customers, transactions, etc.). In other words, Silver conforms data into a trusted “single source” suitable for cross-functional analysis.
  • The Gold layer contains fully curated, consumable datasets – often dimensional or aggregated tables optimized for reporting and ML. Here, Silver data is further enriched and modeled (e.g. star schemas, time-series aggregates) with all business rules applied. Gold tables are what BI dashboards and ML models consume.

The medallion design ensures incremental improvement of data quality. As data “flows” downstream, each layer adds validation and structure. Importantly, this pattern is technology-agnostic: it can be implemented purely in batch ETL, purely in streaming, or a hybrid. For example, Bronze ingestion might use Kafka or cloud storage (even streaming writes), Silver might run Spark jobs (streaming or batch), and Gold might materialize tables via SQL or ML pipelines. The key idea is layering.
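
The three hops can be sketched with plain Python standing in for Delta Lake/Spark tables. This is a hypothetical, deliberately tiny example: the field names, cleansing rules, and aggregation are invented to illustrate the layering, not taken from any real pipeline.

```python
# Bronze: raw records landed as-is, warts and all.
bronze = [
    {"id": 1, "amount": "100", "country": "us"},
    {"id": 2, "amount": "bad", "country": "pl"},   # corrupt record
    {"id": 1, "amount": "100", "country": "us"},   # duplicate
]

def to_silver(rows):
    """Silver: clean, deduplicate, and conform to a canonical schema."""
    seen, out = set(), []
    for r in rows:
        try:
            rec = {"id": r["id"], "amount": int(r["amount"]),
                   "country": r["country"].upper()}
        except ValueError:
            continue  # quarantine/drop records that fail validation
        if rec["id"] not in seen:
            seen.add(rec["id"])
            out.append(rec)
    return out

def to_gold(rows):
    """Gold: aggregate into an analytics-ready table (revenue by country)."""
    totals = {}
    for r in rows:
        totals[r["country"]] = totals.get(r["country"], 0) + r["amount"]
    return totals

gold = to_gold(to_silver(bronze))
print(gold)  # {'US': 100}
```

Each hop is small and independently testable, which is the maintainability benefit discussed below; in a Lakehouse the same shape is typically expressed as three tables connected by incremental jobs.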

Advantages of Medallion architecture:

  • Data Quality and Governance: By isolating raw, cleansed, and curated data, medallion architectures make data validation and lineage clear. Downstream consumers can trust Gold data for analysis, knowing Silver enforced schema rules and Bronze preserved raw audit trails.
  • Modularity and Maintainability: Each layer has a clear role and can be managed incrementally. Teams can focus on small transformations at each stage instead of one monolithic ETL. This separation simplifies debugging and schema evolution.
  • Scalability: Built on scalable data lakehouse tools (Delta Lake, Spark, etc.), medallion pipelines can handle huge volumes. By using cloud storage (e.g. S3/ADLS) and distributed compute, each layer can scale elastically.
  • Flexibility (Batch/Streaming): Medallion does not prescribe batch vs. stream; modern implementations allow both. For instance, Bronze tables can be fed by streaming ingestion, and Silver can be updated incrementally with Spark Structured Streaming, making the entire pipeline near-real-time.
  • Enterprise Integration: Because Silver produces a unified “enterprise view,” different business units can build their own Gold tables for specific use cases. This aligns with concepts like data mesh or governed data products.

Drawbacks of Medallion architecture:

  • Storage and Overhead: Storing multiple copies of data (raw, cleansed, curated) increases storage costs. Each transformation can duplicate data, though low-cost cloud object storage mitigates this to some extent.
  • Pipeline Complexity: More layers mean more moving parts (jobs, tables, schemas) to manage. Achieving end-to-end quality requires discipline in building each step.
  • Not a Silver Bullet: Medallion architecture is specifically suited to lakehouse environments. Organizations with legacy data warehouses or no streaming may not adopt it easily. It also assumes a culture that can operationalize multi-step pipelines.

When to use Medallion: Use it when you need robust data curation across an organization, especially in a Lakehouse or big data environment. It’s ideal for teams that want to enforce data quality and deliver clean, reliable tables to analysts and data scientists. Medallion architecture complements either Lambda or Kappa styles – for example, you might build a Lambda pipeline whose output is written into Bronze/Silver/Gold tables. The philosophy is that each layer incrementally improves data – as one blog puts it, “the main purpose of Medallion architecture is to structure and refine data through multiple stages for better quality and usability”.

Scalability, Maintainability, and Data Quality

Choosing among these architectures depends on your priorities. All are designed for big data scale, but they differ in trade-offs:

  • Scalability: Lambda scales by parallelizing batch and streaming and allows each to grow independently. Kappa relies on a streaming backbone; platforms like Kafka and Flink are highly scalable but shift the burden of history into the event log. Medallion scales by leveraging distributed file systems and compute (Delta Lake, Spark), allowing large volume ingestion and processing across layers. In practice, all three can handle petabytes of data – the key is matching the architecture to workload patterns.
  • Maintainability: Kappa is generally simplest to maintain (one pipeline codebase). Lambda is more complex due to its dual pipelines, which doubles development and testing work. Medallion’s maintainability depends on good data engineering practices – while it has more tables and steps, each step is smaller and (in a good Lakehouse) declaratively managed, which many teams find modular and traceable. A medallion pipeline may actually be easier to reason about end-to-end since each hop has a clear purpose.
  • Data Quality: Both Lambda and Medallion explicitly aim for high data quality. Lambda’s batch layer ensures correctness, “correcting” any speed-layer errors or latency gaps. Kappa can achieve high quality if the streaming platform provides exactly-once semantics and strong ordering guarantees, but it requires careful design (e.g. idempotent sinks). Medallion architecture bakes in quality by design: each layer enforces additional checks and transformations. For example, Silver might quarantine bad records, and Gold enforces business rules. In short, Lambda and Medallion offer explicit correction mechanisms, while Kappa relies on the inherent correctness of continuous processing.
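
An idempotent sink, mentioned above as a prerequisite for Kappa-style correctness, can be sketched as follows. This is a hypothetical illustration: the dict stands in for a keyed store, and the event/key names are invented; real systems typically key writes by event ID in the sink itself (or use transactional writes).

```python
store = {}      # stands in for a keyed database table
applied = set() # event IDs already applied to the store

def write(event_id: str, key: str, delta: int) -> None:
    """Apply each event at most once, even if it is delivered repeatedly."""
    if event_id in applied:
        return  # duplicate delivery after a retry or replay: safe no-op
    applied.add(event_id)
    store[key] = store.get(key, 0) + delta

write("evt-1", "orders", 1)
write("evt-1", "orders", 1)  # redelivery: ignored, count not doubled
write("evt-2", "orders", 1)
print(store)  # {'orders': 2}
```

With at-least-once delivery from the log, this dedup-by-event-ID pattern is what keeps replays and retries from corrupting downstream state.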

In summary

Lambda is best for mixed workloads needing both immediate and accurate results; Kappa is best for purely streaming, event-driven workloads where simplicity and freshness rule; and Medallion is best when data quality and organized refinement are the goal (especially in a Lakehouse). These patterns can be combined (e.g., a Lambda pipeline feeding a medallion-design data lake), but understanding their principles helps architects make informed decisions. By aligning the architecture with your data velocity needs and quality requirements, you can build pipelines that scale gracefully, are easier to maintain, and deliver trusted insights.
