High Level Design · Part 11

Batch and stream processing

In this series (12 parts)
  1. Monolith vs microservices
  2. Microservice communication patterns
  3. Service discovery and registration
  4. Event-driven architecture
  5. Distributed data patterns
  6. Caching architecture patterns
  7. Search architecture
  8. Storage systems at scale
  9. Notification systems
  10. Real-time systems architecture
  11. Batch and stream processing
  12. Multi-region and global systems

Every night at 2am, a job wakes up, scans 500 million rows of click data, computes daily aggregates, and writes summary tables that power the analytics dashboard. That is batch processing. Meanwhile, a fraud detection system examines each credit card transaction as it happens, scoring it in under 100 milliseconds and blocking suspicious charges before the merchant even sees a response. That is stream processing.

Both are data processing. Both can compute the same results. But they operate on fundamentally different assumptions about time, completeness, and latency, and those assumptions determine which tools and architectures you reach for.

Batch processing

Batch processing operates on bounded, complete datasets. You know the input is finite: yesterday’s logs, last month’s orders, the entire user table. The job reads all the data, transforms it, and produces output. If it fails halfway through, you restart from the beginning.

The canonical batch framework is Apache Spark. At a conceptual level, Spark splits your data into partitions distributed across a cluster of worker nodes. You express transformations (map, filter, join, aggregate) as a directed acyclic graph (DAG) of operations. Spark’s scheduler optimizes this DAG, deciding which operations can be pipelined together and which require a shuffle (redistributing data across workers, typically for joins and group-by operations).

A Spark job processing 1TB of log data might run across 100 workers. Each worker reads its partition from distributed storage (HDFS or S3), applies the transformations locally, and writes intermediate results. Shuffles are the expensive part: they require writing data to disk and transferring it across the network. A well-optimized Spark job minimizes shuffles.
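These mechanics can be illustrated without a cluster. The toy sketch below is plain Python, not Spark's API: it partitions records, applies the map step locally on each partition, then hash-shuffles by key so every value for a given key lands on exactly one "worker" for aggregation (the shuffle is the step that costs disk and network in a real job):

```python
from collections import defaultdict

def partition(records, n):
    """Split input into n partitions (round-robin)."""
    parts = [[] for _ in range(n)]
    for i, r in enumerate(records):
        parts[i % n].append(r)
    return parts

def shuffle(mapped_parts, n):
    """Hash-partition by key so each key ends up on exactly one worker."""
    out = [defaultdict(list) for _ in range(n)]
    for part in mapped_parts:
        for key, value in part:
            out[hash(key) % n][key].append(value)
    return out

# Count clicks per page across 2 "workers".
records = [("home", 1), ("about", 1), ("home", 1), ("home", 1)]
parts = partition(records, 2)
mapped = [[(k, v) for k, v in part] for part in parts]  # map step: local, no data movement
shuffled = shuffle(mapped, 2)                           # shuffle: network + disk in real Spark
counts = {}
for worker in shuffled:
    for key, values in worker.items():
        counts[key] = sum(values)                       # reduce step: local again
```

Note that the map and reduce steps never move data between workers; only the shuffle does, which is why minimizing shuffles dominates Spark tuning.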

Batch processing excels when you need to reprocess historical data, compute aggregates over complete datasets, train machine learning models, or generate reports. The trade-off is latency: results are only available after the entire job finishes, which might take minutes to hours.

Stream processing

Stream processing operates on unbounded, never-ending data. Events arrive continuously: user clicks, sensor readings, financial transactions, log lines. The system processes each event (or small batch of events) as it arrives, producing results with sub-second to low-second latency.

Apache Flink and Kafka Streams are the two most prominent stream processing frameworks. Flink runs as a distributed cluster with a JobManager (coordinator) and TaskManagers (workers). Your processing logic is compiled into a dataflow graph that Flink executes across the cluster. Kafka Streams takes a different approach: it is a library embedded in your application, not a separate cluster. Your application instances form the processing cluster, reading from and writing to Kafka topics.

The key challenges in stream processing are event ordering, late arrivals, and state management. In a batch job, all data is present before processing starts, so ordering and completeness are trivial. In a stream, events can arrive out of order (due to network delays or partitioning), and you never know if all events for a given time window have arrived.

Windowing solves this partially. You define time windows (tumbling, sliding, or session-based) and aggregate events within each window. A tumbling window of 5 minutes collects all events in each 5-minute interval and emits a result at the end. But what if an event timestamped at 14:04:59 arrives at 14:06:01, after the window closed? Watermarks, a concept Flink popularized, help the system decide when it is safe to close a window by tracking the progress of event time across the stream.
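A minimal sketch of tumbling windows with a watermark, in plain Python rather than Flink's API (the 5-minute window and the 60-second lateness allowance are assumed policies, not anything the framework prescribes):

```python
WINDOW = 300     # 5-minute tumbling windows, in seconds of event time
LATENESS = 60    # watermark trails the max event time seen by 60s (assumed policy)

def window_start(ts):
    return ts - ts % WINDOW

def process(events):
    """events: (event_time, value) pairs in arrival order, possibly out of order."""
    open_windows = {}            # window start -> running sum
    watermark = float("-inf")
    results, dropped = {}, []
    for ts, value in events:
        watermark = max(watermark, ts - LATENESS)
        w = window_start(ts)
        if w + WINDOW <= watermark:          # window already closed: event is too late
            dropped.append((ts, value))
            continue
        open_windows[w] = open_windows.get(w, 0) + value
        # Close (emit) any window whose end the watermark has passed.
        for start in [s for s in open_windows if s + WINDOW <= watermark]:
            results[start] = open_windows.pop(start)
    results.update(open_windows)             # finite demo stream: flush what remains
    return results, dropped

# An event for window [0, 300) arrives after an event at t=650 has
# pushed the watermark past 300, so it is dropped as late.
results, dropped = process([(10, 1), (290, 1), (650, 1), (299, 1)])
```

Real systems also offer alternatives to dropping: Flink, for example, can route late events to a side output or re-fire an updated window result within a configured allowed lateness.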

Lambda architecture

The lambda architecture attempts to get the best of both worlds: the completeness and correctness of batch processing combined with the low latency of stream processing. It runs two parallel processing paths.

The batch layer processes the complete historical dataset periodically (say, every hour or every night). It computes exact results and writes them to a serving layer (a database or data warehouse). The speed layer processes the same data as a real-time stream, computing approximate or incremental results that are immediately available. The serving layer merges results from both layers at query time.

graph TD
  DS["Data Source"] --> BL["Batch Layer (Spark)"]
  DS --> SL["Speed Layer (Flink)"]
  BL -->|Complete, accurate, delayed| SV["Serving Layer"]
  SL -->|Approximate, real-time| SV
  SV --> Q["Query"]
  
  subgraph Batch Path
      BL --> DL["Data Lake (S3/HDFS)"]
      DL --> BL
  end

Lambda architecture with parallel batch and speed layers feeding a unified serving layer.

The appeal is clear: users see real-time updates from the speed layer, and those results are periodically corrected by the more accurate batch layer. A dashboard showing “total sales today” uses the speed layer for the last hour and the batch layer for everything before that.
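The query-time merge can be sketched in a few lines, assuming hypothetical hourly views and a "high watermark" marking the last hour the batch layer has fully processed:

```python
def total_sales_today(batch_view, speed_view, batch_high_watermark, now):
    """Merge at query time: exact batch totals up to the last completed batch
    run, plus approximate stream totals for the hours batch has not covered."""
    exact = sum(v for hour, v in batch_view.items() if hour <= batch_high_watermark)
    recent = sum(v for hour, v in speed_view.items()
                 if batch_high_watermark < hour <= now)
    return exact + recent

# Hourly totals; batch has processed through hour 12, and it is now hour 14.
batch_view = {10: 500, 11: 620, 12: 580}
speed_view = {12: 575, 13: 410, 14: 90}  # hour 12 is approximate, superseded by batch
total = total_sales_today(batch_view, speed_view, batch_high_watermark=12, now=14)
```

The speed layer's slightly-off value for hour 12 is simply ignored once the batch layer has covered that hour, which is how the periodic correction works.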

The problem is also clear: you maintain two codebases doing essentially the same computation in two different frameworks. A bug in the batch logic that is not replicated in the stream logic (or vice versa) produces inconsistent results. Schema changes require updating both paths. Testing becomes harder because you are testing two systems that must produce compatible outputs.

Kappa architecture

The kappa architecture, proposed by Jay Kreps (co-creator of Kafka), simplifies lambda by eliminating the batch layer entirely. All data flows through a single stream processing pipeline. Historical reprocessing is handled by replaying the event log from the beginning.

The key insight is that if your event log is durable and replayable (which Kafka provides), you do not need a separate batch system. When you need to recompute historical results (because your logic changed, or you found a bug), you deploy a new version of your stream processor, point it at the beginning of the log, and let it consume all historical events at full speed. Once it catches up to the present, it switches to real-time processing.
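The reprocessing flow can be sketched with an in-memory list standing in for the durable Kafka log (the `ClickCounter` processor and its click-count state are illustrative):

```python
class ClickCounter:
    """Toy stream processor: running click count per user. A new deployment
    rebuilds its state by replaying the log from offset 0."""
    def __init__(self):
        self.counts = {}
        self.offset = 0

    def consume(self, log, from_offset=None):
        start = self.offset if from_offset is None else from_offset
        for event in log[start:]:
            self.counts[event["user"]] = self.counts.get(event["user"], 0) + 1
        self.offset = len(log)

log = [{"user": "a"}, {"user": "b"}, {"user": "a"}]  # durable, replayable event log
v1 = ClickCounter()
v1.consume(log)                    # serving live traffic

log.append({"user": "b"})          # new events keep arriving
v1.consume(log)                    # v1 consumes incrementally

v2 = ClickCounter()                # deploy new version with fixed/changed logic
v2.consume(log, from_offset=0)     # replay entire history at full speed
```

Once `v2.counts` matches the live state it has "caught up" and can take over serving; v1 is then retired.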

graph TD
  DS["Data Source"] --> EL["Event Log (Kafka)"]
  EL --> SP["Stream Processor (Flink / Kafka Streams)"]
  SP --> SV["Serving Layer"]
  SV --> Q["Query"]
  
  EL -->|Replay from start| SP2["New Stream Processor Version"]
  SP2 -->|Catch up, then replace| SV

Kappa architecture with a single stream processing path and historical reprocessing through log replay.

Kappa is simpler to maintain because there is only one processing path. But it requires that your event log retains enough history for reprocessing (Kafka’s retention policy must be long enough, or you use tiered storage). It also assumes your stream processor can handle the throughput of consuming years of historical data in a reasonable time.

Lambda vs kappa: when each fits

The choice depends on your workload characteristics and organizational constraints.

Lambda fits when you need the batch layer for complex computations that stream processors handle poorly (large joins across historical datasets, training ML models), when your team already has batch infrastructure, or when the cost of maintaining two systems is acceptable given the reliability of having a batch-computed “ground truth.”

Kappa fits when your events are naturally ordered and append-only, when your stream processor can handle both real-time and historical reprocessing throughput, and when simplicity of a single codebase matters more than the flexibility of separate processing paths. Most modern systems lean toward kappa because stream processing frameworks have become powerful enough to handle workloads that previously required batch.

Exactly-once semantics

Both batch and stream systems need to handle failures without producing duplicate or missing results. Batch jobs achieve this through idempotent writes and restart-from-scratch semantics. If a Spark job fails, you delete the partial output and rerun. The complete input is still available in storage.
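A common way to make batch output idempotent is to write into a staging location and atomically swap it into place, so a rerun after a crash simply replaces any partial output. A minimal sketch (the single-file "job" and directory layout are illustrative):

```python
import shutil
import tempfile
from pathlib import Path

def run_batch_job(input_rows, output_dir):
    """Idempotent batch write: compute into a staging directory, then
    atomically swap it into place. Rerunning after a crash is safe."""
    output_dir = Path(output_dir)
    staging = Path(tempfile.mkdtemp(prefix="staging-"))
    total = sum(input_rows)                       # the 'transform'
    (staging / "part-0000.txt").write_text(str(total))
    if output_dir.exists():
        shutil.rmtree(output_dir)                 # discard partial/previous output
    staging.rename(output_dir)                    # commit (atomic on one filesystem)

rows = [1, 2, 3]
out = Path(tempfile.mkdtemp()) / "daily_totals"
run_batch_job(rows, out)
run_batch_job(rows, out)                          # rerun produces the same result
```

Spark's own output committers follow the same staging-then-commit shape, though the details differ per storage system (HDFS renames are atomic; S3 commits need extra care).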

Stream processing is harder because the input is continuous. If a Flink worker crashes after processing event 1000 but before checkpointing, the replacement worker will reprocess events starting from the last checkpoint. If those events triggered side effects (writing to a database, sending a notification), you get duplicates.

Flink solves this with distributed snapshots (based on the Chandy-Lamport algorithm). Periodically, the system injects barrier markers into the stream. When a worker receives a barrier, it snapshots its local state to durable storage. On recovery, the system restores the snapshot and replays events from the barrier position. Combined with transactional sinks (sinks that commit output atomically with the checkpoint), Flink achieves effectively exactly-once processing.
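The essential recovery property can be shown with a single-worker toy: state and stream position are checkpointed together, so restoring one restores the other, and replayed events do not double-count in the state. This is a stand-in for Flink's barrier mechanism, not a model of it (real Flink coordinates barriers across many parallel workers):

```python
def run(events, checkpoint_every=3, crash_at=None):
    """Checkpoint state and position atomically; on failure, restore the last
    checkpoint and replay from its position."""
    checkpoint = {"pos": 0, "total": 0}
    pos, total = checkpoint["pos"], checkpoint["total"]
    while pos < len(events):
        if pos == crash_at:
            # Crash: lose in-memory state, restore from the checkpoint.
            pos, total = checkpoint["pos"], checkpoint["total"]
            crash_at = None
            continue
        total += events[pos]
        pos += 1
        if pos % checkpoint_every == 0:
            checkpoint = {"pos": pos, "total": total}   # atomic snapshot
    return total

events = [1, 2, 3, 4, 5]
clean = run(events)
recovered = run(events, crash_at=4)   # crash mid-stream, recover, same answer
```

The replayed events between checkpoint and crash are processed twice, but because the state was rolled back, the result is unaffected; it is external side effects during that replay that need transactional sinks.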

Kafka Streams uses a similar approach but relies on Kafka’s transactional producer to atomically write output records and consumer offsets. If a failure occurs, the uncommitted transaction is rolled back and reprocessed.

Practical patterns

Several patterns recur across batch and stream systems.

Event sourcing stores every state change as an immutable event in a log. The current state is derived by replaying events. This pattern naturally fits stream processing because the event log is the source of truth, and any derived view can be recomputed by replaying.
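At its core, event sourcing means current state is a fold over the event log. A minimal sketch with an illustrative account-balance domain:

```python
def apply(balance, event):
    """Pure state-transition function: state is derived, events are the truth."""
    kind, amount = event
    if kind == "deposit":
        return balance + amount
    if kind == "withdraw":
        return balance - amount
    return balance  # unknown event types are ignored

# The immutable log is the source of truth.
event_log = [("deposit", 100), ("withdraw", 30), ("deposit", 50)]

balance = 0
for event in event_log:          # replay to derive current state
    balance = apply(balance, event)
```

Because `apply` is pure, fixing a bug in it and replaying the same log yields corrected state, which is exactly the reprocessing story kappa relies on.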

CQRS (command query responsibility segregation) separates write and read models. Writes append events to a log. Read models are materialized views built by stream processors consuming that log. Different read models can serve different query patterns without affecting the write path.
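A sketch of two read models materialized from the same event stream (the order events and view shapes are illustrative):

```python
# Write side appended these events to the log.
events = [
    {"type": "order_placed", "customer": "c1", "day": "2024-05-01", "amount": 40},
    {"type": "order_placed", "customer": "c2", "day": "2024-05-01", "amount": 25},
    {"type": "order_placed", "customer": "c1", "day": "2024-05-02", "amount": 10},
]

# Read model 1: order count per customer (serves a profile page).
orders_by_customer = {}
# Read model 2: revenue per day (serves a finance dashboard).
revenue_by_day = {}

# Each read model is logically its own consumer of the log; neither
# touches the write path.
for e in events:
    if e["type"] == "order_placed":
        orders_by_customer[e["customer"]] = orders_by_customer.get(e["customer"], 0) + 1
        revenue_by_day[e["day"]] = revenue_by_day.get(e["day"], 0) + e["amount"]
```

Adding a third query pattern later means adding a third consumer and replaying the log, with no change to the write model.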

Change data capture (CDC) bridges the gap between traditional databases and stream processing. Tools like Debezium read a database’s transaction log and publish each row change as an event to Kafka. Downstream stream processors can then react to database changes in near real-time without the application needing to publish events explicitly.

graph LR
  DB["Primary Database"] -->|Transaction Log| CDC["CDC (Debezium)"]
  CDC --> K["Kafka Topics"]
  K --> SP1["Stream Processor: Search Index"]
  K --> SP2["Stream Processor: Analytics"]
  K --> SP3["Stream Processor: Cache Invalidation"]
  SP1 --> ES["Elasticsearch"]
  SP2 --> DW["Data Warehouse"]
  SP3 --> RC["Redis Cache"]

Change data capture feeding multiple stream processors from a single database transaction log.
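The cache-invalidation consumer from the diagram can be sketched against a simplified Debezium-style change event (the envelope below is reduced to a few fields; real Debezium events carry a fuller schema and source metadata):

```python
# Simplified row-change event: op is c(reate), u(pdate), or d(elete),
# with before/after row images.
change_event = {
    "op": "u",
    "before": {"id": 42, "email": "old@example.com"},
    "after":  {"id": 42, "email": "new@example.com"},
    "source": {"table": "users"},
}

def cache_keys_to_invalidate(event):
    """Derive cache invalidations from a row change: for deletes the row image
    is in 'before', otherwise in 'after'."""
    table = event["source"]["table"]
    row = event["before"] if event["op"] == "d" else event["after"]
    return [f"{table}:{row['id']}"]

keys = cache_keys_to_invalidate(change_event)  # keys to delete from Redis
```

The application never publishes an "invalidate cache" message itself; the database's transaction log is the trigger, so the cache cannot drift from writes the application forgot to announce.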

What comes next

Batch and stream processing handle data within a single region. But what happens when your system spans continents? The next article covers multi-region and global systems: active-passive vs active-active deployments, data sovereignty, latency-based routing, and disaster recovery planning.
