
Design an ad click aggregation system

In this series (18 parts)
  1. Design a URL shortener
  2. Design a key-value store
  3. Design a rate limiter
  4. Design a web crawler
  5. Design a notification system
  6. Design a news feed
  7. Design a chat application
  8. Design a video streaming platform
  9. Design a music streaming service
  10. Design a ride-sharing service
  11. Design a food delivery platform
  12. Design a hotel booking platform
  13. Design a search engine
  14. Design a distributed message queue
  15. Design a code deployment system
  16. Design a payments platform
  17. Design an ad click aggregation system
  18. Design a distributed cache

The core problem

Every time a user clicks an ad, an advertiser expects to be billed accurately. At the scale of a large ad network, that means ingesting 10 billion click events per day, deduplicating them within seconds, and producing aggregated counts that advertisers and publishers trust enough to pay real money against. Getting this wrong by even 0.1% means millions of dollars in disputed charges.

The system sits at the intersection of stream processing, batch processing, and message queues. It must serve real-time dashboards while also producing reconciled daily totals that hold up under audit.


Requirements

Functional

  • Ingest click events from ad-serving infrastructure in real time.
  • Deduplicate clicks from the same user on the same ad within a configurable window (default 1 minute).
  • Aggregate clicks by ad_id, campaign_id, publisher_id, and time granularity (1 min, 1 hour, 1 day).
  • Serve aggregated counts to advertiser dashboards with less than 10 seconds of staleness.
  • Produce reconciled daily totals for billing, correcting any streaming inaccuracies.
  • Support querying historical aggregation data for at least 2 years.

Non-functional

  • Scale: 500M DAU, 10B click events/day, peak of 400K clicks/second.
  • Latency: Aggregated counts visible within 10 seconds of the click.
  • Accuracy: Reconciled daily totals must be within 0.01% of ground truth.
  • Availability: 99.99% uptime. Advertisers monitor spend continuously.
  • Privacy: No raw user-level click logs exposed to advertisers. Only aggregated counts.

Capacity estimation

A single click event contains ad_id, user_id, timestamp, ip, user_agent, and referrer. That is roughly 500 bytes per event.

Metric                         Value
Daily events                   10 billion
Average QPS                    ~115K clicks/sec
Peak QPS                       ~400K clicks/sec
Raw storage per day            10B x 500B = 5 TB
Raw storage per year           ~1.8 PB
Aggregated rows per day        ~50M (across all dimensions)
Aggregated storage per year    ~20 TB

We keep raw logs for 7 days in hot storage for reconciliation, then move them to cold storage. Aggregated data stays queryable for 2 years.

Network bandwidth: 400K events/sec x 500 bytes = 200 MB/sec inbound at peak. Add serialization overhead and replication (3x for Kafka), and we need roughly 600 MB/sec of internal network bandwidth dedicated to the click pipeline.

Dedup cache sizing: At 400K clicks/sec with a 2-minute window, we store up to 48M fingerprints. Each fingerprint is a 16-byte hash plus 8 bytes of metadata. Total: 48M x 24B = ~1.15 GB. A single Redis node can hold this, but we shard across 10 nodes for throughput.
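The sizing above can be sanity-checked in a few lines of Python, using only the numbers stated in this section:

```python
# Back-of-the-envelope numbers from this section.
PEAK_QPS = 400_000            # peak clicks/sec
EVENT_BYTES = 500             # serialized click event
WINDOW_SEC = 2 * 60           # dedup cache TTL (2 minutes)
FP_BYTES = 16 + 8             # 16-byte hash + 8 bytes metadata

inbound_mb_s = PEAK_QPS * EVENT_BYTES / 1e6      # 200 MB/sec at peak
replicated_mb_s = inbound_mb_s * 3               # 600 MB/sec with 3x Kafka replication

fingerprints = PEAK_QPS * WINDOW_SEC             # 48M entries in the window
cache_gb = fingerprints * FP_BYTES / 1e9         # ~1.15 GB of dedup state

print(inbound_mb_s, replicated_mb_s, fingerprints, round(cache_gb, 2))
```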


High-level architecture

graph LR
  A[Ad Servers] -->|Click Events| B[API Gateway]
  B --> C[Kafka Cluster]
  C --> D[Stream Processor<br/>Flink]
  C --> E[Raw Log Store<br/>S3 / HDFS]
  D --> F[Aggregation Store<br/>Druid / ClickHouse]
  E --> G[Batch Processor<br/>Spark]
  G --> F
  F --> H[Query Service]
  H --> I[Advertiser Dashboard]
  H --> J[Billing Service]
  D --> K[Dedup Cache<br/>Redis Cluster]

High-level architecture showing the dual streaming and batch paths converging on a shared aggregation store.

Click events flow from ad servers through an API gateway into a Kafka cluster. From there, two paths diverge: a streaming path through Flink for real-time aggregation, and a batch path through Spark for daily reconciliation. Both write to the same aggregation store. This is the lambda architecture applied to ad tech.

The API gateway handles authentication, rate limiting, and initial validation. Invalid events (missing ad_id, malformed timestamps) get dropped here. At 400K events/sec peak, this layer needs at least 20 gateway instances behind a load balancer, each handling 20K requests/sec.

Kafka serves as the durable buffer between ingestion and processing. We partition the click topic by ad_id so that all clicks for a given ad land on the same partition, which simplifies downstream deduplication. With 10B events/day and a 7-day retention, we need roughly 35 TB of Kafka storage before replication, or about 105 TB with a replication factor of 3. A cluster of 30 brokers with 4 TB SSDs each covers this with headroom.
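The keyed partitioning can be sketched as follows. The partition count and md5 hash are illustrative stand-ins (Kafka's default partitioner hashes the key with murmur2), but the property that matters is the same: a stable ad_id-to-partition mapping.

```python
import hashlib

NUM_PARTITIONS = 256  # hypothetical partition count for the click topic

def partition_for(ad_id: str) -> int:
    """Stable hash of the message key, so every click for a given
    ad_id lands on the same partition. Kafka does this internally
    when a key is set; md5 here is just an illustration."""
    digest = hashlib.md5(ad_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

# All clicks for the same ad map to one partition:
assert partition_for("ad-42") == partition_for("ad-42")
```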


Deep dive: deduplication

Duplicate clicks are the single biggest source of billing disputes. A user double-clicks an ad, a bot retries a request, or a network partition causes the ad server to resend. We need to catch all of these.

Streaming dedup

The stream processor maintains a sliding window of recent click fingerprints. Each fingerprint is a hash of (user_id, ad_id, timestamp_bucket), where the timestamp bucket is the minute the click occurred. We store these fingerprints in Redis with a TTL of 2 minutes.
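A minimal sketch of the fingerprint, assuming epoch-second timestamps (md5 here simply gives a convenient 16-byte digest):

```python
import hashlib

def click_fingerprint(user_id: str, ad_id: str, ts_epoch: int) -> bytes:
    """16-byte fingerprint of (user_id, ad_id, minute bucket): two clicks
    by the same user on the same ad in the same minute collide."""
    minute_bucket = ts_epoch // 60  # truncate timestamp to the minute
    key = f"{user_id}:{ad_id}:{minute_bucket}".encode()
    return hashlib.md5(key).digest()
```

Two clicks 20 seconds apart inside one minute bucket produce the same fingerprint, so the second is dropped; clicks in different minute buckets do not collide.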

graph TD
  A[Click Event Arrives] --> B[Compute Fingerprint<br/>hash of user_id + ad_id + minute]
  B --> C{Bloom Filter Check}
  C -->|Possibly Exists| D{Exact Check in Redis}
  C -->|Definitely New| E[Write to Redis + Bloom Filter]
  D -->|Found| F[Drop as Duplicate]
  D -->|Not Found| E
  E --> G[Forward to Aggregation]
  G --> H[Update Minute-Level Counter]
  H --> I[Roll Up to Hour-Level Counter]

Streaming deduplication flow with a two-tier check. A Bloom filter reduces Redis lookups by 60-70%, and Redis provides exact matching for billing accuracy.

At 400K clicks/sec, the Redis cluster needs to handle up to 800K operations/sec in the worst case (one read plus one conditional write per click). The Bloom filter front end eliminates the read for most clicks: when it reports "definitely new", we skip the exact lookup and issue only the write. That leaves roughly 450-500K Redis ops/sec (one write per click, plus verification reads on Bloom filter hits). A cluster of 10 Redis nodes with 64 GB RAM each handles this comfortably.

The false positive rate matters. If the Bloom filter says “possibly exists” for a truly unique click, we pay an extra Redis lookup but still get the correct answer. If Redis itself were replaced with a Bloom filter, we would risk dropping legitimate clicks. For billing-critical systems, exact matching with Redis is the safer default. The Bloom filter is strictly an optimization layer, never a decision layer.
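The two-tier check can be sketched with in-memory stand-ins for both tiers; a real deployment would use a client-side or RedisBloom filter plus a Redis SET with NX and a TTL, but the control flow is the same:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter; only here to show the control flow.
    Production systems would use a library implementation."""
    def __init__(self, bits: int = 1 << 20, hashes: int = 4):
        self.bits = bits
        self.hashes = hashes
        self.array = bytearray(bits // 8)

    def _positions(self, fp: bytes):
        for i in range(self.hashes):
            h = hashlib.sha256(fp + bytes([i])).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def add(self, fp: bytes):
        for pos in self._positions(fp):
            self.array[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, fp: bytes) -> bool:
        return all(self.array[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(fp))

def is_duplicate(fp: bytes, bloom: BloomFilter, exact_store: set) -> bool:
    """Two-tier check from the diagram: Bloom filter first, exact
    store (Redis in production; a set here) only on a possible hit."""
    if not bloom.might_contain(fp):   # definitely new: skip the exact lookup
        bloom.add(fp)
        exact_store.add(fp)           # SET with NX + TTL in real Redis
        return False
    if fp in exact_store:             # exact match confirms the duplicate
        return True
    bloom.add(fp)                     # Bloom false positive: still unique
    exact_store.add(fp)
    return False
```

Note that the Bloom filter never decides anything on its own: a "possibly exists" always falls through to the exact store, which is what keeps the scheme safe for billing.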

Batch dedup

The batch path processes raw logs from S3 once per hour. It performs exact deduplication using a sort-merge approach: sort all events by (user_id, ad_id, timestamp), then scan linearly and drop duplicates within the 1-minute window. This catches any duplicates that slipped through the streaming path due to Redis cache eviction, network partitions, or Flink checkpoint restores.
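In production this runs as a Spark job; the core sort-merge pass, sketched in plain Python under the assumption that each event is a dict with user_id, ad_id, and an epoch-second ts:

```python
def dedup_batch(events, window_sec=60):
    """Sort-merge dedup over a complete log: sort by (user_id, ad_id,
    timestamp), then keep a click only if the previous kept click for
    the same (user_id, ad_id) falls outside the window."""
    events = sorted(events, key=lambda e: (e["user_id"], e["ad_id"], e["ts"]))
    kept, last = [], {}
    for e in events:
        k = (e["user_id"], e["ad_id"])
        if k in last and e["ts"] - last[k] < window_sec:
            continue  # duplicate within the window: drop it
        last[k] = e["ts"]
        kept.append(e)
    return kept
```

Because the pass operates on the complete, sorted hourly log, a click that arrived 5 minutes late still sits next to its twin after sorting and is correctly dropped.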

The batch path also catches a subtle failure mode: late-arriving events. If a click event arrives 5 minutes late due to network delays, the streaming dedup window may have expired. The batch path, operating on complete hourly logs, has full visibility and deduplicates correctly.


Deep dive: lambda architecture for reconciliation

The streaming path gives us speed. The batch path gives us accuracy. The reconciliation layer merges them.

graph TD
  subgraph Speed Layer
      A[Kafka] --> B[Flink Stream Processor]
      B --> C[Minute Aggregates]
      C --> D[Hour Aggregates<br/>Approximate]
  end

  subgraph Batch Layer
      E[S3 Raw Logs<br/>Immutable] --> F[Spark Hourly Job]
      F --> G[Hour Aggregates<br/>Exact]
  end

  D --> H{Reconciliation<br/>Comparator}
  G --> H
  H -->|Delta < 0.01%| I[Accept Stream Counts]
  H -->|Delta >= 0.01%| J[Replace with Batch Counts]
  I --> K[Billing Store]
  J --> K
  K --> L[Correction Events to Dashboard]

Lambda architecture for reconciliation. Streaming counts are accepted when they match batch totals within tolerance. Batch counts override when the delta exceeds 0.01%.

The reconciliation comparator runs hourly. It takes the streaming hour-level aggregates and the batch hour-level aggregates, computes the delta for each (ad_id, hour) pair, and decides which to trust.
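The comparator's decision rule can be sketched per (ad_id, hour) key, with batch counts treated as ground truth:

```python
TOLERANCE = 1e-4  # 0.01%, from the accuracy requirement

def reconcile(stream_counts: dict, batch_counts: dict) -> dict:
    """Accept the streaming count when it is within tolerance of the
    exact batch count; otherwise override with the batch count."""
    final = {}
    for key, exact in batch_counts.items():
        approx = stream_counts.get(key, 0)
        delta = abs(approx - exact) / exact if exact else abs(approx)
        final[key] = approx if delta < TOLERANCE else exact
    return final
```

When a key is overridden, the real system would also emit the correction event described below so dashboards and billing see the adjustment.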

In practice, the streaming path is accurate within 0.005% for 99% of ad campaigns. The remaining 1% typically involve high-volume campaigns where Redis eviction or Flink restarts caused missed deduplication. When batch counts override streaming counts, downstream consumers (dashboards, billing) receive a correction event. The dashboard shows a small adjustment indicator so advertisers understand the number changed.

Why not just use batch?

Advertisers want real-time spend visibility. A pure batch system with hourly updates means an advertiser could overspend their daily budget by an hour’s worth of clicks before seeing the total. At $50 CPM for premium placements, one hour of unchecked spend on a viral campaign can burn through thousands of dollars.

Why not just use streaming?

Stream processors can lose state during failures. Flink checkpoints mitigate this, but a checkpoint restore after a crash can replay events and produce overcounts. The batch path, processing immutable logs from S3, provides the ground truth that no streaming system can match.


Deep dive: aggregation store and sharding

The aggregation store holds pre-computed counts at multiple time granularities. We need fast writes from both streaming and batch paths, and fast reads for dashboard queries.

ClickHouse or Apache Druid are natural fits here. Both support high-throughput columnar ingestion and sub-second analytical queries. We partition data by time (daily partitions) and shard by ad_id using consistent hashing.

With 10 shards, each shard handles roughly 5M aggregated rows per day. Reads are straightforward since dashboard queries always include ad_id or campaign_id, which maps directly to a shard.

For hot campaigns (top 1% by volume that generate 30% of all clicks), we add a write buffer. Instead of incrementing the counter on every click, we buffer 1,000 clicks in memory and issue a single atomic increment. This reduces write amplification by 1000x for these campaigns and prevents hot-shard problems.

The write buffer introduces a trade-off: dashboard counts may lag by up to the buffer flush interval (typically 1 second). For a system that already tolerates 10 seconds of staleness, this is acceptable.
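A sketch of the write buffer, assuming a hypothetical store object exposing an atomic increment(ad_id, n):

```python
import threading
import time
from collections import defaultdict

class CounterBuffer:
    """Buffers per-ad increments in memory and flushes them as one
    atomic add per ad, trading up to flush_interval seconds of
    dashboard staleness for far fewer store writes."""
    def __init__(self, store, flush_every=1000, flush_interval=1.0):
        self.store = store                  # e.g. a ClickHouse/Druid writer
        self.flush_every = flush_every
        self.flush_interval = flush_interval
        self.pending = defaultdict(int)
        self.lock = threading.Lock()
        self.last_flush = time.monotonic()

    def record_click(self, ad_id: str):
        with self.lock:
            self.pending[ad_id] += 1
            due = (self.pending[ad_id] >= self.flush_every or
                   time.monotonic() - self.last_flush >= self.flush_interval)
            if due:
                self._flush_locked()

    def _flush_locked(self):
        for ad_id, n in self.pending.items():
            self.store.increment(ad_id, n)  # one atomic add per ad
        self.pending.clear()
        self.last_flush = time.monotonic()
```

The time-based trigger bounds staleness for cold campaigns; the count-based trigger bounds memory for hot ones.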

Traffic follows a pronounced daily pattern: peak throughput at noon UTC (US morning, EU afternoon) is nearly 10x the trough at 03:00 UTC. The system must handle the peak without degradation, but auto-scaling the batch processing layer during off-peak hours saves significant compute cost. Provisioning for peak means most machines sit idle at 3 AM; spot instances for the batch layer offset this waste.


Trade-offs and alternative approaches

Exactly-once vs at-least-once: Kafka with Flink supports exactly-once semantics through transactional producers and two-phase commit. This adds 10-15% latency overhead. Most ad systems choose at-least-once with downstream deduplication because it is simpler to operate and the dedup layer already exists for business reasons.

Pre-aggregation in Kafka Streams vs external processor: Kafka Streams can perform lightweight aggregation within the broker cluster, eliminating the need for a separate Flink deployment. The trade-off is reduced flexibility. Kafka Streams handles simple group-by-count well but struggles with complex windowed joins needed for cross-campaign fraud detection.

Time-series DB vs OLAP engine: TimescaleDB or InfluxDB can store aggregated counts efficiently. However, OLAP engines like ClickHouse handle the multi-dimensional slicing (filter by campaign, group by publisher, aggregate by hour) that advertiser dashboards demand. For this use case, OLAP wins.

Privacy and differential privacy: Differential privacy can be applied to aggregated counts before exposing them to advertisers. Adding calibrated noise (Laplace mechanism with epsilon=1.0) ensures individual user clicks cannot be inferred from aggregated data. The cost is roughly 0.5% accuracy loss on small campaigns (under 1,000 daily clicks). Large campaigns with millions of clicks are virtually unaffected. Regulations like GDPR increasingly push systems toward this kind of formal privacy guarantee.
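A sketch of the Laplace mechanism for a count query (sensitivity 1, since one user's click changes a count by at most 1), using the fact that a Laplace sample is the difference of two exponential samples with the same rate:

```python
import random

def dp_count(true_count: int, epsilon: float = 1.0) -> float:
    """Return the count plus Laplace(1/epsilon) noise, clamped at 0.
    Smaller epsilon means stronger privacy and more noise."""
    scale = 1.0 / epsilon
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return max(0.0, true_count + noise)
```

With epsilon=1.0 the noise is on the order of single clicks, which is why campaigns with millions of clicks are virtually unaffected while sub-1,000-click campaigns see a measurable relative error.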

Kappa architecture (streaming only): Some teams skip the batch layer entirely and rely on Flink’s checkpoint mechanism plus a replayable log (Kafka with long retention) as the source of truth. This simplifies operations but makes reconciliation harder. If Flink state corrupts, you reprocess from Kafka, which can take hours for a full day of data at this scale.


What real systems actually do

Google Ads uses Mesa, a geo-replicated OLAP system built on top of BigTable. Mesa provides exactly the reconciliation pattern described here: streaming updates for real-time counts, batch updates for corrections, and atomic version swaps when batch results are ready. Mesa achieves cross-datacenter consistency by treating each batch version as an immutable snapshot.

Meta’s ad aggregation runs on Scuba for real-time queries and Hive/Spark for batch reconciliation. They handle deduplication at the edge using probabilistic data structures and reconcile in batch. Their system processes trillions of events per day across the full ads pipeline (impressions, clicks, conversions).

Smaller ad networks often start with a simpler architecture: Kafka into ClickHouse with a periodic Spark job for reconciliation. This handles up to 1B events/day on modest infrastructure (20-30 nodes). Beyond that, you need the full lambda setup or a managed service like Google Cloud Dataflow.

The common thread across all of these: nobody trusts streaming counts alone for billing. Every production system has a batch reconciliation step.


What comes next

This design assumes click events are the only signal. Real ad systems also track impressions (view events), conversions (purchases after clicking), and attribution windows (did the user buy within 7 days of clicking?). Each of these adds complexity to the aggregation pipeline, particularly attribution, which requires joining click streams with conversion streams across time windows that can span days.

Fraud detection is another dimension entirely. Click fraud (bots generating fake clicks) can be detected by analyzing click patterns: abnormal click rates from a single IP, clicks without corresponding impressions, or geographic anomalies where the click originates from a different country than the user’s profile. This typically runs as a separate streaming pipeline that feeds a fraud score back into the aggregation layer, filtering out fraudulent clicks before they reach the billing store.

If you want to go deeper on the infrastructure that powers the ingestion layer, start with message queues and batch and stream processing. For the storage layer, database sharding covers the partitioning strategies used in the aggregation store, and caching explains the patterns behind the Redis dedup layer.
