Design a food delivery platform

In this series (18 parts)
  1. Design a URL shortener
  2. Design a key-value store
  3. Design a rate limiter
  4. Design a web crawler
  5. Design a notification system
  6. Design a news feed
  7. Design a chat application
  8. Design a video streaming platform
  9. Design a music streaming service
  10. Design a ride-sharing service
  11. Design a food delivery platform
  12. Design a hotel booking platform
  13. Design a search engine
  14. Design a distributed message queue
  15. Design a code deployment system
  16. Design a payments platform
  17. Design an ad click aggregation system
  18. Design a distributed cache

Food delivery looks like ride-sharing with an extra stop. It is not. A ride has two parties and one moving object. A food order has three parties (customer, restaurant, dasher), two wait phases (kitchen prep and transit), and a product that degrades with every passing minute. That extra dimension turns matching, routing, and timing into a significantly harder problem.

Platforms like DoorDash, Uber Eats, and Deliveroo handle 50M+ orders per day across 500K+ restaurants with 2M active dashers. The core challenge is not any single piece. It is keeping three independent actors synchronized in real time while food is still hot.

This article builds on patterns from the ride-sharing case study. We will reuse driver location tracking and dispatch concepts but extend them for kitchen prep time, batched orders, and restaurant-side workflows.

1. Requirements

Functional

  • Customers browse menus, place orders, and track delivery in real time.
  • Restaurants receive orders, confirm prep times, and mark items ready.
  • Dashers get dispatch offers, navigate to restaurants, pick up orders, and deliver.
  • The platform computes estimated delivery times (EDT) that update live.
  • Support for scheduled orders (place now, deliver at 6 PM).
  • Promotions, surge pricing, and dasher incentive programs.

Non-functional

  • Scale: 50M orders/day, 500K restaurants, 2M dashers, 80M registered customers.
  • Latency: order placement under 300 ms p99, location updates processed in under 500 ms.
  • Availability: 99.99% for the ordering path. A one-hour outage at dinner rush costs millions.
  • Consistency: exactly-once order creation, at-least-once dispatch notifications.
  • Freshness: EDT shown to customer must reflect reality within 30 seconds.

2. Capacity estimation

Orders: 50M orders/day averages out to roughly 580 orders/second, or about 5,800 orders/second at peak (assuming 10x peak-to-average). Each order averages 2 KB of metadata. That is roughly 12 MB/s of write throughput at peak.

Location updates: 2M active dashers sending GPS pings every 4 seconds yields 500K updates/second. Each update is ~100 bytes, so 50 MB/s of location data.

Menu reads: customers browse 5 restaurants per order on average. At 50M orders/day that is 250M menu reads/day, roughly 30K QPS at peak. Menus average 50 KB each. Heavy caching is required here.

Storage: order history grows at ~100 GB/day (50M orders x 2 KB). Location traces at 50 MB/s accumulate 4.3 TB/day but are pruned after 7 days, keeping the hot set around 30 TB.

Bandwidth: the largest consumer is the tracking feed. 10M concurrent customers watching a map with updates every 5 seconds at 200 bytes each is 400 MB/s of outbound WebSocket traffic. Real-time systems patterns apply directly here.
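These estimates are easy to sanity-check with a short back-of-envelope script (the constants are the assumptions stated above):

```python
# Back-of-envelope capacity estimates, using the assumptions from this section.
SECONDS_PER_DAY = 86_400
PEAK_FACTOR = 10  # assumed peak-to-average ratio

orders_per_day = 50_000_000
avg_order_qps = orders_per_day / SECONDS_PER_DAY      # ~580 orders/s average
peak_order_qps = avg_order_qps * PEAK_FACTOR          # ~5,800 orders/s at peak
order_write_bw = peak_order_qps * 2 * 1024            # ~12 MB/s at 2 KB/order

dashers = 2_000_000
ping_interval_s = 4
location_qps = dashers / ping_interval_s              # 500K updates/s
location_bw = location_qps * 100                      # 50 MB/s at 100 B/ping
location_per_day_tb = location_bw * SECONDS_PER_DAY / 1e12  # ~4.3 TB/day

print(f"peak orders/s: {peak_order_qps:,.0f}")
print(f"order write bandwidth: {order_write_bw / 1e6:.1f} MB/s")
print(f"location updates/s: {location_qps:,.0f}")
print(f"location data/day: {location_per_day_tb:.2f} TB")
```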

3. High-level architecture

graph TD
  Customer[Customer App] -->|REST / GraphQL| API[API Gateway]
  Restaurant[Restaurant Tablet] -->|REST| API
  Dasher[Dasher App] -->|REST + WebSocket| API

  API --> OrderSvc[Order Service]
  API --> MenuSvc[Menu Service]
  API --> TrackingSvc[Tracking Service]

  OrderSvc -->|publish| MQ[Message Queue]
  MQ --> DispatchSvc[Dispatch Service]
  MQ --> NotifSvc[Notification Service]
  MQ --> ETASvc[ETA Service]

  DispatchSvc --> DasherLoc[(Dasher Location Store)]
  MenuSvc --> MenuDB[(Menu DB + Cache)]
  OrderSvc --> OrderDB[(Order DB)]
  TrackingSvc --> DasherLoc
  ETASvc --> MLModel[ML ETA Model]
  TrackingSvc -->|WebSocket push| Customer

High-level architecture for the food delivery platform. The API gateway fans out to domain services, with a message queue decoupling the write path from downstream consumers.

Three clients (customer, restaurant, dasher) hit a shared API gateway. The gateway routes to domain services: Order, Menu, Tracking, Dispatch, ETA, and Notification. The Order Service publishes events to a message queue. Downstream consumers (Dispatch, ETA, Notifications) react asynchronously.

Key data stores:

  • Order DB: sharded Postgres by order_id. 50M writes/day.
  • Menu DB: read-heavy, cached aggressively in Redis. 30K QPS peak reads.
  • Dasher Location Store: Redis Geo or a specialized geospatial index. 500K writes/sec.

4. Deep dive: order lifecycle

An order passes through well-defined states. Getting these transitions right prevents double charges, lost orders, and confused dashers.

stateDiagram-v2
  [*] --> Placed
  Placed --> Confirmed : restaurant accepts
  Placed --> Cancelled : customer cancels / timeout
  Confirmed --> Preparing : restaurant starts cooking
  Preparing --> ReadyForPickup : restaurant marks ready
  ReadyForPickup --> DasherAssigned : dispatch matches dasher
  DasherAssigned --> DasherAtRestaurant : dasher arrives
  DasherAtRestaurant --> PickedUp : dasher confirms pickup
  PickedUp --> InTransit : dasher navigating to customer
  InTransit --> Delivered : dasher confirms drop-off
  Delivered --> [*]

  Confirmed --> Cancelled : customer cancels (with fee)
  Preparing --> Cancelled : restaurant cancels (rare)
  DasherAssigned --> DasherReassigned : dasher cancels
  DasherReassigned --> DasherAssigned : new dasher found

State diagram for the order lifecycle. Most orders follow the happy path top to bottom. Cancellations and reassignments branch off at specific stages.

A few rules govern these transitions:

  1. Idempotency: every state transition carries a version number. The Order Service rejects stale updates. This prevents a race where two dashers both mark an order as picked up.
  2. Timeout guards: if a restaurant does not confirm within 90 seconds, the system auto-cancels and refunds. If a dasher does not arrive within the geofence window, the order gets reassigned.
  3. Compensation: once the restaurant starts preparing, a customer cancel triggers a partial charge. The Order Service writes a compensation event to the queue so billing and analytics stay consistent.
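The idempotency rule can be made concrete with a small sketch. The transition table mirrors the state diagram above; the `Order` class and function names are illustrative, not a real platform's schema:

```python
# Sketch: version-guarded state transitions (optimistic concurrency).
# State names come from the lifecycle diagram; everything else is assumed.
from dataclasses import dataclass

# Legal transitions, copied from the state diagram (happy path + branches).
TRANSITIONS = {
    "Placed": {"Confirmed", "Cancelled"},
    "Confirmed": {"Preparing", "Cancelled"},
    "Preparing": {"ReadyForPickup", "Cancelled"},
    "ReadyForPickup": {"DasherAssigned"},
    "DasherAssigned": {"DasherAtRestaurant", "DasherReassigned"},
    "DasherReassigned": {"DasherAssigned"},
    "DasherAtRestaurant": {"PickedUp"},
    "PickedUp": {"InTransit"},
    "InTransit": {"Delivered"},
}

@dataclass
class Order:
    state: str = "Placed"
    version: int = 0

def apply_transition(order: Order, new_state: str, expected_version: int) -> bool:
    """Apply a transition only if the caller saw the latest version and the
    transition is legal. Returns True on success, False on rejection."""
    if expected_version != order.version:
        return False  # stale update, e.g. a second dasher racing on PickedUp
    if new_state not in TRANSITIONS.get(order.state, set()):
        return False  # illegal jump, e.g. Placed -> Delivered
    order.state = new_state
    order.version += 1
    return True
```

If two dashers both read version 5 and try to mark the order picked up, the second write carries a stale version and is rejected.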

Order placement flow

sequenceDiagram
  participant C as Customer App
  participant AG as API Gateway
  participant OS as Order Service
  participant DB as Order DB
  participant MQ as Message Queue
  participant DS as Dispatch Service
  participant RS as Restaurant Tablet
  participant ES as ETA Service

  C->>AG: POST /orders (cart, address, payment token)
  AG->>OS: createOrder()
  OS->>DB: INSERT order (status=Placed)
  OS->>MQ: OrderPlaced event
  MQ->>RS: new order notification
  MQ->>DS: begin dasher matching
  MQ->>ES: compute initial EDT
  RS-->>OS: confirmOrder (prep estimate: 18 min)
  OS->>DB: UPDATE status=Confirmed
  DS-->>OS: dasherAssigned (dasher_id, ETA to restaurant)
  ES-->>C: EDT update (38 min total)

Sequence diagram for the order placement flow. The Order Service publishes a single event; downstream services react independently.

The key insight: the Order Service does not orchestrate. It publishes an OrderPlaced event and moves on. Dispatch, ETA, and the restaurant tablet all subscribe independently. This keeps the write path fast (under 100 ms) and lets each service scale on its own.
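One common way to make "insert the order and publish the event" atomic is a transactional outbox. The sketch below uses SQLite to stand in for the Order DB; table names and the relay interface are assumptions, not the article's prescribed schema:

```python
# Sketch: transactional outbox for OrderPlaced events. The order row and its
# event are committed in ONE transaction, so a crash cannot lose the event.
import json
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT, payload TEXT);
    CREATE TABLE outbox (id TEXT PRIMARY KEY, topic TEXT, payload TEXT,
                         published INTEGER DEFAULT 0);
""")

def create_order(cart: dict) -> str:
    order_id = str(uuid.uuid4())
    event = {"type": "OrderPlaced", "order_id": order_id, "cart": cart}
    with db:  # single transaction: order insert + outbox insert commit together
        db.execute("INSERT INTO orders VALUES (?, 'Placed', ?)",
                   (order_id, json.dumps(cart)))
        db.execute("INSERT INTO outbox (id, topic, payload) VALUES (?, ?, ?)",
                   (str(uuid.uuid4()), "orders", json.dumps(event)))
    return order_id

def drain_outbox(publish) -> int:
    """Relay loop: push unpublished events to the queue, then mark them sent.
    A crash between publish and mark re-sends, hence at-least-once delivery."""
    rows = db.execute("SELECT id, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()
    return len(rows)
```

Dispatch, ETA, and the restaurant tablet then consume from the queue without the Order Service knowing they exist.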

5. Deep dive: dispatch and dasher matching

Dispatch is the heart of the platform. A bad match means cold food or idle dashers. The algorithm optimizes for a global objective: minimize total delivery time across all active orders, not just one.

Inputs to the matching algorithm:

  • Dasher location, current load (0, 1, or 2 active orders for batching).
  • Restaurant location and estimated prep completion time.
  • Customer location and delivery priority.
  • Historical dasher speed for the area and time of day.
  • Dasher preferences (vehicle type, max distance, earnings mode).

Matching strategy: the Dispatch Service runs a bipartite matching optimization every 2 seconds. It collects all unassigned orders and all available dashers within a radius, then solves a cost-minimized assignment. The cost function weights:

| Factor | Weight | Why |
| --- | --- | --- |
| Dasher-to-restaurant travel time | 0.35 | Biggest EDT component |
| Food wait time at restaurant | 0.25 | Dasher idle time is expensive |
| Restaurant-to-customer distance | 0.20 | Affects food quality |
| Dasher utilization | 0.10 | Spread load across dashers |
| Batching compatibility | 0.10 | Two orders on one trip saves cost |
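A minimal sketch of the per-zone assignment step, using the weights above. Feature names are illustrative, and brute force stands in for a real Hungarian or ILP solver; it only works for the handful of candidates one zone considers per tick:

```python
# Sketch: cost-minimized bipartite assignment for one zone.
# Weights from the table above; feature names are assumptions.
from itertools import permutations

WEIGHTS = {
    "dasher_to_restaurant_min": 0.35,
    "food_wait_min": 0.25,
    "restaurant_to_customer_km": 0.20,
    "dasher_utilization": 0.10,
    "batching_penalty": 0.10,
}

def pair_cost(features: dict) -> float:
    """Weighted cost of assigning one dasher to one order."""
    return sum(WEIGHTS[k] * features[k] for k in WEIGHTS)

def match_zone(cost_matrix):
    """cost_matrix[i][j] = cost of giving order i to dasher j.
    Brute-force search for the assignment minimizing total cost."""
    n_orders = len(cost_matrix)
    dashers = range(len(cost_matrix[0]))
    best, best_cost = None, float("inf")
    for perm in permutations(dashers, n_orders):
        cost = sum(cost_matrix[i][perm[i]] for i in range(n_orders))
        if cost < best_cost:
            best, best_cost = dict(enumerate(perm)), cost
    return best, best_cost
```

At production scale the same cost matrix feeds a proper assignment solver, but the objective is identical: global cost across the zone, not the best match for any single order.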

At roughly 5,800 orders/second peak, the Dispatch Service partitions the problem by geographic zones. Each zone runs its own matching loop independently. Zones overlap slightly so dashers near a boundary can be considered by both.

Batching: the system may hold an order for up to 3 minutes if a nearby second order is likely to appear. This is a calculated trade-off. The customer waits slightly longer, but the platform serves two orders in one trip, reducing cost and increasing dasher earnings.

Dasher location ingestion: dashers send GPS pings every 4 seconds. The Tracking Service writes these to a Redis Geo set keyed by zone. When Dispatch needs nearby dashers, it issues a GEORADIUS query within a 5 km radius. At 500K writes/sec this store is sharded across 50+ Redis nodes, partitioned by geohash prefix. Stale entries (dashers who went offline) are evicted with a 30-second TTL.
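Partitioning by geohash prefix keeps nearby dashers on the same Redis node, so a radius query touches few shards. A minimal sketch of that routing (the shard count of 50 comes from the text; the precision choice is an assumption):

```python
# Sketch: route a GPS ping to a Redis shard by geohash prefix.
import zlib

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat: float, lon: float, precision: int = 5) -> str:
    """Minimal standard geohash encoder (interleaves lon/lat bits,
    5 bits per base32 character)."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    out, bits, ch, even = [], 0, 0, True
    while len(out) < precision:
        rng, val = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        ch = (ch << 1) | (val >= mid)
        rng[0 if val >= mid else 1] = mid
        even = not even
        bits += 1
        if bits == 5:
            out.append(BASE32[ch])
            bits, ch = 0, 0
    return "".join(out)

def shard_for(lat: float, lon: float, n_shards: int = 50) -> int:
    """Stable shard index from a 4-character geohash prefix (~39 km cells),
    so pings from the same neighborhood land on the same node."""
    prefix = geohash(lat, lon, precision=4)
    return zlib.crc32(prefix.encode()) % n_shards
```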

Offer acceptance: when Dispatch picks a dasher, it sends an offer via push notification. The dasher has 30 seconds to accept. If they decline or timeout, the next-best dasher gets the offer. At peak, 15% of first offers are declined, so the system pre-computes a ranked list of 5 candidates per order.
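The offer cascade reduces to a short loop over the pre-computed candidate list. The timeout and list length come from the text; the callback interface is an assumption:

```python
# Sketch: walk the ranked candidate list until a dasher accepts.
def dispatch_with_fallback(candidates, send_offer, max_offers=5):
    """send_offer(dasher) should block up to the 30-second offer window and
    return True on accept, False on decline/timeout."""
    for dasher in candidates[:max_offers]:
        if send_offer(dasher):
            return dasher
    return None  # all declined or timed out; order re-enters the matching pool
```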

6. Deep dive: ETA estimation

EDT is the single most visible number to the customer. Getting it wrong by 10 minutes destroys trust. The ETA Service combines:

  1. Prep time model: trained on per-restaurant historical data. A busy Friday night at a pizza place has a different distribution than a Tuesday lunch. The model outputs a distribution, not a point estimate. The system shows the 75th percentile (optimistic but realistic).

  2. Travel time model: uses real-time traffic data, dasher GPS traces, and road graph routing. Distinct from ride-sharing because the dasher may be on a bike or on foot for the last mile.

  3. Queue model: estimates how many orders are ahead of this one at the restaurant. Restaurants have finite kitchen capacity. A place that normally takes 15 minutes per order but has 8 pending orders will take much longer.

The combined EDT formula:

EDT = prep_time_remaining + dasher_to_restaurant_travel + pickup_buffer + restaurant_to_customer_travel

The pickup_buffer accounts for parking, walking into the restaurant, and verifying the order. It averages 4 minutes but varies by restaurant type. Drive-through pickups have a different profile than a downtown walk-up.
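Putting the formula and the p75 prep estimate together, a minimal sketch (function and parameter names are illustrative; the sample values in the test are made up):

```python
# Sketch: composite EDT from the formula above. Prep time is estimated at
# the 75th percentile of the restaurant's historical distribution.
from statistics import quantiles

def edt_minutes(prep_samples, prep_elapsed_min, dasher_to_restaurant_min,
                restaurant_to_customer_min, pickup_buffer_min=4.0):
    """EDT = prep remaining + dasher travel + pickup buffer + delivery travel."""
    p75_prep = quantiles(prep_samples, n=4)[2]  # optimistic but realistic
    prep_remaining = max(p75_prep - prep_elapsed_min, 0.0)
    return (prep_remaining + dasher_to_restaurant_min
            + pickup_buffer_min + restaurant_to_customer_min)
```

Once the order is picked up, the prep and pickup terms drop out and only the routing-model travel time remains, which is why accuracy improves at that stage.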

EDT updates push to the customer every 30 seconds via WebSocket. When the dasher picks up the order, the formula simplifies to just the travel time component, and accuracy improves significantly. At this stage the system switches from the composite model to a pure routing model, which historically achieves under 3 minutes of error at the p50.

7. Trade-offs and alternatives

Orchestration vs. choreography

We chose choreography (event-driven) over orchestration (a central coordinator calling each service in sequence). Trade-offs:

| Approach | Pros | Cons |
| --- | --- | --- |
| Choreography | Services scale independently, no single bottleneck, easier to add new consumers | Harder to debug end-to-end, eventual consistency |
| Orchestration | Easier to reason about flow, simpler error handling | Central coordinator becomes a bottleneck, tight coupling |

DoorDash started with orchestration and migrated to choreography as scale grew past 10M orders/day. The debugging problem is mitigated with distributed tracing (every event carries a correlation_id).

Push vs. pull for restaurant orders

Two options for notifying restaurants:

  • Push (WebSocket to tablet): lower latency, but requires persistent connections to 500K devices.
  • Pull (tablet polls every 5 seconds): simpler infrastructure, but up to 5 seconds of delay.

Most platforms use push with a pull fallback. If the WebSocket drops, the tablet polls until reconnection. This hybrid approach keeps latency under 1 second for 95% of orders.

Single dasher vs. batched delivery

Batching two orders on one trip reduces platform cost by 30% but increases delivery time for the first customer by 5 to 8 minutes. Platforms handle this by:

  • Only batching when both restaurants are within 0.5 miles.
  • Showing the customer upfront that the order is batched (transparency reduces complaints by 40%).
  • Offering a “priority delivery” option for an extra fee.

8. What real systems actually do

Real platforms have iterated on these problems for a decade. Their solutions are more nuanced than any single design document can capture, but the broad strokes are instructive.

DoorDash runs its dispatch as a constraint optimization that re-solves every few seconds per zone. Its ETA models are deep learning architectures trained on billions of historical deliveries. It shards the Order DB by region, not by order ID, to keep geographic queries fast.

Uber Eats shares infrastructure with Uber rides. The location tracking, matching, and routing stacks are largely the same. The key addition is the restaurant integration layer, which handles wildly different POS (point-of-sale) systems.

Deliveroo emphasizes rider (dasher) experience. Their dispatch algorithm explicitly includes a fairness constraint: no rider should wait idle for more than 10 minutes during peak hours. This occasionally produces suboptimal delivery times but improves rider retention.

All three platforms invest heavily in simulation. Before deploying a new matching algorithm, they replay millions of historical orders to measure impact on EDT, dasher utilization, and food quality proxies (time between ready and pickup). This simulation infrastructure is itself a large distributed system.

9. What comes next

This design covers the happy path and the most common failure modes. Production systems also handle:

  • Multi-restaurant orders: a single cart spanning two restaurants requires split dispatch and synchronized delivery.
  • Surge pricing: dynamic delivery fees based on demand/supply ratio per zone, similar to ride-sharing surge.
  • Fraud detection: fake delivery confirmations, promo abuse, and account takeovers.
  • Restaurant onboarding: menu ingestion from PDFs, photos, and POS integrations. This is more of a data pipeline problem than a system design one.
  • Dasher safety: panic buttons, route deviation alerts, and insurance integrations.
  • Kitchen display systems: real-time dashboards for restaurant staff showing order queue, priority, and dasher arrival countdown.
  • Customer support automation: ML-powered refund decisions for missing items, wrong orders, and late deliveries. At 50M orders/day even a 1% issue rate means 500K support tickets daily.

The food delivery platform is one of the most operationally complex consumer systems. It touches payments, logistics, real-time ML, and physical-world coordination. Mastering the order lifecycle state machine, the dispatch optimization loop, and the ETA pipeline covers the core of what interviewers care about.
