
Design a ride-sharing service

In this series (18 parts)
  1. Design a URL shortener
  2. Design a key-value store
  3. Design a rate limiter
  4. Design a web crawler
  5. Design a notification system
  6. Design a news feed
  7. Design a chat application
  8. Design a video streaming platform
  9. Design a music streaming service
  10. Design a ride-sharing service
  11. Design a food delivery platform
  12. Design a hotel booking platform
  13. Design a search engine
  14. Design a distributed message queue
  15. Design a code deployment system
  16. Design a payments platform
  17. Design an ad click aggregation system
  18. Design a distributed cache

Ride-sharing is one of the hardest real-time matching problems in production today. You need to ingest millions of GPS pings per second, index them geospatially, match riders to nearby drivers in under 2 seconds, price the trip dynamically, and track every state transition from request to drop-off. The LLD ride-sharing article covers object modeling. Here we tackle the distributed systems that make it work at scale.

1. Requirements

Functional

  • Riders request a ride by providing pickup and drop-off locations.
  • The system finds the best available driver within a radius and sends a match offer.
  • Drivers accept or decline within 15 seconds. On decline or timeout, the system re-matches.
  • Both parties see live GPS tracking during the trip.
  • Surge pricing adjusts fares based on local supply and demand.
  • Trip history, receipts, and ratings are stored permanently.

Non-functional

Metric | Target
Registered riders | 100M
Active drivers (peak) | 5M
Trips per day | 20M
Driver location updates | 5M drivers × 1 update/3s = 1.67M writes/s
Match latency (p99) | < 2s
Location staleness | < 5s
Trip state consistency | Exactly-once transitions
Availability | 99.99% for matching, 99.9% for billing

2. Capacity estimation

Location ingestion. Each update is roughly 100 bytes (driver ID, lat, lng, timestamp, heading, speed). At 1.67M writes/s that is 167 MB/s or about 14 TB/day of raw location data. We keep hot locations in memory and archive cold data.

Trip storage. 20M trips/day, each trip record is around 2 KB (route polyline, fare breakdown, timestamps, ratings). That is 40 GB/day, or about 14.6 TB/year. Database sharding by city or rider ID keeps query latency low.

Matching QPS. 20M trips/day is roughly 230 ride requests/s average, with peak at 3x giving 700 requests/s. Each request fans out to 10-20 candidate drivers, so the matching service handles 7K-14K candidate evaluations per second.

Bandwidth. Live tracking pushes location to riders once per second during active trips. With 1M concurrent trips at peak, that is 1M messages/s at 50 bytes each: 50 MB/s outbound over WebSockets. Add driver-side push notifications for offers and trip updates and we are looking at roughly 80 MB/s total outbound bandwidth.

Read vs write ratio. The location index is extremely write-heavy: 1.67M writes/s for driver updates versus roughly 14K reads/s for matching queries. This 120:1 write-to-read ratio drives us toward an in-memory store like Redis rather than a disk-based database.
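The back-of-envelope figures above can be reproduced in a few lines. The per-record sizes (100 bytes per location update, 2 KB per trip record) are the assumptions stated in the text:

```python
# Back-of-envelope capacity check for the figures quoted above.
DRIVERS = 5_000_000          # active drivers at peak
UPDATE_INTERVAL_S = 3        # one GPS ping every 3 seconds
UPDATE_BYTES = 100           # assumed size of one location update
TRIPS_PER_DAY = 20_000_000
TRIP_BYTES = 2_000           # assumed size of one trip record

writes_per_s = DRIVERS / UPDATE_INTERVAL_S            # ~1.67M writes/s
ingest_mb_per_s = writes_per_s * UPDATE_BYTES / 1e6   # ~167 MB/s
ingest_tb_per_day = ingest_mb_per_s * 86_400 / 1e6    # ~14.4 TB/day

trip_gb_per_day = TRIPS_PER_DAY * TRIP_BYTES / 1e9    # ~40 GB/day
avg_request_qps = TRIPS_PER_DAY / 86_400              # ~230 requests/s

print(f"{writes_per_s:,.0f} writes/s, {ingest_mb_per_s:.0f} MB/s, "
      f"{ingest_tb_per_day:.1f} TB/day, {trip_gb_per_day:.0f} GB/day, "
      f"{avg_request_qps:.0f} req/s")
```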

3. High-level architecture

graph TD
  R[Rider App] -->|REST / gRPC| GW[API Gateway]
  D[Driver App] -->|WebSocket| LS[Location Service]
  GW --> MS[Matching Service]
  GW --> TS[Trip Service]
  GW --> PS[Pricing Service]
  LS -->|1.67M writes/s| GI[(Geospatial Index<br/>Redis + S2)]
  MS -->|query nearby| GI
  MS -->|dispatch offer| NQ[Notification Queue]
  NQ -->|push| D
  TS -->|state machine| TDB[(Trip DB<br/>Postgres)]
  TS -->|events| EV[Event Bus<br/>Kafka]
  PS -->|read demand| GI
  PS -->|read supply| GI
  EV --> AN[Analytics]
  EV --> BL[Billing Service]
  BL --> PAY[Payment Gateway]
  R -->|WebSocket| TRK[Tracking Service]
  TRK -->|subscribe| GI

High-level architecture of the ride-sharing platform. The geospatial index is the central nervous system connecting location ingestion, matching, pricing, and tracking.

The driver app maintains a persistent WebSocket to the Location Service. Rider-facing reads go through the API Gateway. The event bus (Kafka) decouples trip lifecycle events from downstream consumers such as billing and analytics.

4. Deep dive: location ingestion and geospatial indexing

The ingestion pipeline

5M active drivers send GPS updates every 3 seconds. That is 1.67M writes/s. Each update must land in the geospatial index within 2 seconds so the matching service sees fresh driver positions.

The Location Service is a stateless fleet of servers behind a load balancer. Drivers connect over WebSocket. Each server buffers incoming updates and writes them to the geospatial index in 500 ms micro-batches. Batching amortizes per-write overhead and lets us hit the throughput target with a cluster of 20-30 Redis instances.
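The micro-batching logic can be sketched as follows. This is a minimal single-threaded sketch: `flush_fn` stands in for the real batched Redis pipeline write, and the flush triggers on either batch size or elapsed time.

```python
import time

class MicroBatcher:
    """Accumulates driver location updates and flushes them to the
    geospatial index in micro-batches. A sketch: flush_fn stands in
    for the real batched Redis pipeline write."""

    def __init__(self, flush_fn, max_batch=500, max_wait_s=0.5):
        self.flush_fn = flush_fn
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.buffer = []
        self.last_flush = time.monotonic()

    def add(self, driver_id, lat, lng, ts):
        self.buffer.append((driver_id, lat, lng, ts))
        if (len(self.buffer) >= self.max_batch
                or time.monotonic() - self.last_flush >= self.max_wait_s):
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)   # one pipelined write per batch
            self.buffer = []
        self.last_flush = time.monotonic()

batches = []
b = MicroBatcher(batches.append, max_batch=3)
for i in range(7):
    b.add(f"driver-{i}", 37.77, -122.42, i)
b.flush()   # drain the partial final batch
```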

Geospatial indexing with S2 cells

We divide the Earth’s surface into hierarchical cells using Google’s S2 geometry library. Each cell at level 12 covers roughly 3-6 km², a good resolution for urban ride matching. A driver’s S2 cell ID is computed from their lat/lng coordinates.

graph TD
  subgraph S2 Cell Hierarchy
      L6[Level 6 Cell<br/>~20,000 km²]
      L9a[Level 9 Cell<br/>~320 km²]
      L9b[Level 9 Cell<br/>~320 km²]
      L12a[Level 12 Cell<br/>~5 km²]
      L12b[Level 12 Cell<br/>~5 km²]
      L12c[Level 12 Cell<br/>~5 km²]
      L12d[Level 12 Cell<br/>~5 km²]
      L6 --> L9a
      L6 --> L9b
      L9a --> L12a
      L9a --> L12b
      L9b --> L12c
      L9b --> L12d
  end
  subgraph Redis Sharding
      S1[Shard 1<br/>Cells 0-N/3]
      S2s[Shard 2<br/>Cells N/3-2N/3]
      S3[Shard 3<br/>Cells 2N/3-N]
  end
  L12a -.->|hash of cell ID| S1
  L12b -.->|hash of cell ID| S2s
  L12c -.->|hash of cell ID| S3
  L12d -.->|hash of cell ID| S1

S2 cell hierarchy mapping to Redis shards. Each level-12 cell is assigned to a shard by consistent hashing on the cell ID.

The index is a Redis cluster where each shard stores a sorted set per S2 cell. The key is the cell ID, members are driver IDs, and scores are timestamps. To find drivers near a pickup point, we compute the covering set of S2 cells within the search radius (typically 3-5 km) and query each cell’s sorted set. This turns a radius search into a small number of exact key lookups.
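The insert-and-query pattern can be sketched in memory. Here a flat lat/lng grid is a toy stand-in for S2 (real systems would use the S2 library's covering computation), and a dict of member-to-score mappings mimics one Redis sorted set per cell:

```python
import math
from collections import defaultdict

# Toy stand-in for S2: a flat lat/lng grid, each "cell" ~0.02 degrees
# on a side. The cell size is an illustrative assumption.
CELL_DEG = 0.02

def cell_id(lat, lng):
    return (math.floor(lat / CELL_DEG), math.floor(lng / CELL_DEG))

# One "sorted set" per cell: member -> score (last-update timestamp),
# mirroring Redis ZADD semantics.
index = defaultdict(dict)

def update_driver(driver_id, lat, lng, ts):
    index[cell_id(lat, lng)][driver_id] = ts

def query_nearby(lat, lng, ring=1):
    """Return drivers in the covering set: the pickup cell plus a
    ring of neighbor cells (stands in for an S2 radius covering)."""
    cx, cy = cell_id(lat, lng)
    found = {}
    for dx in range(-ring, ring + 1):
        for dy in range(-ring, ring + 1):
            found.update(index.get((cx + dx, cy + dy), {}))
    return found

update_driver("d1", 37.7749, -122.4194, 100)   # near pickup
update_driver("d2", 37.7755, -122.4200, 101)   # same area
update_driver("d3", 37.8049, -122.2711, 102)   # across the bay
nearby = query_nearby(37.7749, -122.4194)
```

The radius search reduces to a handful of exact key lookups, which is what makes it cheap at this write rate.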

Why S2 over geohash? Geohash cells vary in shape near the poles and have edge discontinuities where adjacent cells share no common prefix. S2 cells are uniform and hierarchical, making radius queries more predictable. The trade-off is higher computational cost per cell operation, but at our scale the CPU overhead is negligible compared to the network round-trips saved by fewer false-positive cells.

Handling stale locations

Drivers disconnect, enter tunnels, or close the app. A background sweeper runs every 10 seconds and removes entries older than 30 seconds from the sorted sets with ZREMRANGEBYSCORE, using a maximum score of now minus 30 seconds. This prevents the matching service from dispatching to ghost drivers. Caching TTLs alone are insufficient here because removal must happen within a bounded window, not lazily on the next read.
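The sweep can be sketched in memory, treating each per-cell mapping of driver to timestamp as the analogue of a Redis sorted set:

```python
import time

STALE_AFTER_S = 30

def sweep(cells, now=None):
    """Remove entries older than STALE_AFTER_S from every per-cell
    set -- the in-memory analogue of running
    ZREMRANGEBYSCORE <key> -inf (now - 30) on each Redis key."""
    now = time.time() if now is None else now
    cutoff = now - STALE_AFTER_S
    removed = 0
    for members in cells.values():
        stale = [d for d, ts in members.items() if ts < cutoff]
        for d in stale:
            del members[d]
            removed += 1
    return removed

cells = {"cell-a": {"d1": 100.0, "d2": 135.0}, "cell-b": {"d3": 90.0}}
removed = sweep(cells, now=140.0)   # cutoff = 110: d1 and d3 are stale
```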

When a driver reconnects after a brief disconnection, the Location Service re-inserts them into the index on the next GPS update. No explicit “reconnect” handshake is needed since the index is purely position-based. For longer outages (app killed, phone off), the driver must explicitly go online again through the app, which triggers a fresh WebSocket connection and resumes location publishing.

5. Deep dive: ride matching

When a rider requests a ride, the Matching Service executes this flow:

sequenceDiagram
  participant R as Rider App
  participant GW as API Gateway
  participant MS as Matching Service
  participant GI as Geospatial Index
  participant PS as Pricing Service
  participant D as Driver App

  R->>GW: POST /rides (pickup, dropoff)
  GW->>PS: GetPrice(pickup, dropoff, demand)
  PS-->>GW: fare estimate + surge multiplier
  GW-->>R: fare confirmation screen
  R->>GW: CONFIRM ride request
  GW->>MS: Match(rideId, pickup, radius=3km)
  MS->>GI: QueryNearby(pickup, 3km)
  GI-->>MS: candidate drivers (sorted by distance)
  MS->>MS: rank by ETA, rating, acceptance rate
  MS->>D: OFFER (rideId, pickup, fare) via push
  Note over D: 15s timeout
  D->>MS: ACCEPT
  MS->>GW: matched (driverId)
  GW->>R: driver assigned + ETA

Ride request and match sequence. The matching service queries the geospatial index, ranks candidates, and dispatches an offer with a 15-second timeout.

The ranking algorithm

Raw distance is a poor proxy for arrival time. A driver 1 km away on the wrong side of a highway may take 10 minutes while a driver 2 km away on the same road takes 3 minutes. The ranker combines:

  1. Estimated time of arrival (ETA) from a routing engine (weighted 0.5).
  2. Driver acceptance rate over the last 50 offers (weighted 0.2). Low-acceptance drivers waste match cycles.
  3. Driver rating (weighted 0.15). Higher-rated drivers improve rider experience.
  4. Trip heading alignment (weighted 0.15). A driver already heading toward the pickup location is preferred.

The top-ranked driver gets the offer. On timeout or decline, the next candidate receives it. After 3 failed attempts, we expand the search radius to 5 km and re-query.
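The weighted score can be sketched directly from the four factors above. The 15-minute ETA normalization ceiling and the 5-star rating scale are illustrative assumptions, chosen so every term lands in [0, 1]:

```python
def rank_score(eta_s, acceptance_rate, rating, heading_alignment):
    """Weighted candidate score using the weights from the text.
    Higher is better. ETA is normalized against a 15-minute ceiling
    (an assumption); acceptance_rate and heading_alignment are in
    [0, 1]; rating is on a 5-star scale."""
    eta_term = max(0.0, 1.0 - eta_s / 900.0)       # 900 s = 15 min cap
    return (0.5 * eta_term
            + 0.2 * acceptance_rate
            + 0.15 * rating / 5.0
            + 0.15 * heading_alignment)

# The highway example: nearby-but-slow loses to farther-but-fast.
close_slow = rank_score(eta_s=600, acceptance_rate=0.9, rating=4.8,
                        heading_alignment=0.2)
far_fast = rank_score(eta_s=180, acceptance_rate=0.9, rating=4.8,
                      heading_alignment=0.9)
```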

Consistency under concurrent requests

Two riders in the same area may both see the same top-ranked driver. We use optimistic locking on the driver’s availability status. When the Matching Service dispatches an offer, it atomically sets the driver’s state to OFFERED in Redis with a 15-second TTL. If a second match attempt reads this driver, it skips to the next candidate. On acceptance, the state transitions to ON_TRIP. On timeout, the TTL expires and the driver becomes available again automatically.
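The claim-with-TTL pattern corresponds to the Redis idiom `SET driver:{id}:state OFFERED NX EX 15`. A minimal in-memory sketch of those semantics, with time passed explicitly for clarity:

```python
class DriverHolds:
    """In-memory sketch of the optimistic claim: an atomic set-if-absent
    with a 15 s expiry, so a timed-out offer frees the driver without
    any explicit cleanup."""

    def __init__(self, ttl_s=15):
        self.ttl_s = ttl_s
        self.holds = {}   # driver_id -> (state, expires_at)

    def try_offer(self, driver_id, now):
        held = self.holds.get(driver_id)
        if held and held[1] > now:
            return False          # already OFFERED or ON_TRIP: skip candidate
        self.holds[driver_id] = ("OFFERED", now + self.ttl_s)
        return True

    def accept(self, driver_id):
        # Acceptance pins the driver until the trip ends.
        self.holds[driver_id] = ("ON_TRIP", float("inf"))

h = DriverHolds()
assert h.try_offer("d1", now=0)        # first match claims the driver
assert not h.try_offer("d1", now=5)    # concurrent match skips this driver
assert h.try_offer("d1", now=20)       # TTL expired: available again
```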

6. Deep dive: trip state machine

A trip moves through well-defined states. Every transition is persisted as an event in the Trip DB and published to the event bus.

stateDiagram-v2
  [*] --> REQUESTED
  REQUESTED --> MATCHING: find driver
  MATCHING --> DRIVER_ASSIGNED: driver accepts
  MATCHING --> CANCELLED: no driver found / rider cancels
  DRIVER_ASSIGNED --> DRIVER_ARRIVING: driver en route
  DRIVER_ARRIVING --> WAITING: driver at pickup
  WAITING --> IN_PROGRESS: rider picked up
  WAITING --> CANCELLED: rider no-show (5 min timeout)
  IN_PROGRESS --> COMPLETED: arrived at dropoff
  IN_PROGRESS --> CANCELLED: emergency cancel
  COMPLETED --> [*]
  CANCELLED --> [*]

Trip lifecycle state machine. Each transition fires an event to the bus for billing, analytics, and notification consumers.

Exactly-once state transitions

The Trip Service uses a Postgres row-level lock on the trip record. Each transition is a conditional update: UPDATE trips SET state = 'IN_PROGRESS' WHERE id = ? AND state = 'WAITING'. If the affected row count is zero, the transition is rejected. The event is written to an outbox table within the same database transaction (the transactional outbox pattern). A separate relay process tails the outbox table and pushes events to Kafka; delivery to Kafka is at-least-once, so consumers deduplicate by event ID, giving effectively exactly-once processing even if the service crashes mid-transition.
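The conditional update plus outbox write can be demonstrated end to end. This sketch uses SQLite as a stand-in for Postgres; table and column names are illustrative:

```python
import json
import sqlite3

# SQLite stand-ins for the Postgres trip and outbox tables.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE trips  (id TEXT PRIMARY KEY, state TEXT NOT NULL);
    CREATE TABLE outbox (seq INTEGER PRIMARY KEY AUTOINCREMENT,
                         trip_id TEXT, payload TEXT);
""")
db.execute("INSERT INTO trips VALUES ('t1', 'WAITING')")
db.commit()

def transition(trip_id, from_state, to_state):
    """Apply a state transition and enqueue its event atomically.
    Returns False (and rolls back) if the trip is not in from_state."""
    cur = db.execute(
        "UPDATE trips SET state = ? WHERE id = ? AND state = ?",
        (to_state, trip_id, from_state))
    if cur.rowcount == 0:
        db.rollback()
        return False
    db.execute("INSERT INTO outbox (trip_id, payload) VALUES (?, ?)",
               (trip_id, json.dumps({"from": from_state, "to": to_state})))
    db.commit()          # transition and event commit together
    return True

ok = transition("t1", "WAITING", "IN_PROGRESS")
dup = transition("t1", "WAITING", "IN_PROGRESS")   # replay is rejected
```

A relay process would then read the outbox in `seq` order and publish each row to Kafka, deleting or marking rows only after the broker acknowledges.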

Surge pricing

The Pricing Service computes a surge multiplier per S2 level-9 cell (roughly 320 km²). Every 30 seconds it reads:

  • Demand: count of open ride requests in the cell over the last 2 minutes.
  • Supply: count of available drivers in the cell right now.

The surge multiplier follows a simple curve: multiplier = max(1.0, demand / (supply * 0.7)), capped at 5.0x. The 0.7 factor makes surge kick in before supply is fully exhausted, giving drivers time to reposition. Multipliers are cached with a 30-second TTL and applied to each fare estimate.
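The curve above, with a guard for the zero-supply edge case (an assumption; the text does not specify it):

```python
def surge_multiplier(open_requests, available_drivers):
    """Surge curve from the text: demand / (0.7 * supply),
    floored at 1.0x and capped at 5.0x."""
    if available_drivers == 0:
        return 5.0                     # no supply: maximum surge (assumed)
    raw = open_requests / (available_drivers * 0.7)
    return min(5.0, max(1.0, raw))

surge_multiplier(10, 50)    # ample supply: floors at 1.0x
surge_multiplier(70, 50)    # demand outpaces supply: 2.0x
surge_multiplier(500, 20)   # severe imbalance: capped at 5.0x
```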

7. Trade-offs and alternatives

Decision | Our choice | Alternative | Why we chose it
Geospatial index | Redis + S2 cells | PostGIS with R-tree | Redis gives sub-ms reads at 1.67M writes/s. PostGIS is better for complex polygon queries but slower for point-in-radius at this write throughput.
Driver state | Optimistic locking in Redis | Distributed lock (Redlock) | Optimistic locking is simpler and faster. False positives (skipping an available driver) are harmless since we retry.
Event delivery | Transactional outbox + Kafka | CDC with Debezium | Outbox gives us explicit control over event schema. CDC is lower-effort but couples event format to DB schema changes.
Matching strategy | Sequential offer (one driver at a time) | Broadcast to top-3 drivers | Sequential avoids wasted driver attention. Broadcast reduces match latency at the cost of declined-offer overhead.
Surge granularity | S2 level-9 cells | Neighborhood polygons | S2 cells are uniform and computable. Custom polygons require manual curation per city.

8. What real systems actually do

Uber moved from a simple geohash grid to Google S2 cells around 2014-2015 to handle edge cases at cell boundaries. Their matching service (formerly known as “Supply”) processes millions of candidate evaluations per second. Lyft uses a similar cell-based approach with their own spatial indexing layer built on top of Redis.

Both companies moved from simple distance-based matching to ETA-based matching early on. The routing engine is a critical dependency: if ETA computation is slow, matching latency suffers for every request.

Surge pricing remains controversial. Uber experimented with “upfront pricing” where the rider sees a fixed fare regardless of the route taken. This shifts financial risk from the rider to the platform but requires accurate route prediction.

For trip state management, both Uber and Lyft use event-sourced architectures where the current trip state is derived from a log of events. This makes debugging production issues much easier since you can replay the exact sequence of state transitions.

State machines look simple on paper but become complex when you account for edge cases: what happens if the driver’s phone dies mid-trip? What if the rider reports a safety issue? What if the payment fails after trip completion? Each of these spawns additional states and transitions that the core state machine must handle gracefully without leaving trips in limbo.

9. What comes next

This design handles the core matching and trip management loop. Production systems add several more layers:

  • Pooling (shared rides): matching multiple riders heading in the same direction to a single driver. This turns the matching problem from 1:1 into an optimization problem over route compatibility.
  • Scheduled rides: pre-matching drivers to future trips requires a reservation system with overbooking and cancellation handling.
  • Driver positioning: proactively suggesting where idle drivers should go based on predicted demand. This is a forecasting problem using historical trip data.
  • Fraud detection: fake GPS locations, driver-rider collusion on surge pricing, and bonus abuse all require real-time anomaly detection pipelines.
  • Multi-modal: integrating bikes, scooters, and public transit options into the same matching and routing engine.

The geospatial index and state machine patterns from this design apply directly to food delivery, logistics, and any system that matches moving supply with location-dependent demand.

Each of these extensions deserves its own deep dive, but the foundation we built here (geospatial indexing for fast spatial queries, optimistic locking for concurrent matching, event-sourced state machines for trip lifecycle, and cell-based surge pricing) provides the scaffolding they all build on top of.
