API design: REST, GraphQL, gRPC
In this series (20 parts)
- What is system design and why it matters
- Estimations and back-of-envelope calculations
- Scalability: vertical vs horizontal scaling
- CAP theorem and distributed system tradeoffs
- Consistency models
- Load balancing
- Caching: strategies and patterns
- Content Delivery Networks
- Databases: SQL vs NoSQL and when to use each
- Database replication
- Database sharding and partitioning
- Consistent hashing
- Message queues and event streaming
- API design: REST, GraphQL, gRPC
- Rate limiting and throttling
- Proxies: forward and reverse
- Networking concepts for system design
- Reliability patterns: timeouts, retries, circuit breakers
- Observability: logging, metrics, tracing
- Security in system design
An API is a promise. It tells callers what they can send, what they will get back, and what guarantees hold under failure. Get the contract wrong and every team that depends on you pays the tax forever. Get it right and your system can evolve for years without breaking clients.
This article covers three dominant API styles, the gateway pattern that sits in front of them, and the rate limiting layer that keeps everything alive under pressure.
Prerequisites
You should read Message queues and event streaming first. That article explains why not every call needs to be synchronous, which directly affects when you reach for an API and when you reach for a queue instead.
REST: constraints that earn you properties
REST is not “HTTP with JSON.” It is a set of architectural constraints that Roy Fielding described in his 2000 dissertation. When you follow the constraints, you get specific properties for free.
The six constraints are: client-server separation, statelessness, cacheability, a uniform interface, a layered system, and optional code-on-demand. Most teams get client-server and statelessness right. They struggle with uniform interface and cacheability.
A uniform interface means resources have stable identifiers (URIs), representations are self-descriptive, and hypermedia drives state transitions (HATEOAS). In practice almost nobody implements full HATEOAS. What matters is that your resources are nouns, your HTTP methods are verbs, and your status codes are honest.
Resource design
Good REST APIs model resources, not procedures. Compare these:
```
POST /api/createUser   # RPC-style, avoid
POST /api/users        # resource-style, prefer
```
A user resource at /api/users/4821 supports GET, PUT, PATCH, and DELETE. The URI identifies the thing. The method says what to do with it. This separation means caches, proxies, and load balancers can reason about your traffic without understanding your business logic.
Status codes matter
Return 201 Created when you create a resource, not 200 OK. Return 404 when a resource does not exist, not 200 with an empty body. Return 429 Too Many Requests when the client is being rate limited. These are not pedantic details. Proxies and CDNs use status codes to decide what to cache, what to retry, and what to drop.
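Here is a minimal sketch of what honest status codes look like in handler code. The in-memory `USERS` store and the `handle` function are hypothetical, invented for illustration; the point is the `(status, payload)` pairs, which are what caches and proxies actually act on.

```python
# Hypothetical in-memory store and routes, just to illustrate status codes.
USERS = {"4821": {"name": "Ada"}}

def handle(method, path, body=None):
    """Return (status, payload) pairs that caches and proxies can trust."""
    parts = path.strip("/").split("/")
    if parts[:2] == ["api", "users"]:
        if method == "POST" and len(parts) == 2:
            new_id = str(max(map(int, USERS), default=0) + 1)
            USERS[new_id] = body or {}
            return 201, {"id": new_id}          # created: 201, not 200
        if method == "GET" and len(parts) == 3:
            user = USERS.get(parts[2])
            if user is None:
                return 404, None                # missing: 404, not empty 200
            return 200, user
    return 405, None                            # unsupported method or route
```

A CDN seeing the 404 knows not to cache a "successful" empty body; a client library seeing the 201 knows a resource now exists at a new URI.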
Versioning strategies
APIs change. The question is how to change them without breaking existing clients. Three common approaches:
URI versioning puts the version in the path: /api/v1/users. It is simple, visible in logs, and easy to route. The downside is that every version looks like a completely different API, even when 95% of endpoints are identical.
Header versioning uses a custom header like Accept: application/vnd.myapp.v2+json. It keeps URIs clean but is harder to test in a browser and harder to spot in access logs.
Query parameter versioning appends ?version=2. Easy to use, easy to miss, easy to forget.
URI versioning wins for most teams. It is explicit, debuggable, and works with every HTTP tool ever built. Plan for at most two live versions at any time. Supporting three or more versions simultaneously means your team spends more time on compatibility shims than on features.
Pagination
Never return unbounded collections. A GET /api/users that returns 2 million rows will kill your database, saturate the network, and crash the client. Use cursor-based pagination for large datasets:
```
GET /api/users?cursor=eyJpZCI6NDgyMX0&limit=50
```
Cursor pagination gives you stable results even when new rows are inserted between pages. Offset-based pagination (?page=3&limit=50) is simpler but skips or duplicates rows under concurrent writes.
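A cursor is just an opaque token encoding where the last page ended. The sketch below shows one common scheme, assuming rows are served in `id` order; the cursor in the URL above is exactly this encoding of `{"id":4821}`.

```python
import base64
import json

def encode_cursor(last_id):
    """Encode the last-seen id as an opaque, URL-safe cursor."""
    raw = json.dumps({"id": last_id}, separators=(",", ":")).encode()
    return base64.urlsafe_b64encode(raw).decode().rstrip("=")

def decode_cursor(cursor):
    """Decode a cursor back into the last-seen id, restoring padding."""
    padded = cursor + "=" * (-len(cursor) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))["id"]

def next_page(rows, cursor=None, limit=50):
    """Return rows with id greater than the cursor's, plus the next cursor.
    `rows` must be sorted by id, matching an ORDER BY id query."""
    last_id = decode_cursor(cursor) if cursor else 0
    page = [r for r in rows if r["id"] > last_id][:limit]
    next_cursor = encode_cursor(page[-1]["id"]) if page else None
    return page, next_cursor
```

Because the cursor pins the page to "everything after id 4821" rather than "skip 150 rows", concurrent inserts cannot shift rows between pages.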
GraphQL: ask for exactly what you need
GraphQL was open-sourced by Facebook in 2015 to solve a specific problem: mobile clients on slow networks needed to fetch deeply nested data in a single round trip, without downloading fields they did not need.
A GraphQL query lets the client specify the exact shape of the response:
```graphql
query {
  user(id: "4821") {
    name
    email
    orders(last: 5) {
      id
      total
      items {
        name
      }
    }
  }
}
```
One request. One response. No over-fetching, no under-fetching. In REST, this same data might require three separate calls: one for the user, one for orders, one for order items.
```mermaid
sequenceDiagram
    participant C as Client
    participant R as REST API
    C->>R: GET /users/4821
    R-->>C: user data
    C->>R: GET /users/4821/orders?last=5
    R-->>C: orders list
    C->>R: GET /orders/91/items
    R-->>C: order items
    Note over C,R: 3 round trips for REST
```
REST often requires multiple round trips to assemble a single view. Each call adds network latency.
```mermaid
sequenceDiagram
    participant C as Client
    participant G as GraphQL API
    C->>G: POST /graphql (query)
    G-->>C: exact shape returned
    Note over C,G: 1 round trip for GraphQL
```
GraphQL collapses the same data fetch into a single request. The client specifies the shape; the server fills it in.
The trade-offs nobody mentions upfront
GraphQL is not free. The server must parse, validate, and execute an arbitrary query graph. A malicious or careless client can send a query that joins six levels deep and touches every row in your database. You need query complexity analysis and depth limiting from day one.
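A production server would enforce depth limits on the parsed query AST (for example with a visitor in a GraphQL library), but the guardrail itself is simple. The sketch below approximates depth by brace nesting on the raw query text; `MAX_DEPTH = 5` is an arbitrary illustrative threshold.

```python
def query_depth(query):
    """Approximate nesting depth of a GraphQL query by counting braces.
    A real server walks the parsed AST; this illustrates the idea."""
    depth = max_depth = 0
    for ch in query:
        if ch == "{":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "}":
            depth -= 1
    return max_depth

MAX_DEPTH = 5  # illustrative limit; tune to your schema

def validate(query):
    """Reject overly deep queries before paying the cost of executing them."""
    if query_depth(query) > MAX_DEPTH:
        raise ValueError("query too deep; rejected before execution")
```

The key property is that rejection happens before any resolver runs, so a six-level-deep query costs the server a string scan instead of a database walk.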
Caching is harder. REST responses live at a URL, so HTTP caches work naturally. GraphQL sends everything as POST /graphql, which means your CDN sees one endpoint. You need application-level caching (persisted queries, response caching by query hash) to get comparable cache hit rates.
Error handling is different. GraphQL returns HTTP 200 even when a query fails, as long as the request itself is well formed. Errors are embedded in the response body under an `errors` field. This confuses monitoring tools that count 4xx/5xx responses. Your alerting pipeline needs to parse response bodies, not just status codes.
Schema evolution replaces versioning. You add new fields without breaking old clients because clients only request what they know about. Deprecating a field is explicit: you mark it @deprecated in the schema and track usage. This is genuinely better than REST versioning for long-lived APIs with many clients.
When to choose GraphQL
GraphQL shines when you have many clients with different data needs (mobile vs. web vs. internal tools), deeply nested data models, and a team willing to invest in tooling. It is overkill for simple CRUD APIs with one or two consumers.
gRPC: when milliseconds and bytes matter
gRPC is a high-performance RPC framework built by Google on top of HTTP/2 and Protocol Buffers. Where REST sends human-readable JSON over HTTP/1.1, gRPC sends compact binary over multiplexed streams.
You define your service contract in a .proto file:
```proto
service UserService {
  rpc GetUser (UserRequest) returns (UserResponse);
  rpc ListUsers (ListRequest) returns (stream UserResponse);
}

message UserRequest {
  string id = 1;
}

message UserResponse {
  string id = 1;
  string name = 2;
  string email = 3;
}
```
The protobuf compiler generates client and server code in your language of choice. Type safety is enforced at compile time, not at runtime with JSON schema validation.
Performance numbers
In benchmarks, gRPC typically achieves 2x to 10x higher throughput than REST+JSON for the same payload. A 1 KB JSON user object might serialize to 300 bytes in protobuf. Deserialization is 5x to 20x faster because there is no string parsing. Over millions of requests per second between internal microservices, those savings translate directly into fewer servers and lower latency.
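To make the size gap concrete, the sketch below compares a JSON encoding of a user object against a simplified length-prefixed binary encoding. This is not the actual protobuf wire format (protobuf uses varints and field tags), but it shows the core reason binary wins: field names live in the schema, not in every payload.

```python
import json
import struct

user = {"id": 4821, "name": "Ada Lovelace", "email": "ada@example.com"}

# Text encoding: field names and quoting repeat in every single message.
as_json = json.dumps(user).encode()

# Simplified binary encoding: fixed-width id plus length-prefixed strings.
# Field names are implied by position, as a schema would imply them.
name = user["name"].encode()
email = user["email"].encode()
as_binary = struct.pack(
    f"<IH{len(name)}sH{len(email)}s",
    user["id"], len(name), name, len(email), email,
)

print(len(as_json), len(as_binary))  # binary is roughly half the size here
```

At millions of messages per second, that per-message overhead, plus the cost of parsing text instead of reading fixed offsets, is where the benchmark gap comes from.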
HTTP/2 multiplexing means multiple RPC calls share a single TCP connection without head-of-line blocking at the application layer. Streaming RPCs (server-stream, client-stream, bidirectional) let you push data continuously without polling.
When gRPC falls short
Browsers cannot call gRPC directly. You need gRPC-Web or a proxy that translates. Debugging is harder because payloads are binary: you cannot curl a gRPC endpoint and read the response. Tooling like grpcurl and BloomRPC helps, but the developer experience is worse than REST for exploratory testing.
gRPC is the right choice for internal service-to-service communication where latency and throughput matter more than human readability. It is the wrong choice for public-facing APIs that developers discover and test in a browser.
Choosing the right style
The decision is not ideological. It is driven by your constraints.
| Factor | REST | GraphQL | gRPC |
|---|---|---|---|
| Client diversity | Good | Excellent | Poor (no browser) |
| Performance | Moderate | Moderate | Excellent |
| Caching | HTTP-native | Requires effort | No standard |
| Developer onboarding | Simple | Moderate | Steep |
| Streaming | Awkward (SSE, WebSocket) | Subscriptions | Native |
| Schema enforcement | Optional (OpenAPI) | Built-in | Built-in (protobuf) |
Many production systems use all three. REST for public APIs, GraphQL for frontend teams, gRPC between backend services. The API gateway makes this possible.
The API gateway pattern
An API gateway is a single entry point that sits between clients and your backend services. It handles concerns that do not belong in individual services: authentication, rate limiting, request routing, protocol translation, and response aggregation.
```mermaid
graph TD
    A[Mobile Client] --> GW[API Gateway]
    B[Web Client] --> GW
    C[Partner API] --> GW
    GW --> S1[User Service<br/>gRPC]
    GW --> S2[Order Service<br/>gRPC]
    GW --> S3[Search Service<br/>REST]
    GW --> S4[Analytics<br/>GraphQL]
    style GW fill:#ff9800,stroke:#e65100,color:#000
```
The API gateway translates between external protocols and internal ones. Clients see REST; services speak gRPC.
The gateway can translate a single REST request from a mobile client into parallel gRPC calls to three backend services, merge the responses, and return a single JSON payload. This is the Backend for Frontend (BFF) pattern, and it keeps protocol complexity out of client code.
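The fan-out-and-merge step can be sketched with asyncio. The three fetch functions here are hypothetical stand-ins for real gRPC stubs; what matters is that the gateway issues them in parallel and the total latency is the slowest call, not the sum.

```python
import asyncio

# Hypothetical backend calls; in production these would be gRPC client stubs.
async def fetch_user(user_id):
    await asyncio.sleep(0.01)  # simulated network latency
    return {"id": user_id, "name": "Ada"}

async def fetch_orders(user_id):
    await asyncio.sleep(0.01)
    return [{"id": "91", "total": 42.0}]

async def fetch_recommendations(user_id):
    await asyncio.sleep(0.01)
    return [{"sku": "B-7"}]

async def get_profile_view(user_id):
    """Gateway handler: fan out in parallel, merge into one JSON payload."""
    user, orders, recs = await asyncio.gather(
        fetch_user(user_id),
        fetch_orders(user_id),
        fetch_recommendations(user_id),
    )
    return {"user": user, "orders": orders, "recommendations": recs}

profile = asyncio.run(get_profile_view("4821"))
```

The mobile client makes one REST call and receives one merged JSON document; it never learns that three services, speaking a different protocol, were involved.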
Kong, Envoy, AWS API Gateway, and NGINX are common choices. The key decision is whether the gateway is a thin routing layer or a thick orchestration layer. Thin gateways are easier to operate. Thick gateways become bottlenecks and single points of failure.
A gateway also gives you a natural place to enforce cross-cutting policies. Every request passes through it, so you can attach authentication, logging, tracing headers, and rate limiting without modifying any downstream service.
Rate limiting at the API layer
Rate limiting protects your system from abuse, misbehaving clients, and cascading failures. It belongs at the API layer because that is where you first see incoming traffic and where you can reject bad requests before they consume backend resources.
The most common algorithms are fixed window, sliding window, and token bucket. A fixed window counter resets every 60 seconds. A sliding window smooths out bursts at window boundaries. A token bucket allows short bursts up to a maximum and then enforces a steady rate.
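A token bucket is compact enough to sketch in full. This is a single-process version with illustrative parameters; a gateway enforcing global limits would keep the token count in a shared store instead of instance memory.

```python
import time

class TokenBucket:
    """Token bucket limiter: allows bursts up to `capacity` tokens,
    then enforces a steady refill of `rate` tokens per second."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity          # start full: bursts allowed at once
        self.last = time.monotonic()

    def allow(self, cost=1.0):
        """Spend `cost` tokens if available; refill based on elapsed time."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A bucket with `rate=100/60` and `capacity=20` would implement "100 requests per minute, bursts of 20" for one API key.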
For a public API serving 10,000 clients, you might set 100 requests per minute per API key. For an internal service handling 50,000 RPS, you might set 5,000 requests per second per upstream service. The numbers depend on your capacity planning and the cost of each operation.
Return 429 Too Many Requests with a Retry-After header so clients know when to try again. Include rate limit headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset) in every response so clients can self-throttle before hitting the wall.
Rate limiting interacts with your load balancing strategy. If you run rate limiting per gateway instance, a client hitting different instances gets N times the intended limit. Use a shared store (Redis is the standard choice, with roughly 0.1 ms per operation) to enforce global limits accurately.
For a deeper treatment of algorithms, distributed counters, and the interaction between rate limiting and backpressure, see the dedicated article on rate limiting.
Design principles that outlast any style
Regardless of whether you choose REST, GraphQL, or gRPC, certain principles hold.
Idempotency means sending the same request twice produces the same result. GET, PUT, and DELETE should always be idempotent. For non-idempotent operations like payments, use client-generated idempotency keys. Stripe processes billions of dollars through this pattern.
Backward compatibility means new fields are additive, old fields are never removed without a deprecation period, and enums are never renumbered. Breaking a client’s integration costs trust that takes months to rebuild.
Timeouts and retries belong on both sides. The client sets a deadline; the server respects it. If a downstream call will not finish in time, fail fast rather than holding the connection open. Exponential backoff with jitter prevents retry storms from amplifying failures.
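The backoff-with-jitter calculation is a one-liner worth writing out. This is the "full jitter" variant: each retry sleeps a random amount between zero and the capped exponential delay, so clients that failed at the same instant do not retry in lockstep. The `base` and `cap` values are illustrative defaults.

```python
import random

def backoff_with_jitter(attempt, base=0.1, cap=10.0):
    """Full-jitter exponential backoff: sleep a random duration between 0
    and min(cap, base * 2**attempt), spreading out synchronized retries."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Attempt 0 waits up to 100 ms, attempt 3 up to 800 ms, and the cap keeps late attempts from waiting minutes.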
Observability means every API call produces a trace ID, a latency measurement, and a status code. Without these three signals, debugging production issues is guesswork. Instrument at the gateway layer and propagate trace context through every downstream call.
What comes next
You have designed APIs and placed a gateway in front of them. But how do you protect those APIs from being overwhelmed? The next article, Rate limiting and throttling, covers the algorithms, distributed counters, and backpressure mechanisms that keep your system stable under load.