GraphQL server design
In this series (15 parts)
- Backend system design scope
- Designing RESTful APIs
- Authentication and session management
- Database design for backend systems
- Caching in backend systems
- Background jobs and task queues
- File upload and storage
- Search integration
- Email and notification delivery
- Webhooks: design and security
- Payments integration
- Multi-tenancy patterns
- Backend for Frontend (BFF) pattern
- GraphQL server design
- gRPC and internal service APIs
GraphQL is a query language for APIs. Instead of the server deciding what data each endpoint returns, the client sends a query describing the exact shape of data it wants. The server resolves that query by calling the appropriate data sources and returns a response that matches the query shape precisely.
This solves the over-fetching and under-fetching problems that drive teams toward the BFF pattern. But GraphQL introduces its own set of challenges: the N+1 query problem, authorization complexity, and the temptation to expose your entire data model as a public API.
For foundational API design principles, see API design. For caching strategies that interact with GraphQL’s single-endpoint model, see caching.
Schema design
The schema is your contract. It defines every type, field, and relationship that clients can query. Good schema design is the difference between a GraphQL API that scales for years and one that requires a rewrite in six months.
```graphql
type Query {
  product(id: ID!): Product
  products(filter: ProductFilter, first: Int, after: String): ProductConnection!
}

type Product {
  id: ID!
  name: String!
  description: String
  price: Money!
  category: Category!
  reviews(first: Int, after: String): ReviewConnection!
  seller: User!
}

type Money {
  amount: Int!
  currency: String!
}

type ReviewConnection {
  edges: [ReviewEdge!]!
  pageInfo: PageInfo!
  totalCount: Int!
}
```
Design principles:
- Think in graphs, not tables. Your schema should model domain relationships, not mirror database tables. A `Product` has `reviews` and a `seller`, expressed as graph edges, regardless of how many database tables back them.
- Use connections for lists. The Relay connection spec (`edges`, `pageInfo`, `cursor`) is the standard pagination pattern. It supports cursor-based pagination, which is more reliable than offset-based pagination for large datasets.
- Custom scalars for domain types. Use `Money`, `DateTime`, and `URL` instead of raw `Int` and `String`. This adds type safety and documentation to the schema.
- Nullable by default, non-null by intention. Mark a field as non-null (`!`) only when you can guarantee it will always have a value. A field that returns null during partial failure is better than one that throws an error.
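The schema above leaves `PageInfo` and `ReviewEdge` undefined. Per the Relay connection spec, they look like this (the `Review` type itself is elided):

```graphql
# Standard Relay pagination metadata, shared by every connection
type PageInfo {
  hasNextPage: Boolean!
  hasPreviousPage: Boolean!
  startCursor: String
  endCursor: String
}

# One edge per list item, carrying the cursor used for `after`
type ReviewEdge {
  cursor: String!
  node: Review!
}
```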
Resolvers
Each field in the schema maps to a resolver function. The resolver fetches the data for that field. For scalar fields on a type, the default resolver simply reads the property from the parent object. For relationships and computed fields, you write explicit resolvers.
```javascript
const resolvers = {
  Query: {
    product: (_, { id }, context) => context.dataSources.products.getById(id),
  },
  Product: {
    reviews: (product, { first, after }, context) =>
      context.dataSources.reviews.getByProductId(product.id, { first, after }),
    seller: (product, _, context) =>
      context.dataSources.users.getById(product.sellerId),
    price: (product) => ({
      amount: product.priceCents,
      currency: product.currency,
    }),
  },
};
```
Resolvers are composable. The Product.reviews resolver does not care how the product was fetched. It receives the product object and fetches reviews independently. This composability is powerful but creates the N+1 problem.
The N+1 problem
Consider this query:
```graphql
{
  products(first: 20) {
    edges {
      node {
        name
        seller {
          name
        }
      }
    }
  }
}
```
The products resolver runs one query to fetch 20 products. Then for each product, the seller resolver runs a separate query to fetch the seller. That is 1 + 20 = 21 database queries for a single GraphQL request. Add reviews to the query and you get 1 + 20 + 20 = 41 queries.
```mermaid
flowchart TB
  subgraph Without["Without DataLoader"]
    Q1["Query: 20 products"] --> DB1["1 DB query"]
    Q1 --> S1["Seller for product 1"] --> DB2["1 DB query"]
    Q1 --> S2["Seller for product 2"] --> DB3["1 DB query"]
    Q1 --> S3["Seller for product 3"] --> DB4["1 DB query"]
    Q1 --> Sdots["... 17 more"] --> DBdots["17 DB queries"]
  end
  subgraph With["With DataLoader"]
    Q2["Query: 20 products"] --> DB5["1 DB query"]
    Q2 --> Batch["Batch: all 20 seller IDs"] --> DB6["1 DB query<br/>WHERE id IN (...)"]
  end
```
The N+1 problem and DataLoader’s batching solution. Without DataLoader: 21 queries. With DataLoader: 2 queries.
DataLoader
DataLoader solves N+1 by batching and caching within a single request. Instead of fetching each seller immediately, the resolver registers the seller ID with DataLoader. At the end of the current execution tick, DataLoader collects all registered IDs and makes a single batched query.
```javascript
// Create a DataLoader per request
const userLoader = new DataLoader(async (userIds) => {
  const users = await db.query(
    "SELECT * FROM users WHERE id = ANY($1)",
    [userIds]
  );
  // Return in the same order as the input IDs
  const userMap = new Map(users.map(u => [u.id, u]));
  return userIds.map(id => userMap.get(id) || null);
});

// In the resolver
const resolvers = {
  Product: {
    seller: (product, _, context) =>
      context.loaders.user.load(product.sellerId),
  },
};
```
Critical rules for DataLoader:
- Create a new instance per request. DataLoader caches results for its lifetime; sharing an instance across requests serves stale data and can leak one user's results into another's response.
- Return results in input order. The batch function must return an array aligned with the input keys array.
- Handle missing keys. Return `null` for IDs that do not exist; do not skip them.
- Scope to one data source. Each DataLoader should correspond to one table or service. Do not try to batch across unrelated sources.
DataLoader keeps query count constant regardless of result size. Without it, query count grows linearly.
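To make the batching mechanics concrete, here is a toy stand-in for DataLoader (not the real `dataloader` package): `load()` queues keys, and a microtask scheduled at the end of the current tick flushes the whole batch in one call.

```javascript
// Toy stand-in for DataLoader, sketching batching and per-request caching.
class TinyLoader {
  constructor(batchFn) {
    this.batchFn = batchFn;
    this.queue = [];        // pending { key, resolve } entries
    this.cache = new Map(); // per-request cache: key -> Promise
  }
  load(key) {
    if (this.cache.has(key)) return this.cache.get(key); // dedupe repeats
    const promise = new Promise((resolve) => {
      // The first key queued this tick schedules a flush at end of tick
      if (this.queue.length === 0) queueMicrotask(() => this.flush());
      this.queue.push({ key, resolve });
    });
    this.cache.set(key, promise);
    return promise;
  }
  async flush() {
    const batch = this.queue;
    this.queue = [];
    // One batched call covers every key collected during the tick
    const results = await this.batchFn(batch.map((entry) => entry.key));
    batch.forEach((entry, i) => entry.resolve(results[i]));
  }
}

// Three load() calls in one tick -> one batched "query"
let batchCalls = 0;
const loader = new TinyLoader(async (ids) => {
  batchCalls += 1;
  return ids.map((id) => ({ id, name: `user-${id}` }));
});

const demo = Promise.all([
  loader.load(1),
  loader.load(2),
  loader.load(1), // served from the per-request cache
]).then(([a, b, c]) => {
  console.log(batchCalls, a.name, b.name, c === a); // 1 user-1 user-2 true
  return [a, b, c];
});
```

The real DataLoader adds error propagation, custom cache keys, and scheduling controls, but the shape is the same.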
Authorization at the resolver level
GraphQL’s flexible queries mean you cannot rely on endpoint-level authorization. A single query might access products (public), order history (requires authentication), and admin analytics (requires admin role). Authorization must happen at the resolver level.
```javascript
const resolvers = {
  Query: {
    product: (_, { id }, context) => {
      // Public, no auth needed
      return context.dataSources.products.getById(id);
    },
    myOrders: (_, args, context) => {
      if (!context.user) throw new AuthenticationError("Login required");
      return context.dataSources.orders.getByUserId(context.user.id, args);
    },
    adminAnalytics: (_, args, context) => {
      if (context.user?.role !== "admin")
        throw new ForbiddenError("Admin access required");
      return context.dataSources.analytics.getSummary(args);
    },
  },
  Order: {
    paymentDetails: (order, _, context) => {
      // Field-level auth: only the order owner can see payment details
      if (order.userId !== context.user?.id)
        throw new ForbiddenError("Not your order");
      return context.dataSources.payments.getByOrderId(order.id);
    },
  },
};
```
Patterns for authorization:
- Directive-based: define custom schema directives like `@auth(role: ADMIN)` and apply them declaratively in the schema.
- Middleware layer: wrap resolvers with an authorization function that checks permissions before executing the resolver.
- Inline checks: check permissions inside each resolver. Simple but repetitive.
The directive approach is cleanest for large schemas. The schema itself documents who can access what.
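The middleware-layer option can be sketched as a higher-order function that wraps a resolver. `requireRole` and the error classes here are illustrative names, not a specific library's API:

```javascript
// Illustrative error classes (Apollo Server ships similar ones)
class AuthenticationError extends Error {}
class ForbiddenError extends Error {}

// Wrap a resolver in a permission check instead of repeating inline checks
const requireRole = (role, resolver) => (parent, args, context, info) => {
  if (!context.user) throw new AuthenticationError("Login required");
  if (context.user.role !== role)
    throw new ForbiddenError(`${role} access required`);
  return resolver(parent, args, context, info);
};

const resolvers = {
  Query: {
    adminAnalytics: requireRole("admin", (_, args, context) =>
      context.dataSources.analytics.getSummary(args)
    ),
  },
};

// Quick check with a stubbed context:
const adminContext = {
  user: { role: "admin" },
  dataSources: { analytics: { getSummary: () => ({ ok: true }) } },
};
console.log(resolvers.Query.adminAnalytics(null, {}, adminContext)); // { ok: true }
```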
Persisted queries
In production, letting clients send arbitrary query strings is both a performance and security risk. An attacker can craft deeply nested queries that consume enormous server resources. A query like { user { friends { friends { friends { ... } } } } } nested 50 levels deep can bring your server down.
Persisted queries solve this. During your build step, extract all queries your clients use and register them with the server. At runtime, clients send a query hash instead of the full query text. The server looks up the hash, finds the pre-registered query, and executes it.
Benefits:
- Security: only pre-approved queries can run. No arbitrary queries from attackers.
- Performance: smaller request payloads (just a hash instead of a full query string). The server can pre-plan query execution.
- Caching: CDNs can cache responses keyed by the query hash via GET requests.
If you cannot use persisted queries (because your API is public and you want clients to write their own queries), implement query complexity analysis. Assign a cost to each field and reject queries that exceed a threshold.
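A depth limit is the simplest form of complexity analysis. Here is a toy sketch over a simplified selection tree; a production server would instead walk the parsed AST from the `graphql` package, typically via a validation rule:

```javascript
// Each node is { name, selections: [...] }; leaves have empty selections.
const depthOf = (selection) =>
  selection.selections && selection.selections.length
    ? 1 + Math.max(...selection.selections.map(depthOf))
    : 1;

const MAX_DEPTH = 10; // tune per schema

const rejectIfTooDeep = (query) => {
  const depth = depthOf(query);
  if (depth > MAX_DEPTH) {
    throw new Error(`Query depth ${depth} exceeds limit ${MAX_DEPTH}`);
  }
  return depth;
};

// "{ user { name } }" as a simplified tree:
const example = {
  name: "query",
  selections: [{ name: "user", selections: [{ name: "name", selections: [] }] }],
};
console.log(rejectIfTooDeep(example)); // 3
```

Cost-based analysis extends the same walk: instead of counting levels, sum a per-field cost, multiplying list fields by their requested page size.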
Caching with GraphQL
GraphQL’s single endpoint (POST /graphql) makes HTTP caching harder than REST. With REST, each URL is a natural cache key. With GraphQL, every request hits the same URL with a different body.
Strategies:
| Strategy | How it works | Trade-offs |
|---|---|---|
| Response caching | Cache full responses keyed by query hash + variables | Coarse-grained, stale data risk |
| Persisted queries via GET | GET /graphql?id=abc&variables={} | CDN-cacheable, requires persisted queries |
| Field-level caching | Cache individual resolver results in Redis | Fine-grained, complex invalidation |
| DataLoader request cache | Deduplicate within a single request | Automatic, no cross-request benefit |
For most applications, field-level caching with DataLoader gives the best balance. Cache frequently accessed, slowly changing data (product details, user profiles) at the resolver level with TTLs. Let DataLoader handle within-request deduplication.
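A sketch of that resolver-level caching, with an in-memory `Map` standing in for Redis and illustrative key and TTL choices:

```javascript
// Minimal sketch of resolver-level TTL caching.
const cache = new Map(); // key -> { value, expiresAt }

const cached = (keyFn, ttlMs, resolver) =>
  async (parent, args, context, info) => {
    const key = keyFn(parent, args);
    const hit = cache.get(key);
    if (hit && hit.expiresAt > Date.now()) return hit.value; // cache hit
    const value = await resolver(parent, args, context, info);
    cache.set(key, { value, expiresAt: Date.now() + ttlMs });
    return value;
  };

// Product details change slowly, so a 60s TTL is a reasonable start
let dbCalls = 0;
const getProduct = cached(
  (_, args) => `product:${args.id}`,
  60_000,
  async (_, args) => {
    dbCalls += 1; // counts actual data-source hits
    return { id: args.id, name: "Widget" };
  }
);
```

With Redis the `Map` operations become `GET`/`SET` with an expiry, and invalidation on writes becomes the hard part.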
When GraphQL is the wrong choice
GraphQL is not universally superior to REST. It is the wrong choice when:
- Your API is simple. If you have ten CRUD endpoints with straightforward request/response shapes, GraphQL adds schema complexity without payoff.
- File uploads. GraphQL has no native file upload support. You end up bolting on multipart specs or using a separate REST endpoint anyway.
- Real-time streaming. GraphQL subscriptions exist but are more complex to scale than WebSockets or SSE with REST endpoints.
- Internal service-to-service calls. For backend microservice communication, gRPC is faster and provides stronger type safety with less overhead. GraphQL is designed for client-server boundaries.
- Your team is small. The schema, resolver, DataLoader, and authorization layers add development overhead. For a team of two, REST with well-designed endpoints ships faster.
Each API style has a sweet spot. GraphQL excels for client-facing APIs with complex, variable data needs. REST wins for simplicity. gRPC wins for internal services.
Schema evolution
GraphQL schemas evolve differently from REST APIs. There are no URL-based versions. Instead, you evolve the schema by adding new fields and deprecating old ones.
```graphql
type Product {
  id: ID!
  name: String!
  price: Money!
  # Deprecated in favor of price, which carries currency information
  priceInCents: Int @deprecated(reason: "Use the price field instead")
}
```
Rules for schema evolution:
- Never remove a field without deprecation. Add the `@deprecated` directive with a reason and a migration path. Monitor usage of deprecated fields. Remove only when usage drops to zero.
- Never change a field's type. If `price` was an `Int` and you need it to be a `Money` object, add a new field (`priceV2` or, better, `price` as `Money` with the old field renamed to `priceInCents`).
- Additive changes are safe. Adding new fields, new types, and new enum values never breaks existing clients.
- Use input types for mutations. Instead of positional arguments, use input objects. This makes adding optional fields backward-compatible.
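To illustrate the input-type rule, here is a sketch of an update mutation; `UpdateProductInput` and `MoneyInput` are hypothetical names:

```graphql
input MoneyInput {
  amount: Int!
  currency: String!
}

input UpdateProductInput {
  id: ID!
  name: String
  price: MoneyInput
}

type Mutation {
  # Adding a new optional field to UpdateProductInput later is
  # backward-compatible; adding a new positional argument is not.
  updateProduct(input: UpdateProductInput!): Product!
}
```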
Monitoring and observability
Track these metrics for your GraphQL server:
- Query complexity distribution: identify expensive queries before they cause outages.
- Resolver latency per field: find slow resolvers. A single slow resolver in a deeply nested query multiplies its impact.
- Error rate by field and query: catch resolver-level failures that might not surface as top-level errors.
- DataLoader batch sizes: small batches mean DataLoader is not coalescing well. Large batches might indicate a query that should be paginated.
- Cache hit rate per resolver: verify that your caching strategy is effective.
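Per-field latency can be collected with a resolver wrapper along these lines; a real setup would export to a metrics system (Prometheus, OpenTelemetry, or a server plugin), and the `metrics` array here is a stand-in sink:

```javascript
// Sketch of per-field latency tracking via a resolver wrapper.
const metrics = [];

const timed = (fieldName, resolver) => async (...resolverArgs) => {
  const start = process.hrtime.bigint();
  try {
    return await resolver(...resolverArgs);
  } finally {
    // Record duration in ms even when the resolver throws
    const ms = Number(process.hrtime.bigint() - start) / 1e6;
    metrics.push({ field: fieldName, ms });
  }
};

// Wrap the resolvers you want to observe
const resolvers = {
  Product: {
    seller: timed("Product.seller", (product, _, context) =>
      context.dataSources.users.getById(product.sellerId)
    ),
  },
};
```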
What comes next
GraphQL works well at the client-server boundary. For service-to-service communication inside your backend, the next article covers gRPC and internal service APIs: protocol buffers, streaming RPCs, deadlines, and when to use gRPC-gateway for HTTP fallback.