High Level Design · Part 3

Service discovery and registration

In this series (12 parts)
  1. Monolith vs microservices
  2. Microservice communication patterns
  3. Service discovery and registration
  4. Event-driven architecture
  5. Distributed data patterns
  6. Caching architecture patterns
  7. Search architecture
  8. Storage systems at scale
  9. Notification systems
  10. Real-time systems architecture
  11. Batch and stream processing
  12. Multi-region and global systems

In a monolith, calling another module is a function call to a known address in memory. In a microservice architecture, service instances come and go constantly: new versions deploy, containers restart, autoscalers add and remove instances based on load. Hardcoding IP addresses in configuration files does not survive this reality. Service discovery is the mechanism that lets services find each other dynamically.

The core problem

Suppose the order service needs to call the inventory service. In production, the inventory service runs on three instances behind a load balancer. Yesterday those instances were at 10.0.1.5, 10.0.1.6, and 10.0.1.7. Today, after a rolling deploy, they are at 10.0.2.10, 10.0.2.11, and 10.0.2.12. A container that crashed and restarted might have an entirely new IP.

The order service needs answers to two questions: what are the current healthy instances of the inventory service, and which one should I send this request to? Service discovery answers the first question. Load balancing answers the second.

Service registry

At the heart of any discovery mechanism is a service registry: a database of available service instances and their network locations. Think of it as a phone book that updates in real time.

graph TD
  IS1["Inventory Svc (10.0.2.10)"] -->|"register"| SR["Service Registry"]
  IS2["Inventory Svc (10.0.2.11)"] -->|"register"| SR
  IS3["Inventory Svc (10.0.2.12)"] -->|"register"| SR
  IS1 -->|"heartbeat"| SR
  IS2 -->|"heartbeat"| SR
  IS3 -->|"heartbeat"| SR
  OS["Order Service"] -->|"lookup: inventory-svc"| SR
  SR -->|"[10.0.2.10, 10.0.2.11, 10.0.2.12]"| OS

Service instances register themselves and send heartbeats. Consumers query the registry to discover healthy instances.

When a service instance starts, it registers itself with the registry, providing its address, port, health check endpoint, and metadata like version or region. It then sends periodic heartbeats to prove it is still alive. If the registry stops receiving heartbeats, it marks the instance as unhealthy and eventually removes it.
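The register/heartbeat/expiry cycle can be sketched in a few lines. This is a minimal in-memory model for illustration, not a real registry like Consul or Eureka; the 30-second TTL is an assumed value.

```python
import time

class ServiceRegistry:
    """Minimal in-memory registry sketch: instances register, send
    heartbeats, and expire once heartbeats stop arriving."""

    def __init__(self, ttl_seconds=30):
        self.ttl = ttl_seconds
        self.instances = {}  # (service, address) -> last heartbeat time

    def register(self, service, address):
        # Registration counts as the first heartbeat.
        self.instances[(service, address)] = time.monotonic()

    def heartbeat(self, service, address):
        if (service, address) in self.instances:
            self.instances[(service, address)] = time.monotonic()

    def lookup(self, service):
        # Only return instances whose last heartbeat is within the TTL.
        now = time.monotonic()
        return [addr for (svc, addr), seen in self.instances.items()
                if svc == service and now - seen <= self.ttl]

registry = ServiceRegistry(ttl_seconds=30)
registry.register("inventory-svc", "10.0.2.10:8080")
registry.register("inventory-svc", "10.0.2.11:8080")
print(registry.lookup("inventory-svc"))
```

A real registry would run the expiry sweep in the background and actually delete stale entries; here staleness is checked lazily at lookup time.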

Consul and Eureka are the two most commonly referenced registry implementations. Consul uses a gossip protocol for membership and supports both service discovery and distributed key-value storage. Eureka, built by Netflix, follows a peer-to-peer replication model where registry nodes sync with each other.

Client-side discovery

In client-side discovery, the calling service queries the registry directly and picks an instance using its own load-balancing logic.

sequenceDiagram
  participant OS as Order Service
  participant SR as Service Registry
  participant IS1 as Inventory (10.0.2.10)
  participant IS2 as Inventory (10.0.2.11)
  OS->>SR: GET /services/inventory-svc
  SR-->>OS: [10.0.2.10, 10.0.2.11, 10.0.2.12]
  Note over OS: Apply round-robin
  OS->>IS1: GET /stock/item-42
  IS1-->>OS: {available: true}

Client-side discovery: the order service fetches the instance list, applies load balancing locally, and calls the selected instance.

The client typically caches the instance list and refreshes it periodically or on failure. Netflix Ribbon (now in maintenance mode, replaced by Spring Cloud LoadBalancer) popularized this approach. The client library handles retries, circuit breaking, and instance selection.

The advantage is a simpler infrastructure layer: there is no extra hop through a load balancer. The disadvantage is that every service needs a discovery-aware client library, coupling your services to the registry implementation. If you run services in multiple languages, you need client libraries for each one.
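The cache-and-rotate behavior described above can be sketched as follows. `fetch_instances` stands in for a real registry call, and the 30-second refresh interval is an assumed value:

```python
import itertools
import time

class DiscoveryClient:
    """Client-side discovery sketch: cache the instance list from the
    registry and pick instances round-robin."""

    def __init__(self, fetch_instances, refresh_seconds=30):
        self.fetch = fetch_instances      # hypothetical registry lookup
        self.refresh = refresh_seconds
        self.cached = []
        self.fetched_at = 0.0
        self.counter = itertools.count()

    def choose(self, service):
        now = time.monotonic()
        # Refresh the cached list when empty or past the refresh interval.
        if not self.cached or now - self.fetched_at > self.refresh:
            self.cached = self.fetch(service)
            self.fetched_at = now
        # Round-robin: step through the cached list on each call.
        return self.cached[next(self.counter) % len(self.cached)]

client = DiscoveryClient(lambda svc: ["10.0.2.10", "10.0.2.11", "10.0.2.12"])
picks = [client.choose("inventory-svc") for _ in range(4)]
print(picks)  # cycles through the three instances, then wraps
```

A production client library would layer retries and circuit breaking on top of `choose`, and refresh the cache immediately when a call to a cached instance fails.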

Server-side discovery

In server-side discovery, the caller sends requests to a load balancer or router that queries the registry on its behalf. The calling service only needs to know the load balancer’s address, which stays stable.

sequenceDiagram
  participant OS as Order Service
  participant LB as Load Balancer
  participant SR as Service Registry
  participant IS1 as Inventory (10.0.2.10)
  OS->>LB: GET inventory-svc/stock/item-42
  LB->>SR: Lookup inventory-svc
  SR-->>LB: [10.0.2.10, 10.0.2.11, 10.0.2.12]
  LB->>IS1: GET /stock/item-42
  IS1-->>LB: {available: true}
  LB-->>OS: {available: true}

Server-side discovery: the load balancer resolves the service name and routes the request.

AWS Elastic Load Balancer, Kubernetes Services, and Nginx with service discovery plugins all implement this pattern. The calling service is completely decoupled from the registry. The cost is an extra network hop through the load balancer and a dependency on the load balancer’s availability.

Kubernetes effectively eliminated this debate for most teams. Kubernetes Services provide server-side discovery out of the box: you access services by name (inventory-svc.default.svc.cluster.local), and kube-proxy handles routing to healthy pods. Most teams running on Kubernetes never implement custom service discovery.

DNS-based discovery

DNS is the internet’s original service discovery mechanism, and it works surprisingly well for many use cases. When the order service resolves inventory-svc.internal, DNS returns the IP addresses of healthy instances. SRV records can include port numbers and weights for load balancing.

graph TD
  OS["Order Service"] -->|"resolve inventory-svc.internal"| DNS["DNS Server"]
  DNS -->|"A: 10.0.2.10, 10.0.2.11, 10.0.2.12"| OS
  OS -->|"request"| IS1["Inventory (10.0.2.10)"]
  HC["Health Checker"] -->|"update records"| DNS
  HC -->|"check /health"| IS1
  HC -->|"check /health"| IS2["Inventory (10.0.2.11)"]
  HC -->|"check /health"| IS3["Inventory (10.0.2.12)"]

DNS-based discovery: a health checker updates DNS records. Services resolve names through standard DNS lookups.

Consul, AWS Cloud Map, and CoreDNS (the default in Kubernetes) all support DNS-based discovery. The beauty of DNS is universality: every language, every framework, every operating system knows how to resolve DNS. No special client library required.

The catch is TTL caching. DNS clients and intermediate resolvers cache responses, meaning changes in the instance list take time to propagate. Setting TTLs too low (1-5 seconds) creates a flood of DNS queries. Setting them too high means clients keep sending traffic to dead instances. A TTL of 10-30 seconds works for most production systems, with client-side retry logic to handle the gap.
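The TTL trade-off can be made concrete with a small client-side cache over standard DNS resolution. This is a sketch assuming a 15-second TTL; real resolvers and operating systems impose their own caching layers on top:

```python
import socket
import time

class TTLResolver:
    """Sketch of TTL-based caching over standard DNS resolution:
    re-resolve only after `ttl` seconds, trading freshness for query load."""

    def __init__(self, ttl=15):
        self.ttl = ttl
        self.cache = {}  # hostname -> (expiry time, [addresses])

    def resolve(self, hostname):
        now = time.monotonic()
        entry = self.cache.get(hostname)
        if entry and entry[0] > now:
            return entry[1]  # cache hit: skip the DNS query entirely
        # getaddrinfo returns every address record the resolver knows about.
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
        addresses = sorted({info[4][0] for info in infos})
        self.cache[hostname] = (now + self.ttl, addresses)
        return addresses

resolver = TTLResolver(ttl=15)
print(resolver.resolve("localhost"))
```

During the 15-second window, an instance that died keeps receiving traffic until the cache expires, which is exactly the gap that client-side retry logic has to cover.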

Health checks

Discovery without health checks is just a static list with extra steps. Health checks ensure the registry reflects the actual state of each instance, not just whether the process is running.

A good health check verifies that the service can do real work: connect to its database, reach its dependencies, process a basic request. A /health endpoint that always returns 200 without checking anything is worse than useless because it creates a false sense of reliability.
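A dependency-aware health endpoint might be structured like this. The check names and lambdas are illustrative stand-ins; a real service would run an actual database ping and downstream API probe:

```python
def health_check(checks):
    """Run a dict of named dependency checks; the service is healthy
    only if every check passes. Failures map to HTTP 503."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A crashing check counts as a failing dependency.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results

# Hypothetical probes: a real service would run SELECT 1 against its
# database and a lightweight request against each downstream dependency.
status, results = health_check({
    "database": lambda: True,        # e.g. SELECT 1 succeeded
    "inventory_api": lambda: False,  # e.g. downstream request timed out
})
print(status, results)
```

Returning the per-dependency results alongside the status code makes the endpoint useful for debugging, not just for the registry's healthy/unhealthy decision.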

There are three common health check models. In self-reporting, the service sends heartbeats to the registry. If heartbeats stop, the instance is presumed dead. This is simple but misses cases where the process is alive but cannot serve traffic (for example, a deadlocked thread pool). In external polling, the registry or a health checker periodically hits the service’s health endpoint. This catches more failure modes but adds polling traffic. In hybrid approaches, services send heartbeats and the registry performs periodic deep health checks.

graph TD
  subgraph HealthCheck["Health Check Pipeline"]
      L["Liveness: is the process running?"]
      R["Readiness: can it handle traffic?"]
      S["Startup: has it finished initializing?"]
  end
  L -->|"fail"| Restart["Restart container"]
  R -->|"fail"| Remove["Remove from LB pool"]
  S -->|"fail"| Wait["Wait, don't kill yet"]

Kubernetes-style health probes: liveness, readiness, and startup checks serve different purposes.

Kubernetes distinguishes between liveness probes (is the process stuck? restart it), readiness probes (can it handle traffic? remove it from the service endpoints), and startup probes (is it still initializing? give it time). This three-probe model prevents premature traffic routing during slow startups and catches zombie processes that are alive but unresponsive.

Registration patterns

There are two ways instances get into the registry. In self-registration, the service instance registers itself on startup and deregisters on shutdown. The service needs to know about the registry and handle registration logic, typically through a framework integration. Spring Cloud, for example, automatically registers services with Eureka.

In third-party registration, an external component watches for new instances and registers them. Kubernetes does this natively: the kubelet reports pod status, and the endpoints controller updates the Service object. Consul can use a registrator sidecar that watches Docker events and registers containers automatically.

Third-party registration keeps the service code clean of discovery logic. Self-registration gives the service more control over its metadata and health reporting. In practice, most container orchestration platforms handle registration transparently, making this a decision you rarely need to make explicitly.
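The self-registration lifecycle reduces to two hooks: register on startup, deregister on shutdown. A minimal sketch, using an in-memory stand-in for a real registry reached over HTTP:

```python
import atexit

class Registry:
    """Stand-in for a real registry such as Consul or Eureka."""
    def __init__(self):
        self.entries = set()
    def register(self, service, address):
        self.entries.add((service, address))
    def deregister(self, service, address):
        self.entries.discard((service, address))

def self_register(registry, service, address):
    """Self-registration pattern: register at startup and hook
    deregistration to process shutdown."""
    registry.register(service, address)
    # atexit covers clean shutdowns; the registry's heartbeat TTL is
    # what catches crashes, where this hook never runs.
    atexit.register(registry.deregister, service, address)

registry = Registry()
self_register(registry, "inventory-svc", "10.0.2.10:8080")
print(registry.entries)
```

The comment in `self_register` is the important caveat: graceful deregistration is an optimization, and heartbeat expiry remains the safety net for instances that die without running their shutdown hooks.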

Failure scenarios

Service discovery fails in interesting ways. A network partition can split the registry, causing different clients to see different sets of instances. A cascading failure can cause all instances to fail their health checks simultaneously, leaving the registry with an empty list. A thundering herd after a registry outage can overwhelm services as all clients simultaneously refresh their caches and send requests.

Good registries protect against these scenarios. Consul uses the Raft consensus protocol, which requires a quorum for writes, and supports stale reads to maintain availability during partitions. Eureka uses peer-to-peer replication and a self-preservation mode that stops deregistering instances when too many heartbeats fail simultaneously, under the assumption that the network is the problem, not the services.
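The self-preservation idea reduces to a threshold on the heartbeat renewal rate. A sketch; the 85% cutoff mirrors Eureka's default renewal-percent threshold, but treat the exact value as configuration, not a constant:

```python
def should_evict(expected_heartbeats, received_heartbeats, threshold=0.85):
    """Eureka-style self-preservation sketch: if the fraction of expected
    heartbeats actually received drops below the threshold, assume a
    network problem and stop evicting instances from the registry."""
    renewal_rate = received_heartbeats / expected_heartbeats
    return renewal_rate >= threshold

# Normal operation: a couple of genuinely dead instances get evicted.
print(should_evict(100, 97))
# Mass heartbeat loss: more likely a partition, so keep the registry intact.
print(should_evict(100, 40))
```

The trade-off is deliberate: during a real mass failure the registry will serve stale entries, but that is considered safer than emptying the instance list and taking every consumer down with it.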

What comes next

Synchronous discovery works for request-response interactions, but many microservice architectures rely heavily on events flowing between services. Continue with event-driven architecture to understand how event sourcing, CQRS, and the outbox pattern enable loosely coupled, eventually consistent systems.
