Service mesh concepts
In this series (14 parts)
- Why Kubernetes exists
- Kubernetes architecture
- Core Kubernetes objects
- Kubernetes networking
- Storage in Kubernetes
- Kubernetes configuration and secrets
- Resource management and autoscaling
- Kubernetes workload types
- Kubernetes observability
- Kubernetes security
- Helm and package management
- GitOps with ArgoCD
- Kubernetes cluster operations
- Service mesh concepts
Your services talk to each other over the network. That network is inside a cluster, but it is still a network. It drops packets, introduces latency, and offers zero authentication by default. Any pod can call any other pod. Nothing is encrypted. Nothing is observed. You only find out something is wrong when a user complains.
A service mesh fixes this at the infrastructure layer. It injects transparent proxies that handle mTLS, traffic routing, retries, and telemetry for every request. Your application code stays untouched.
What a service mesh adds
Three capabilities define a mesh.
Mutual TLS (mTLS). Every connection between services is encrypted and authenticated. Both sides present certificates managed by the mesh control plane. A compromised pod cannot impersonate another service because it lacks the correct identity certificate. This is zero-trust networking at the transport layer, applied automatically.
Observability. The proxy sitting in the data path sees every request. It emits metrics (latency, error rate, throughput), distributed traces, and access logs without any instrumentation in your code. You get a full topology map of service-to-service communication.
Traffic management. The mesh intercepts outbound traffic and applies routing rules. You can shift 5% of traffic to a canary deployment, mirror production requests to a staging service, or inject faults for chaos testing. All of this is configured declaratively through Kubernetes custom resources.
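As a sketch of the mirroring case: a VirtualService can copy live traffic to a second service while still serving responses from the primary (the `orders` and `orders-staging` service names here are hypothetical placeholders).

```yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: orders-mirror
spec:
  hosts:
  - orders
  http:
  - route:
    - destination:
        host: orders        # clients get responses from here
    mirror:
      host: orders-staging  # receives a fire-and-forget copy of each request
    mirrorPercentage:
      value: 100.0          # mirror all traffic; lower this to sample
```

Mirrored requests are fire-and-forget: responses from the staging service are discarded, so a broken staging deployment cannot affect users.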
Sidecar vs ambient mode
There are two architectural patterns for injecting the data plane into your workloads.
graph LR
  subgraph "Sidecar mode"
    A1[Client Pod] --> P1[Envoy Proxy]
    P1 -->|mTLS| P2[Envoy Proxy]
    P2 --> B1[Server Pod]
  end
  subgraph "Ambient mode"
    A2[Client Pod] --> Z1[ztunnel - Node]
    Z1 -->|mTLS| Z2[ztunnel - Node]
    Z2 --> W1[Waypoint Proxy]
    W1 --> B2[Server Pod]
  end
Sidecar mode injects a proxy per pod. Ambient mode uses per-node ztunnels and optional waypoint proxies.
Sidecar mode is the traditional approach. The mesh injects an Envoy proxy container into every pod. All inbound and outbound traffic passes through this sidecar. It gives you full L7 control per workload, but the cost is real. Each sidecar consumes memory (50-100 MB typical) and adds a small latency hop. In a cluster with 500 pods, that is 500 extra containers to schedule and monitor.
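With Istio, sidecar injection is typically enabled by labeling a namespace; the injection webhook then adds the Envoy container to every new pod. A minimal sketch (namespace name is illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: bookinfo
  labels:
    istio-injection: enabled  # Istio's webhook injects an Envoy sidecar into new pods here
```

Existing pods are not modified retroactively; they pick up the sidecar on their next restart.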
Ambient mode removes the sidecar entirely. A shared ztunnel daemon runs on each node and handles L4 encryption (mTLS) for all pods on that node. When you need L7 features like header-based routing or request-level authorization, you deploy a waypoint proxy scoped to a service account or namespace. This cuts resource overhead significantly. Most workloads only need L4, and they get it without any per-pod injection.
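Enrolling a namespace in ambient mode is also a label change, assuming an Istio install with the ambient profile. A sketch (namespace name is illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: bookinfo
  labels:
    istio.io/dataplane-mode: ambient  # traffic is redirected through the node's ztunnel
```

No pod restarts are required, since nothing is injected into the pods themselves.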
The tradeoff is maturity. Sidecar mode has years of production hardening. Ambient mode reached general availability in Istio 1.24, and tooling is still catching up. For new deployments, ambient is the recommended starting point. For existing meshes with complex L7 policies, sidecar remains the safer choice.
Istio resources: VirtualService and DestinationRule
Istio extends the Kubernetes API with custom resources that control traffic behavior. Two resources form the core of traffic management.
A VirtualService defines how requests are routed to a service. It matches on URI paths, headers, or other attributes and directs traffic to specific subsets or versions.
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: reviews-routing
  namespace: bookinfo
spec:
  hosts:
  - reviews
  http:
  - match:
    - headers:
        end-user:
          exact: testing
    route:
    - destination:
        host: reviews
        subset: v3
  - route:
    - destination:
        host: reviews
        subset: v2
      weight: 90
    - destination:
        host: reviews
        subset: v3
      weight: 10
This routes requests with the end-user: testing header to v3. All other traffic splits 90/10 between v2 and v3. Canary deployments become a YAML change, not a code change.
A DestinationRule configures what happens after routing. It defines subsets (versions), load balancing algorithms, connection pool limits, and circuit breakers.
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: reviews-destination
  namespace: bookinfo
spec:
  host: reviews
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: DEFAULT
        maxRequestsPerConnection: 10
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 30s
      baseEjectionTime: 60s
      maxEjectionPercent: 50
  subsets:
  - name: v2
    labels:
      version: v2
  - name: v3
    labels:
      version: v3
    trafficPolicy:
      loadBalancer:
        simple: ROUND_ROBIN
The outlierDetection block is the circuit breaker. If a pod returns 3 consecutive 5xx errors, the mesh ejects it from the load balancing pool for 60 seconds; the pool is re-evaluated every 30 seconds, and at most 50% of endpoints can be ejected at once. No client-side library needed.
Enforcing mTLS with PeerAuthentication
Istio can run in permissive mode (accept both plaintext and mTLS) or strict mode (reject plaintext). In production, you want strict. A PeerAuthentication resource enforces this.
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: bookinfo
spec:
  mtls:
    mode: STRICT
Apply this to the namespace and every service in bookinfo requires mTLS. Services outside the mesh cannot reach these workloads over plaintext. Roll this out namespace by namespace. Start with PERMISSIVE to verify all callers are in the mesh, then switch to STRICT.
For a mesh-wide default, apply the PeerAuthentication resource in the istio-system namespace. Per-namespace and per-workload policies override the mesh-wide default, giving you fine-grained control during migration.
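The mesh-wide variant is the same resource, moved to the root namespace:

```yaml
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system  # Istio's root namespace makes this the mesh-wide default
spec:
  mtls:
    mode: STRICT
```

Any namespace that is not yet ready for STRICT can carry its own PeerAuthentication with mode PERMISSIVE until its callers are migrated into the mesh.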
When a service mesh is overkill
A mesh is not free. It adds operational complexity, resource overhead, and a new failure domain. The control plane (istiod) needs to be monitored and upgraded. Custom resources need to be understood by the team. Debugging network issues now involves checking proxy configuration, not just Kubernetes services.
Skip the mesh if:
- You run fewer than 10 services. NetworkPolicy and application-level TLS cover most needs.
- Your team does not have dedicated platform engineers. The learning curve is steep.
- You do not need L7 traffic control. If all you want is encryption, consider a CNI plugin with WireGuard (like Cilium) instead.
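For the small-cluster case, plain NetworkPolicy often covers the segmentation need without a mesh. A minimal sketch (the `frontend` and `backend` labels and the `shop` namespace are hypothetical):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-only
  namespace: shop
spec:
  podSelector:
    matchLabels:
      app: backend        # policy applies to backend pods
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend   # only frontend pods may connect
```

This is L3/L4 only, with no encryption or identity, but it enforces who-can-talk-to-whom with zero data-plane overhead.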
Use a mesh when:
- You need fine-grained traffic shifting for canary or blue-green deployments across many services.
- Compliance requires mTLS everywhere and you cannot modify application code.
- You want consistent observability (golden metrics, distributed tracing) without instrumenting each service individually.
- You operate 50+ services and need circuit breaking, retries, and timeouts managed at the infrastructure layer.
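The retries-and-timeouts case from the last point is, like canary routing, a VirtualService setting. A sketch (the `orders` service name is hypothetical):

```yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: orders-resilience
spec:
  hosts:
  - orders
  http:
  - route:
    - destination:
        host: orders
    timeout: 5s                        # overall budget per request, retries included
    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: 5xx,connect-failure     # retry on server errors and failed connections
```

Every client of the service gets this behavior without shipping a retry library in each language the organization uses.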
The decision comes down to operational capacity. A mesh solves real problems, but only if your team can operate it. Start with ambient mode to minimize the resource footprint and complexity. Add waypoint proxies only for services that need L7 policies.
What comes next
This article wraps up the core Kubernetes orchestration series. Across 14 posts, you have built up from pod basics through workloads, networking, storage, security, Helm, GitOps, cluster operations, and now service mesh. These are the building blocks.
The next step is applying these concepts together in production. That means designing cluster topologies for multi-team organizations, building deployment pipelines that use the traffic management patterns covered here, and establishing the observability stack that ties metrics, logs, and traces into a single operational view. The orchestration fundamentals are in place. Now you put them to work.