system-design.io

Service Mesh Concepts: Istio and Linkerd

The bridge: why this matters

Imagine a busy airport at peak hour. Every plane is its own team, every gate is a service endpoint, and every takeoff or landing is a network call. If each pilot had to negotiate runway access, weather checks, collision avoidance, and radio encryption on their own, the airport would collapse into confusion.

That is what microservices look like without a service mesh.

In a monolith, one function calls another and the runtime does the hard part for free. In microservices, that same call becomes a network request, and networks are unreliable, slow, and full of failure modes. A service mesh exists to take the messy, repetitive, high-risk parts of service-to-service communication and move them into infrastructure.

Istio and Linkerd are the two names most often discussed in this space. They solve the same general problem, but they do it with different philosophies. Istio emphasizes breadth and control; Linkerd emphasizes simplicity and efficiency.

The broken state before meshes

Before we get to the solution, we should understand the pain.

Microservices promised independent deployment, flexible scaling, and team autonomy. But they also introduced a new burden: every service now had to deal with retries, timeouts, encryption, load balancing, circuit breaking, tracing, and request routing. Worse, each team implemented these concerns differently, often in different languages and with different levels of correctness.

That creates a few classic failures:

One service retries too aggressively and amplifies an outage.
Another service never times out and ties up resources forever.
A third service sends plaintext traffic because someone forgot to configure TLS.
A fourth service is impossible to debug because no one can trace the request path.

A service mesh solves this by moving those concerns out of application code and into a shared networking layer.

What a mesh actually is

A service mesh is an infrastructure layer for service-to-service communication. It usually consists of:

A data plane, made up of proxies that intercept traffic.
A control plane, which configures those proxies centrally.
Policy, telemetry, and security features layered on top.

Here is the simplest mental model:

+------------------+       +------------------+
|   Service A      |       |   Service B      |
|   App Code       |       |   App Code       |
+------------------+       +------------------+
|   Sidecar Proxy  |<----->|   Sidecar Proxy  |
+------------------+       +------------------+
         ^                          ^
         |                          |
         +-------- Control Plane ----+

In this picture, the application does not know how traffic is routed, encrypted, or observed. The proxy does.

That is the key conceptual shift.

The atomic unit: the sidecar proxy

The basic building block of the classic service mesh is the sidecar. A sidecar is a small proxy process that lives next to your application container, usually in the same pod if you are using Kubernetes.

We should pause here because this is the most important architectural move in the whole story.

Before the sidecar, the application had to know how to do distributed systems work. After the sidecar, the application can mostly act like a normal app again, while the proxy handles network behavior around it.

Why sidecars exist

The sidecar solves a very practical problem: how do we make cross-cutting networking behavior consistent across many services and many languages?

The answer is to standardize the network boundary. Every request enters and exits through a proxy. That proxy can enforce policies, emit metrics, retry failures, and terminate TLS consistently.

+---------------------------------------------+
| Pod                                         |
|                                             |
|  +-----------+   +----------------------+   |
|  |  App      |   |     Sidecar Proxy    |   |
|  |  Service   |<->|  traffic interceptor |   |
|  +-----------+   +----------------------+   |
|                                             |
+---------------------------------------------+

Notice what we are not doing. We are not modifying the application to learn mesh behavior. We are not binding ourselves to a specific programming language. We are simply changing the traffic path.

Two planes: control and data

Every service mesh has two conceptual halves.

Data plane

This is the part that actually handles traffic. In Istio, the data plane is usually Envoy proxies. In Linkerd, it is the Linkerd proxy written in Rust.

The data plane is where live requests flow. If Service A calls Service B, the request typically passes through A’s proxy and B’s proxy.

Control plane

The control plane is the brain. It watches the environment, computes policy, and pushes configuration to the proxies. It does not sit in the request path.

That distinction matters. If the control plane goes down, the data plane usually keeps working with the last known config. This separation improves resilience and makes meshes operationally sane.

                +---------------------------+
                |      CONTROL PLANE        |
                |  policy, identity, config |
                +---------------------------+
                  |        |         |
                  v        v         v
               +------+ +------+ +------+
               | Pxy1 | | Pxy2 | | Pxy3 |
               +------+ +------+ +------+
                  |        |         |
               Service A Service B Service C

The core interview insight here is simple: the mesh is not a new application framework. It is a network governance layer.

Istio: the feature-rich mesh

Istio is the heavyweight option. It was created by Google, IBM, and Lyft and uses Envoy as its data-plane proxy. It is powerful, widely used, and designed for teams that need deep traffic control and advanced policy enforcement.

Why Istio became important

Istio emerged because large microservice systems needed fine-grained traffic management and security that plain Kubernetes could not provide. It became especially attractive in organizations that wanted sophisticated routing, traffic shaping, and zero-trust enforcement.

Earlier Istio releases had several separate control-plane components. Modern Istio consolidates much of that into istiod, which simplified operations significantly. That simplification is worth remembering because it reflects a broader theme in platform engineering: powerful systems tend to accumulate complexity unless someone actively refactors them.

The Envoy foundation

Istio’s data plane is built on Envoy. Envoy is a general-purpose high-performance proxy with dynamic configuration support via xDS. That means Istio can push new routes, policies, and endpoints to proxies without restarting workloads.

For interview purposes, the key point is this: Istio is expressive because Envoy is expressive.

Istio’s major capabilities

Istio is usually discussed in terms of four major capability areas:

Traffic management
Security
Observability
Policy and extensibility

Traffic management

Istio can split traffic, route by headers, support canary releases, mirror requests, inject faults, and enforce retries and timeouts.

That means you can do things like:

Send 90% of traffic to version 1 and 10% to version 2.
Route all requests from internal users to a beta backend.
Simulate latency to test resilience.
Eject failing endpoints from load balancing.

The key resources are:

VirtualService: defines routing behavior.
DestinationRule: defines policies applied to destinations.
Gateway: manages ingress and egress at the mesh edge.
ServiceEntry: registers external services into mesh policy.

Security

Istio supports mutual TLS, which means both sides of a connection prove their identity. This is a zero-trust model: every service must authenticate, even if it is inside the cluster.

Istio’s control plane can issue and rotate certificates automatically. In practice, that means you can encrypt internal service traffic without making each application manage certificates on its own.

Observability

Istio proxies can emit:

Metrics
Access logs
Traces

That gives operators visibility into request rates, error rates, and latency across the mesh. In large systems, this is often the difference between guessing and knowing.

Linkerd: the simplicity-first mesh

Linkerd takes a different path. It focuses on being lightweight, easy to operate, and narrow in scope. The modern Linkerd 2.x architecture uses a custom proxy written in Rust rather than Envoy.

That choice is not accidental. It reflects a philosophy: if your goal is a service mesh, do not build a general-purpose proxy and then adapt it. Build a proxy that exists only to serve the mesh.

Why Rust matters

Linkerd’s proxy is written in Rust for memory safety, predictable latency, and a smaller resource footprint. That makes it attractive for teams that care about per-pod overhead and operational simplicity.

The strategic idea is this:

Envoy is broad and configurable.
Linkerd proxy is narrow and optimized.

That makes Linkerd easier to reason about in many production environments.

Linkerd control plane

Linkerd’s control plane is intentionally compact. It usually includes:

Destination: service discovery and routing policy.
Identity: certificate issuance and workload identity.
Proxy injector: automatic sidecar injection.
Viz: observability tooling.

This is much easier to explain in interviews than a long list of specialized features. Linkerd is not trying to be a Swiss Army knife. It is trying to be the cleanest possible mesh for common service-to-service needs.

Automatic mTLS

Like Istio, Linkerd provides automatic mutual TLS. The important difference is that Linkerd emphasizes zero-config behavior. Once a namespace is meshed, traffic is encrypted and authenticated with little extra ceremony.

ServiceProfiles and retry budgets

Linkerd’s policy model is smaller than Istio’s. One of its key primitives is the ServiceProfile, which lets you define route-level behavior such as retries and timeouts.

A particularly interesting Linkerd concept is the retry budget. Instead of allowing unlimited retries or a simple fixed number of retries, Linkerd uses a budget model to avoid retry storms. That is a subtle but important design choice because uncontrolled retries can turn a small outage into a larger one.

What the mesh actually gives us

Now that we understand both systems, let’s organize the core benefits by intent rather than by product feature.

Reliability

Service meshes help keep failures contained.

They do this with:

Timeouts.
Retries.
Circuit breaking.
Load balancing.
Outlier detection.

These features do not make failures disappear. They make failures local, visible, and survivable.

Security

Meshes help enforce zero-trust networking.

They do this with:

mTLS.
Certificate rotation.
Service identity.
Authorization policies.

This means internal traffic no longer has to be trusted by default. That is a major security improvement for modern distributed systems.

Observability

Meshes help you understand what is happening between services.

They do this with:

Metrics.
Distributed tracing.
Access logs.
Request-level telemetry.

This is especially useful in systems where the problem is not “Is the service alive?” but “Which downstream dependency caused the latency spike?”

Core patterns: the interview lens

Let’s look at a few core patterns one by one.

Traffic splitting

Traffic splitting allows you to send a controlled percentage of requests to different versions of a service. This is the backbone of canary releases and A/B testing.

In practice, this helps teams validate new deployments safely. Instead of flipping the whole system at once, you expose a small subset of traffic to the new version and watch the results.

Retries and timeouts

Retries help with transient failure. Timeouts prevent a request from waiting forever.

The danger is overusing retries. If you retry everything too aggressively, you can overload a struggling downstream service. That is why retry budgets and circuit breakers matter.

Circuit breaking

The circuit breaker pattern borrows from electrical systems. If a downstream service is unhealthy, the breaker opens and stops traffic from piling up on a failing component.

That gives the downstream system time to recover while keeping the rest of the system stable.

CLOSED -> OPEN -> HALF-OPEN -> CLOSED
   |         |          |           |
   |         |          +-- probe --+
   +-- normal requests

The half-open state is especially important. It lets the system probe recovery without immediately flooding the dependency again.

A practical comparison

Both Istio and Linkerd are service meshes, but they are built for different kinds of teams.

Area	Istio	Linkerd
Proxy	Envoy	Rust proxy
Main strength	Feature richness	Simplicity
Configuration surface	Large	Small
Operational style	Deep control	Low friction
Best for	Complex enterprise routing	Teams wanting ease and low overhead
Learning curve	Steeper	Gentler

Istio is the better choice when you need advanced routing, traffic engineering, or highly customized policy. Linkerd is often the better choice when you want a mesh that is fast to understand and cheap to operate.

This is not about “which is better” in the abstract. It is about which trade-off your team is making.

Modern evolution: ambient mesh

Sidecars solve a real problem, but they also create real overhead. Every pod gets a proxy. That means more CPU, more memory, more complexity, and more rollout surface area.

Istio’s ambient mesh was created to address that pain. In ambient mode, Istio splits responsibilities between node-level and service-level components rather than placing a proxy inside every pod.

The new shape

Ambient mesh introduces two important ideas:

ztunnel: a node-level component that handles L4 security and traffic interception.
Waypoint proxies: optional L7 proxies for services that need richer policy and routing.

That means you can get many of the benefits of a mesh without paying the full sidecar tax for every workload.

Node
+---------------------------------------------+
| Pod A      Pod B      Pod C                 |
|  |           |           |                   |
|  +-----------+-----------+-------------------+
|              ztunnel                         |
+---------------------------------------------+

Optional per-service waypoint for L7 features

The architectural lesson is broader than Istio itself: platform teams constantly search for ways to keep the benefits of abstraction while reducing the cost of applying it everywhere.

Interview answers you should know

What problem does a service mesh solve?

A service mesh centralizes service-to-service networking concerns such as routing, retries, timeouts, mTLS, and observability. It removes duplicated logic from application code and applies policies consistently across services.

What is the difference between the control plane and the data plane?

The data plane handles live traffic. The control plane computes configuration and pushes it to the proxies. The control plane does not sit on the request path.

Why use a sidecar?

A sidecar lets you attach mesh behavior to a service without changing the service’s code. This makes the mesh language-agnostic and easy to standardize across teams.

How does mTLS work in a mesh?

Each service gets a certificate from the mesh’s identity system. Proxies authenticate each other before traffic is exchanged. That gives you encryption plus workload identity.

When would you choose Istio over Linkerd?

Choose Istio when you need deep traffic policy, advanced routing, broad extensibility, and enterprise-scale control. Choose Linkerd when you want a smaller operational footprint and simpler day-two operations.

The mental model to keep

If you remember only one thing, remember this:

A service mesh is not about adding magic to microservices. It is about separating application logic from network logic.

That separation is what makes the architecture powerful. It allows teams to standardize retries, encryption, routing, and observability without forcing every developer to become a networking expert.

Istio and Linkerd represent two different answers to the same question:

How much control do we need?
How much complexity can we afford?

Istio answers: as much as needed.

Linkerd answers: as little as possible.

That trade-off is the heart of the topic.