A cluster is a group of independent computers that cooperate so closely that, from the outside, they look like a single system. The motivation is older than the term. Mainframes already gave operators a way to share one large machine across many workloads, but mainframes were expensive and did not scale beyond what one vendor could build into a single box. The first real shift came in 1994, when Thomas Sterling and Donald Becker built Beowulf at NASA: a cluster of commodity PCs running Linux, connected by ordinary Ethernet, that ran scientific code at a fraction of the cost of a supercomputer. The idea that you could buy capacity in small pieces and stitch it together with software took hold from there.
The web era pushed cluster design in a different direction. Search engines, social networks, and online shops needed to serve millions of concurrent users with low latency and almost no downtime. The driving question was no longer “how do I do one big computation faster,” but “how do I keep a service available when individual machines, racks, and even data centers fail.” That question shapes most of what follows: how the network is built, how machines are scheduled, how state is replicated, and how requests are spread across servers.
The most recent shift comes from machine learning. Training a large model is one enormous computation: thousands of GPUs working on the same gradient update, exchanging tens of gigabits per second of intermediate data on every step. The constraints that dominate are GPU-to-GPU bandwidth inside a server, high-speed lossless networking between servers, and tight scheduling so that all participants in a step start and finish together. AI clusters look more like supercomputers than like web fleets, and several cluster-design choices discussed below exist because of them.
The commodity hardware bet
Most of what follows depends on a design choice Google made early and described publicly in a 2003 paper by Luiz Barroso, Jeffrey Dean, and Urs Hölzle, titled Web Search for a Planet: The Google Cluster Architecture. The choice was to build the data center out of large numbers of ordinary PCs rather than a smaller number of expensive servers.
The reasoning was economic. The fastest available CPU costs much more than the second-fastest CPU, and the cost premium grows faster than the speed advantage. Two slower machines usually do more total work per dollar than one faster machine. The same is true for disks and memory. If the workload can be split across many machines (as a search index or a web service can), then performance per dollar and performance per watt become the figures of merit, not single-machine peak performance.
This choice has a consequence that shapes cluster design: at any given moment, something is broken. With tens of thousands of commodity machines, multiple disk failures per day are normal and one or more dead machines is the steady state. There is no point in paying extra for enterprise-grade components to make hardware failures rare; failures will still happen, and the software has to handle them. The right place to invest is in software that detects failures, masks them from users, and reschedules work onto healthy machines. Every system covered below, including Borg, Kubernetes, storage clusters, and load balancers, assumes failure is normal and is built to keep working through it.
A taxonomy of clusters
Different clusters exist for different reasons, and the design choices follow from the goal. Five categories appear repeatedly.
A high availability cluster (commonly written HA) is built to mask machine failures from clients. A primary node serves requests; a standby node monitors it and takes over if the primary fails. Storage is shared or replicated so the standby has the data it needs. The system’s value is measured in nines of uptime, not throughput.
A high performance computing cluster (commonly written HPC) is built to make a single large computation finish faster. Scientific simulations, weather forecasting, fluid dynamics, and large-scale machine learning all fit here. Nodes cooperate tightly through a fast interconnect, often using MPI to exchange messages, and the workload is usually batch: submit a job, wait for it to run to completion, collect the output.
A load-balancing cluster fronts a fleet of identical servers with a dispatcher that spreads incoming requests. Web servers, application servers, and API gateways are the typical members. Throughput grows roughly linearly with the size of the fleet, and the cluster keeps serving even if some members die.
A storage cluster pools disk capacity from many machines into a single namespace. The data is replicated or erasure-coded so that disk and machine failures do not cause data loss. Hadoop’s HDFS, Ceph, and Amazon S3 are examples at very different scales.
A scheduling cluster, sometimes called a batch or compute cluster, treats the whole fleet as a pool of CPU, memory, disk, and network and accepts work submissions from many users at once. The scheduler decides which job runs on which machine. Borg, Mesos, YARN, and Kubernetes all fit this description.
These categories overlap. A modern production cluster usually combines several of them: it is a scheduling cluster running services that are themselves load balanced, on top of a storage cluster, with HA properties built into the scheduler.
The single system image
A single system image is the illusion that many machines behave like one logical system. A user or operator sees one logical resource even though many physical machines back it. A cluster aims for this, but never reaches it perfectly. There are degrees:
- A user submitting a job through kubectl apply does not pick a machine. The control plane does. From the user’s perspective there is one system, even though dozens of nodes are running pieces of the application.
- A client opening https://example.com does not pick a server. DNS, anycast, and a load balancer steer the connection to one of many backends. From the client’s perspective there is one server.
- An administrator looking at a cluster’s storage pool sees one namespace and one quota, not the dozens of disks holding the bytes.
The job of cluster software is to maintain this illusion while individual machines come and go.
Networks inside a cluster
The performance and shape of a cluster depend on its network. Two facts dominate the design. Server-to-server traffic inside the cluster is now much larger than the traffic between the cluster and the outside world. And the network needs to give any pair of servers a fast path, not just servers that happen to be in the same rack.
Top-of-rack switching
Servers are mounted in racks, with roughly forty machines per rack. Each rack has one or two top-of-rack (ToR) switches, with short copper cables running down to each server. The ToR switch is the only thing a server’s network traffic touches before it leaves the rack.
This works well for traffic that stays in the rack, but a single ToR switch cannot connect every server in the data center. It needs to uplink to something larger.
Spine-leaf (Clos) fabrics
Modern clusters use a two-level switching topology called a spine-leaf or Clos fabric. The ToR switches at the top of every rack are the leaf switches. Above them, a smaller number of spine switches connect every leaf to every other leaf. Every leaf has a link to every spine, so a packet between two racks always crosses exactly three switches: the source leaf, one spine, and the destination leaf.
The benefit is bisection bandwidth: the minimum bandwidth across any cut that splits the network into two equal halves. It is the worst-case capacity available for traffic between the two halves, so it bounds how well the network handles all-to-all communication. In a spine-leaf design with a full mesh between leaves and spines, bisection bandwidth scales with the number of spines. Adding more spines adds more bandwidth between any two racks without rewiring the leaves.
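A back-of-the-envelope calculation makes the scaling concrete. The fabric size and link speed below are illustrative assumptions chosen for round numbers, not a description of any particular deployment:

```python
# Back-of-the-envelope bisection bandwidth for a spine-leaf fabric.
# All numbers are illustrative assumptions, not a specific product.
leaves = 32        # one leaf (ToR) switch per rack
spines = 8         # every leaf has one uplink to every spine
link_gbps = 400    # speed of each leaf-to-spine link

# Cut the fabric into two halves of 16 racks each. Every inter-rack
# path is leaf -> spine -> leaf, so the cut is crossed by the uplinks
# of one half: 16 leaves x 8 spines x 400 Gbps.
bisection_gbps = (leaves // 2) * spines * link_gbps
print(bisection_gbps, "Gbps")   # 51200 Gbps, i.e. 51.2 Tbps

# Adding one more spine adds (leaves // 2) * link_gbps = 6.4 Tbps of
# bisection bandwidth without touching the leaf wiring.
```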
Two design properties follow. First, every spine offers an equally good path between any two leaves, so the network can spread different flows across different spines and avoid hot spots. Second, the topology is symmetric, which means a workload’s performance does not depend on where the scheduler happens to place it.
East-west versus north-south traffic
When a web user loads a page, the request enters the data center from the outside, travels to a front-end server, and the response leaves. This is north-south traffic. North-south used to dominate cluster networks.
Now consider what happens behind that front-end. It calls a search service, which calls a ranking service, which calls a feature store, which calls a cache, which calls a database. Every one of those calls is between two servers inside the cluster, not between a server and the outside world. This is east-west traffic. In a typical request fanout, one inbound HTTP request can cause hundreds of internal calls. East-west traffic has grown to dominate cluster bandwidth, and spine-leaf fabrics are designed for exactly this pattern.
High-speed interconnects
Standard Ethernet at one or ten gigabits is too slow for the inner loop of HPC and large machine learning workloads. Several technologies push the envelope.
NIC offloads move selected packet-processing work from the CPU to the network interface card. Common examples are checksum offload, TCP segmentation offload, receive-side scaling, and large receive offload. These features reduce CPU overhead without replacing the operating system’s TCP/IP stack.
Remote direct memory access, or RDMA, lets one machine read or write a region of another machine’s memory without involving the CPU, the operating system, or even the TCP/IP stack on the remote side. The NIC delivers data directly into the destination buffer in user memory. The latency drops to a few microseconds, and the CPU cost is close to zero. Originally developed for InfiniBand, RDMA is now also available over Ethernet.
InfiniBand is a separate networking technology, designed from the start for low-latency, high-throughput cluster communication. It supports RDMA natively, link speeds up to 400 Gbps and beyond, and very small switching latency. It is widely deployed in HPC and AI training clusters.
RoCE (RDMA over Converged Ethernet) brings RDMA semantics to Ethernet. It lets data center operators get RDMA-style low latency and low CPU overhead without building a separate InfiniBand network. The complication is that ordinary Ethernet is lossy: switches under congestion may drop packets. Packet loss is especially damaging for RDMA because recovery happens below the application and can destroy the latency assumptions that made RDMA attractive in the first place.
RoCE deployments therefore configure the fabric to behave as a nearly lossless network. The usual mechanism is Priority Flow Control (PFC), often combined with congestion-management techniques such as ECN. PFC lets switches pause selected traffic classes instead of dropping packets when buffers fill. That improves RDMA behavior but adds operational complexity: a badly tuned lossless fabric can create head-of-line blocking or congestion spreading.
NVLink and NVSwitch are scale-up GPU interconnects. They connect GPUs with much higher bandwidth and lower latency than PCIe. In many systems this means GPUs inside one server. In newer rack-scale systems, NVLink Switch extends the same idea across multiple boards or nodes inside a tightly coupled GPU rack.
This is still different from the ordinary data center network. Ethernet or InfiniBand is the scale-out network that connects many servers across racks. NVLink and NVSwitch are the high-bandwidth fabric used to make a group of GPUs behave more like one tightly coupled machine.
For top-of-rack and spine links, modern data centers run 400 Gbps or 800 Gbps Ethernet. These speeds were originally driven by HPC and AI workloads but are now common at the spine and ToR layers in general-purpose clusters as well.
High availability and failure handling
A cluster’s ability to mask failures depends on three problems being solved well: detecting failures quickly, transferring work to a survivor, and preventing the failed machine from causing damage.
Failure detection
The most basic detector is a heartbeat. Each machine sends a small message at a fixed interval to one or more peers. If a peer sees no heartbeat for some number of intervals, it presumes the sender dead. Tuning is delicate. A short timeout catches failures quickly but produces false positives when a network blip delays a few messages. A long timeout avoids false positives but lets failed machines hold work for too long.
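The basic detector fits in a few lines. A minimal sketch, with hypothetical peer names and the transport omitted; the two constants are exactly the tuning knobs just described:

```python
# A minimal sketch of heartbeat-based failure detection.
import time

HEARTBEAT_INTERVAL = 1.0   # seconds between heartbeats
SUSPECT_AFTER = 3          # missed intervals before declaring death

last_seen: dict[str, float] = {}   # peer -> time of last heartbeat

def on_heartbeat(peer: str) -> None:
    """Called whenever a heartbeat message arrives from a peer."""
    last_seen[peer] = time.monotonic()

def check_peers() -> list[str]:
    """Return peers that have missed too many heartbeats."""
    now = time.monotonic()
    deadline = SUSPECT_AFTER * HEARTBEAT_INTERVAL
    return [p for p, t in last_seen.items() if now - t > deadline]

# A monitoring loop calls check_peers() once per interval and hands
# suspected-dead peers to the failover machinery. A shorter deadline
# detects real failures faster but turns brief network blips into
# false positives: exactly the trade-off described above.
```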
Real systems use multiple checks. They check the network at several levels (link, IP, application protocol), they consult more than one peer, and they sometimes check shared storage to confirm that the suspected machine has truly stopped writing.
Failover and fencing
When a primary fails, a standby takes over. This is failover. The mechanics depend on what is being failed over: a database may need to replay its log; a service may need to claim a virtual IP; a stateful service may need a leader election to pick the new primary.
The hardest problem in failover is fencing. If the standby starts serving while the suspected-dead primary is still alive (perhaps the network failed but the primary kept running), both nodes will write to shared state and corrupt it. Fencing forcibly stops the suspected primary, by cutting its power, by disabling its storage access, or by revoking its lease, before the standby is allowed to take over.
Quorum and split brain
Recall from the lecture on consensus that a system using Paxos or Raft requires a majority of nodes (a quorum) to agree before committing any change. The same idea protects clusters from split brain: a network partition that leaves two halves of a cluster each believing it has lost contact with the other and that it should take over.
Without quorum, both halves of a partition might elect themselves leader. Both would accept writes. Reconciling the divergent state afterward is, in general, impossible. Quorum prevents this by requiring that any leader command the support of more than half the cluster. The minority side cannot make progress, so it cannot do harm.
This is why many cluster components, including etcd in Kubernetes and the Borgmaster in Borg, are run with an odd number of replicas, usually three or five. An odd number gives a clean majority and avoids ties.
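The arithmetic behind the odd replica counts is short enough to sketch:

```python
# The arithmetic behind "three or five replicas": a strict majority
# and the number of failures it leaves room for.
def quorum(n: int) -> int:
    return n // 2 + 1                 # smallest strict majority of n

for n in (3, 4, 5):
    print(f"{n} replicas: quorum {quorum(n)}, tolerates {n - quorum(n)}")

# 3 replicas: quorum 2, tolerates 1
# 4 replicas: quorum 3, tolerates 1   <- the 4th replica adds nothing
# 5 replicas: quorum 3, tolerates 2
```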
Cluster scheduling: Borg
Borg is Google’s cluster management system. It was started in the early 2000s and described publicly in a 2015 paper by Verma, Pedrosa, Korupolu, Oppenheimer, Tune, and Wilkes. Before Borg, every team at Google ran its own machines, and utilization was low because each team over-provisioned for its peak. Borg’s premise was the opposite: pool the entire data center, multiplex workloads on shared machines, and let a central scheduler decide what runs where.
Borg directly influenced Kubernetes. It is also useful to read Borg alongside systems such as Mesos, YARN, and Nomad, because they solve the same basic problem: how to share a large pool of machines among many workloads. Much of the same vocabulary and many of the same trade-offs appear in modern cluster managers.
A cell and its parts
Borg organizes machines into a cell: on the order of ten thousand machines in a single data center, managed as one resource pool. A user submits work to a cell and the cell decides where to place it.
Each cell has a single logical Borgmaster, which holds the cell’s state, accepts work from users, and decides where each task should run. It is replicated across five machines using Paxos, so it survives machine failures and network partitions. One replica is the elected leader; the others stand by.
Every machine in the cell runs a borglet, a local agent that starts and stops tasks, monitors their resource usage, and reports machine status to the Borgmaster. The borglet is what actually executes work.
Between the Borgmaster and the borglets are link shards, which fan out updates. Talking to ten thousand borglets directly from one process would saturate the master’s network and CPU, so the master speaks to a small number of link shards, and each link shard speaks to a slice of the borglets. This keeps the master from being a network bottleneck.
Jobs, tasks, and allocs
A user submits a job, which is a collection of identical tasks. A task is one process or process group running inside a container. A typical job description specifies how many tasks to run, a binary, a command line, and resource requests for each task: CPU cores, memory, disk, and network bandwidth.
An alloc (short for allocation) is a reserved slice of a machine’s resources in which one or more related tasks can run together. The reservation outlives any individual task: if a task crashes and is restarted, it comes back into the same alloc and reuses its local disk. Allocs also let a primary task run alongside small helper tasks (a logger, a sidecar proxy) inside one shared resource envelope on the same machine. The Kubernetes pod is the direct descendant of this idea.
An alloc set groups allocs across many machines, in the same way that a job groups tasks. This lets a service treat its reserved capacity as a unit.
Priority bands and quota
Borg supports a wide range of workloads on the same machines. A latency-sensitive web server and a nightly batch analytics job have very different needs, and Borg keeps them out of each other’s way using priorities and quotas.
Each task has an integer priority. Borg groups priorities into bands, with names roughly equivalent to monitoring, production, batch, and best-effort. Higher-priority tasks may preempt lower-priority tasks: when a production task needs resources that are currently held by a batch task, the batch task is killed and rescheduled elsewhere.
Quota is the orthogonal control. Each user (or team) has a quota for how much of each priority band they may consume. Quota prevents one team from filling the whole cell with high-priority work, even if their tasks happen to win every scheduling decision. Priority decides who wins a contest for resources. Quota decides who is allowed to enter the contest in the first place.
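A minimal sketch of how the two controls divide the work. The band names mirror Borg’s rough bands, but the users, numbers, and data structures are illustrative, not Borg’s:

```python
from dataclasses import dataclass

@dataclass
class Task:
    user: str
    band: str      # "monitoring", "production", "batch", "best-effort"
    cpus: float

# Per-user, per-band quota in CPU cores; numbers are made up.
quota = {("ads", "production"): 100.0, ("ads", "batch"): 500.0}
usage: dict[tuple[str, str], float] = {}

def admit(task: Task) -> bool:
    """Quota decides who may enter the contest at all."""
    key = (task.user, task.band)
    used = usage.get(key, 0.0)
    if used + task.cpus > quota.get(key, 0.0):
        return False               # over quota: rejected at submission
    usage[key] = used + task.cpus
    return True

BAND = {"monitoring": 3, "production": 2, "batch": 1, "best-effort": 0}

def may_preempt(incoming: Task, victim: Task) -> bool:
    """Priority decides who wins once both are in the cell."""
    return BAND[incoming.band] > BAND[victim.band]

assert admit(Task("ads", "production", 4.0))   # within quota
```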
How scheduling works
When the Borgmaster has a pending task, the scheduler picks a machine for it in two steps. Feasibility checking filters out machines that cannot run the task: the machine lacks the requested CPU or memory, the machine does not have a needed special device, the task is forbidden from running there for policy reasons, and so on. Scoring then ranks the remaining machines and picks the best one.
Scoring is where Borg’s tuning lives. A naive score would just pick the least loaded machine, but that wastes capacity: spreading every task evenly leaves every machine with half-full resources. Borg uses a hybrid score that combines several signals: leftover capacity, the cost of preempting other tasks, and a model of how much of a machine each task is likely to actually use. The scheduler favors placements that leave large contiguous holes for future large tasks rather than fragmenting the cluster.
When no machine has free room for a high-priority task, Borg may preempt lower-priority tasks to make space. The preempted tasks go back to the pending queue and will be rescheduled.
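The two-step structure is easy to sketch. The following is a simplified model, not Borg’s actual scoring function; the best-fit-style score merely illustrates the idea of preserving large holes:

```python
from dataclasses import dataclass

@dataclass
class Machine:
    name: str
    free_cpus: float
    free_mem_gb: float

@dataclass
class TaskReq:
    cpus: float
    mem_gb: float

def feasible(m: Machine, t: TaskReq) -> bool:
    """Step 1: filter out machines that cannot run the task at all."""
    return m.free_cpus >= t.cpus and m.free_mem_gb >= t.mem_gb

def score(m: Machine, t: TaskReq) -> float:
    """Step 2: rank survivors. Best-fit flavor: a tighter fit scores
    higher, which packs tasks and leaves large holes elsewhere."""
    leftover = (m.free_cpus - t.cpus) + (m.free_mem_gb - t.mem_gb)
    return -leftover

def place(machines: list[Machine], t: TaskReq) -> Machine | None:
    candidates = [m for m in machines if feasible(m, t)]
    if not candidates:
        return None    # no room anywhere: the master may preempt instead
    return max(candidates, key=lambda m: score(m, t))

pool = [Machine("m1", 8.0, 32.0), Machine("m2", 2.0, 4.0)]
print(place(pool, TaskReq(cpus=4.0, mem_gb=8.0)).name)   # m1
```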
Resource isolation
A single Borg machine can run dozens of tasks belonging to different users. Without isolation, one task could starve another of CPU, swallow all its memory, or read its files. Borg uses Linux cgroups (control groups) to enforce resource limits. Cgroups put each task into a control group and apply CPU shares, CPU quotas, memory limits, disk I/O limits, and network bandwidth limits to that group. The kernel enforces them.
Filesystem isolation comes from chroot and namespaces: each task sees a private root filesystem and cannot read or write files outside it. Network isolation comes from a private network namespace per task.
Memory has a sharper edge than CPU. CPU is time-shared, so a task that wants more CPU than its share just runs slower. Memory is space-shared, so a task that wants more memory than its limit cannot just be slowed down. The kernel enforces the memory limit by triggering the out-of-memory killer, which terminates the offending task. Borg sees the kill, reports it, and reschedules the task. From the user’s perspective, the task crashed because it exceeded its memory request.
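For a concrete feel of the kernel interface, here is a minimal sketch against the cgroup v2 filesystem. It assumes a cgroup v2 mount at /sys/fs/cgroup and root privileges; the group name and limits are illustrative, and Borg’s own containers use the same kernel mechanism rather than this exact script:

```python
import os
import subprocess

CGROUP = "/sys/fs/cgroup/demo-task"    # hypothetical group name

os.makedirs(CGROUP, exist_ok=True)

# Time-share the CPU: at most 50000 us of CPU per 100000 us period,
# i.e. half a core. A task over this limit just runs slower.
with open(os.path.join(CGROUP, "cpu.max"), "w") as f:
    f.write("50000 100000")

# Space-share memory: a hard cap of 256 MiB. A task over this limit
# is terminated by the kernel's out-of-memory killer.
with open(os.path.join(CGROUP, "memory.max"), "w") as f:
    f.write(str(256 * 1024 * 1024))

# Launch the workload, then move it into the group so the limits apply.
proc = subprocess.Popen(["sleep", "3600"])   # stand-in for a real task
with open(os.path.join(CGROUP, "cgroup.procs"), "w") as f:
    f.write(str(proc.pid))
```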
Resource reclamation
Users tend to over-request resources because they want headroom. The result is that the requested total across all tasks is much larger than the used total at any moment. Borg measures the gap and reclaims it. The reclaimed capacity is offered to lower-priority tasks (typically batch and best-effort), which are willing to give it back if a higher-priority task suddenly needs the room. This raises overall utilization without degrading service for the high-priority workload.
Persistence and replication
The Borgmaster’s state (every job, every alloc, every assignment) lives in a Paxos-replicated store. Writes commit only when a majority of the five Borgmaster replicas have logged them. A machine failure on one replica does not lose any state. A failure of the leader triggers a re-election; the new leader picks up where the old one left off.
Borglets do not need replication. A borglet’s state is reconstructable: when it restarts, it scans the machine, finds the running tasks, and re-reports them to the Borgmaster. If a borglet dies, the Borgmaster eventually notices via missed heartbeats, considers the machine down, and reschedules its tasks elsewhere.
What Borg got right
Three lessons from Borg shaped everything that followed. First, declarative configuration scales better than imperative control: the user describes what should exist (run twenty tasks of this binary), and the system decides where to place it and how to recover from failures. Second, mixing latency-sensitive and batch workloads on the same machines raises utilization without harming the latency-sensitive workloads, as long as isolation and reclamation are good enough. Third, a strong central control plane with replicated state is workable at the scale of thousands of machines per cell, even though the abstract design suggests it should be a bottleneck.
Kubernetes
Kubernetes is the open-source system most directly influenced by Borg. It was started at Google in 2014 by people who had worked on Borg and Omega (an experimental successor inside Google) and was released as open source the same year. The goal was to bring Borg’s model out of Google and make it portable across cloud providers and on-premises hardware.
The declarative model
Kubernetes inherits Borg’s most important idea: the user submits a description of the desired state, and the system works to make actual state match. If you ask for ten replicas of a web server, Kubernetes will start ten and, if any die, start replacements. You never tell Kubernetes “start a new pod”; you tell it “I want ten pods, please make it so.”
This shifts the work from the user to a set of background loops called controllers. Each controller watches a piece of state and reconciles it. If desired says ten and actual says nine, the controller starts a tenth. If desired says ten and actual says eleven, it stops one. The reconciliation loop runs continuously, which makes the system self-healing: a node failure is no different from a desired-state change, because both leave actual state out of sync with desired state, and the controllers respond the same way to both.
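A reconciliation loop is short enough to sketch. The in-memory dictionaries below are toy stand-ins for cluster state; a real controller reads and writes that state through the API server and subscribes to changes rather than polling:

```python
import time

# Toy stand-ins for cluster state, keyed by deployment name.
desired = {"web": 10}                  # replica count the user asked for
running = {"web": ["pod-0", "pod-1"]}  # pods that exist right now

def reconcile(name: str) -> None:
    want = desired[name]
    have = running[name]
    while len(have) < want:
        have.append(f"pod-{len(have)}")   # start a replacement pod
    while len(have) > want:
        have.pop()                        # stop a surplus pod

while True:
    for name in desired:
        reconcile(name)
    # A node failure simply removes entries from running[...]; the same
    # loop that applies a spec change repairs the failure.
    time.sleep(5)
```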
Workload abstractions
From the application owner’s point of view, Kubernetes exposes a small set of objects.
A pod is the smallest deployable unit in Kubernetes. A pod is one or more containers that share a network namespace, an IP address, and storage volumes. Most pods have one container; sidecar pods (for logging, proxying, or service mesh data planes) have more. The pod is the direct descendant of Borg’s alloc.
A deployment describes a set of identical pods and a desired count. The deployment controller creates and destroys pods to keep the count right. Deployments also handle rolling updates: when the pod template changes, the controller starts new-version pods and tears down old-version pods at a controlled rate.
A service is a stable virtual IP and DNS name that routes to a set of pods. Pods come and go (their IP addresses change), but the service IP is stable, so clients inside the cluster can address the service without knowing which pods exist. The EndpointSlice controller keeps the service’s current set of backing pod addresses up to date.
Service discovery
A cluster also needs a way for services to find each other. A frontend should not contain a hard-coded list of backend IP addresses because pods and machines come and go. The cluster provides service discovery: a stable name maps to the current set of healthy endpoints.
Kubernetes does this with services and DNS. A client looks up a service name, such as orders.default.svc.cluster.local, and usually gets the service’s virtual IP. The cluster network then routes traffic sent to that address to one of the current pods behind the service. The application depends on a stable name; the cluster keeps the mapping current as pods are created, destroyed, or moved.
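From the client side, discovery is nothing more than a name lookup. A minimal sketch using only the standard library, assuming it runs inside a cluster where the service name above resolves and the service listens on port 8080 (an assumption for the example):

```python
import socket

# Resolve the stable service name; the answer is usually the service's
# virtual IP, not any pod's address.
addrs = socket.getaddrinfo("orders.default.svc.cluster.local", 8080,
                           type=socket.SOCK_STREAM)
ip, port = addrs[0][4][:2]

# Connect to the virtual IP; the node's forwarding rules rewrite the
# destination to one of the current pods behind the service.
with socket.create_connection((ip, port)) as conn:
    conn.sendall(b"GET /healthz HTTP/1.1\r\nHost: orders\r\n\r\n")
```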
Service discovery is one of the pieces that makes a cluster feel like a single system. Applications address logical services, not individual machines.
Control plane components
The Kubernetes control plane has a small number of components, each with a clear job.
The API server is the only component that reads or writes persistent cluster state. Every other component, including the scheduler, the controllers, and the worker nodes, reaches state by talking to the API server. The API server validates each request, applies admission policies, and persists accepted changes.
etcd is the persistent store behind the API server. It is a Raft-replicated key-value store, typically run with three or five replicas. All cluster state (every pod definition, every secret, every config map) lives in etcd. Loss of etcd is loss of the cluster’s brain. The role etcd plays in Kubernetes is the same role Chubby plays inside Google: a small, strongly consistent store that holds the configuration and coordination data the rest of the system depends on.
The scheduler watches for pods that have been created but not yet assigned to a node. For each unassigned pod, it filters nodes by feasibility (does the node have enough free CPU and memory, does it satisfy node affinity rules, are required volumes attachable) and scores the survivors, then writes the chosen node back to the pod’s record. The kubelet on the chosen node sees the assignment and starts the pod.
The controller manager is one process that runs many controllers: the deployment controller (which ensures the right number of pods exist for each deployment), the node controller (which marks nodes unhealthy when they stop reporting), the EndpointSlice controller (which keeps a service’s endpoints in sync with pod IP addresses), and others. Each controller runs the same loop: watch desired state, watch actual state, take an action to close the gap.
The cloud controller manager is the bridge to a specific cloud provider. It handles tasks that depend on the cloud, such as provisioning load balancers, attaching block storage, and reading node metadata.
Node components
Each worker node runs three pieces of software.
The kubelet is the per-node agent. It watches the API server for pods assigned to its node, asks the container runtime to start them, monitors them, and reports status back. The kubelet is to a Kubernetes node what the borglet is to a Borg machine.
The container runtime is the software that actually runs containers. Modern Kubernetes uses containerd or CRI-O, with the older Docker runtime now removed. The container runtime pulls images, sets up cgroups and namespaces, starts processes, and exposes a standard interface (the Container Runtime Interface, or CRI) to the kubelet.
The kube-proxy programs the node’s networking so that requests to a service’s virtual IP are forwarded to one of the pods backing the service. It installs forwarding rules, commonly through iptables or IPVS, so that packets sent to the service IP are rewritten and delivered to a chosen pod, with no user-space proxying on the data path.
Clusters for machine learning
Training a large language model is a workload unlike anything Borg or Kubernetes was originally designed for, and it has reshaped cluster design over the last few years. It is one enormous distributed computation: thousands of GPUs repeatedly compute partial results, exchange data, and synchronize before moving to the next step.
The dominant communication pattern is all-reduce: at the end of every training step, every GPU has computed a partial gradient, and all GPUs need to exchange and sum these gradients before the next step can start. The amount of data exchanged per step can be enormous. In data-parallel training, gradients must be combined across workers. In model-parallel training, intermediate activations or partial results may move between GPUs. Either way, communication is on the critical path: every GPU waits for the slowest participant before the next step can proceed.
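The semantics are easy to simulate in a few lines. Below is a single-process sketch of ring all-reduce, the classic bandwidth-optimal realization of the pattern; real systems run it over NVLink and RDMA through libraries such as NCCL, and the numbers here are illustrative:

```python
# A minimal single-process simulation of ring all-reduce.
N = 4
# Worker w's gradient, split into N chunks; each chunk is one float here.
grads = [[float(w + 1)] * N for w in range(N)]

# Phase 1, reduce-scatter: after N-1 steps, worker w holds the fully
# summed value of chunk (w + 1) % N.
for step in range(N - 1):
    outgoing = [grads[w][(w - step) % N] for w in range(N)]  # snapshot sends
    for w in range(N):
        grads[w][(w - step - 1) % N] += outgoing[(w - 1) % N]

# Phase 2, all-gather: circulate the finished chunks around the ring
# until every worker holds the complete summed gradient.
for step in range(N - 1):
    outgoing = [grads[w][(w + 1 - step) % N] for w in range(N)]
    for w in range(N):
        grads[w][(w - step) % N] = outgoing[(w - 1) % N]

# Every worker now holds the same sum: 1 + 2 + 3 + 4 = 10 per element.
assert all(g == [10.0] * N for g in grads)
# Each worker sent 2 * (N - 1) chunks of size 1/N of the gradient, so
# per-worker traffic is roughly 2x the gradient size regardless of N.
```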
Three consequences follow. First, the network has to be fast and engineered to avoid packet loss on the training path. RDMA over InfiniBand or RoCE is standard, and 400 or 800 Gbps links are common at the leaf and spine. Second, GPUs in the same server are connected by NVLink and NVSwitch so that intra-server bandwidth far exceeds inter-server bandwidth, and the schedule is built to keep as much communication as possible inside one server. Third, scheduling has to be gang scheduled: either all of the GPUs a job needs are available at the same time and the job runs, or none of them run. Starting half a training job and waiting for the rest is useless, because the running half cannot make progress without its peers.
Failure handling also looks different. A long training run cannot tolerate restarting from scratch when one machine fails, so the framework writes checkpoints of the model state every few minutes, and a failure restarts the job from the most recent checkpoint. The cluster scheduler treats the entire training job as a single unit that either runs or does not, rather than as independent tasks that can recover one at a time.
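A minimal sketch of the checkpoint-and-resume pattern. The training step, state contents, and checkpoint path are hypothetical stand-ins; real frameworks persist model weights and optimizer state to durable shared storage, but the control flow is the same:

```python
import json, os

CKPT = "/tmp/ckpt.json"      # stand-in for a durable shared-storage path
CKPT_EVERY = 1000            # steps between checkpoints

def train_one_step(state: dict) -> None:
    state["weights"][0] += 0.001   # hypothetical stand-in for a real step

def load_checkpoint() -> dict:
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)    # resume where the last run stopped
    return {"step": 0, "weights": [0.0]}

def save_checkpoint(state: dict) -> None:
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CKPT)          # atomic rename: never a torn checkpoint

state = load_checkpoint()
for step in range(state["step"], 100_000):
    train_one_step(state)
    state["step"] = step + 1
    if state["step"] % CKPT_EVERY == 0:
        save_checkpoint(state)
# A machine failure costs at most CKPT_EVERY steps of recomputation.
```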
Load balancing
Once a service is running on dozens or hundreds of backends, requests need to be spread across them. The component that does this is a load balancer. Its goals are distribution (no backend takes more than its share), resilience (a dead backend stops receiving requests), and sometimes locality (a request goes to a backend near the client).
Layer 4 versus layer 7
Load balancers operate at one of two layers. A layer 4 load balancer (L4) operates on TCP or UDP connections. It sees source and destination IP addresses and ports but not the request payload. L4 balancers are fast, work for any protocol that runs over TCP or UDP, and are typically used at the edge of a data center.
A layer 7 load balancer (L7) understands the application protocol, almost always HTTP or HTTPS. It can route based on URL path, HTTP header, or cookie. It can retry failed requests, enforce rate limits, and rewrite responses. To do any of this with HTTPS traffic, the balancer must be able to read the request, which means it has to perform TLS termination: the client’s TLS connection ends at the balancer, the balancer decrypts the request, makes the routing decision, and then opens a separate connection to the backend. The internal connection is sometimes unencrypted (when the back-end network is trusted) and sometimes re-encrypted with an internal certificate. Termination is what gives L7 its capabilities, and it is also why an L7 balancer holds the service’s TLS certificate and private key.
The cost is CPU: parsing HTTP and running TLS are more expensive than forwarding TCP packets. The common pattern is to put an L4 balancer at the front (to absorb traffic and distribute connections to a fleet of L7 balancers) and an L7 balancer behind it (to do TLS termination and application-aware routing).
Common algorithms
Round robin picks each backend in turn. It is easy to implement and gives an even distribution when all backends and all requests are equal. It does not adapt to backend load, so a slow backend gets the same share of requests as a fast one and falls further behind.
Least connections picks the backend with the fewest active connections. It adapts to backends of different speeds and to long-lived connections. It works well when connections are open for similar amounts of work; it works poorly when one connection might be a quick health check and another might be a long upload.
Power of two choices picks two backends at random and sends the request to the less loaded of the two. The bound on the maximum load is exponentially better than picking one at random, while the implementation is almost as cheap. Modern proxies use this algorithm or a variant.
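Minimal sketches of these three strategies side by side; the backend names and load counters are illustrative:

```python
import itertools, random

backends = ["b1", "b2", "b3", "b4"]
active = {b: 0 for b in backends}   # current connection count per backend

rr = itertools.cycle(backends)

def round_robin() -> str:
    return next(rr)                 # each backend in turn, load ignored

def least_connections() -> str:
    return min(backends, key=lambda b: active[b])

def power_of_two_choices() -> str:
    a, b = random.sample(backends, 2)   # two random candidates
    return a if active[a] <= active[b] else b
```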
Consistent hashing was discussed in the lecture on content delivery networks. The idea is to hash both backends and requests onto a ring, and to route each request to the next backend clockwise on the ring. Adding or removing a backend only moves the small fraction of requests assigned to that backend, not the whole map. Consistent hashing is the standard choice for cache layers and for session-affinity routing.
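A minimal consistent-hash ring, following the description above; production implementations add many virtual nodes per backend to smooth the distribution:

```python
import bisect, hashlib

def h(s: str) -> int:
    """Hash a string to a point on the ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, backends: list[str]):
        self.points = sorted((h(b), b) for b in backends)
        self.hashes = [p for p, _ in self.points]

    def route(self, key: str) -> str:
        # Next backend clockwise from the key's position (wraps around).
        i = bisect.bisect(self.hashes, h(key)) % len(self.points)
        return self.points[i][1]

ring = Ring(["cache1", "cache2", "cache3"])
print(ring.route("user:42"))
# Removing cache2 only remaps the keys on cache2's arc of the ring;
# keys owned by cache1 and cache3 stay put.
```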
Distribution at scale
Inside a single data center, one load balancer or a small set of balancers is enough. Across multiple data centers, the problem is harder, because the balancer needs to be reachable from any client and yet should send the client to a nearby data center.
Two common techniques are anycast and GeoDNS, both covered in the CDN lecture. Anycast advertises the same IP address from many data centers, and the internet’s routing protocol (BGP) sends each client to whichever data center is closest by network distance. GeoDNS returns different IP addresses to clients in different regions when they resolve a hostname, sending each client to the closest cluster’s load balancer.
DNS-based load balancing, also covered in the CDN lecture, works at a coarser level than connection-based balancing: the DNS server hands out IP addresses in a controlled rotation, and each client uses the IP it received for the lifetime of its DNS cache. It is cheap and protocol-agnostic, but slow to react to backend changes because of DNS caching.
Session affinity
Some workloads need a client’s requests to land on the same backend for the lifetime of a session. A shopping cart kept in server memory is the classic example. Session affinity (also called sticky sessions) routes a client’s requests to the same backend, usually by hashing the client’s IP or by setting a cookie that names the chosen backend. Affinity simplifies application code at the cost of less even distribution and harder failover, since a backend’s death loses every session it was holding. Stateless service design (with session state in a shared store such as Redis) is the modern alternative and avoids affinity entirely.