pk.org: CS 417/Lecture Notes

Coordination Services and Network-Attached Storage - Study Guide

Paul Krzyzanowski – 2026-03-03

Coordination Services

Distributed systems frequently need to answer questions like: who is the current leader? Is this lock held? What is the current configuration? Getting these answers wrong can be catastrophic. If two nodes both believe they are the leader, they will independently accept writes and the system state will diverge.

The natural approach is a dedicated replicated coordinator made fault-tolerant through consensus. A coordination service is exactly that: a small, strongly consistent, highly available store that distributed applications use to coordinate decisions and share small amounts of control-plane state. Every coordination service we study – Chubby, ZooKeeper, and etcd – uses consensus internally. This is not incidental: the problems they solve are precisely the problems that require consensus.

What Coordination Services Provide

All three services share a core set of capabilities: a small, strongly consistent data store replicated through consensus; client sessions that double as a failure detector; change notifications (Chubby's cache invalidations, ZooKeeper's watches, etcd's watches); and self-cleaning state (ephemeral znodes, leases) from which locks, leader election, and service discovery are built.

Chubby

Chubby is a lock service and configuration store from Google. A Chubby deployment is a cell of five replicas, one of which is the master elected via Paxos. The other replicas participate in consensus but redirect client requests to the master. Three of five replicas must be alive for the cell to function, allowing two simultaneous failures.

Chubby exposes a file system interface: locks, configuration data, and service addresses are all stored as named files in a hierarchical namespace. Locks are advisory (not enforced by the system if code chooses to ignore them) and coarse-grained (held for long periods rather than milliseconds).

When a client opens a file it receives a lease – a time-bounded guarantee that its cached copy is valid. The master sends cache invalidations to other clients when a file is written. If the master fails, a new one is elected, broadcasts a new epoch, and gives clients a grace period to reconnect. Clients that do not reconnect within the grace period have their sessions and locks released.

ZooKeeper

ZooKeeper was developed at Yahoo and is open-source. Rather than providing locks as a primitive, it provides building blocks from which locks, leader election, barriers, and other coordination patterns can be constructed.

Data is stored in a tree of znodes, each holding a small amount of data. There are two types of znodes:

  1. Persistent znodes survive client disconnection and remain until explicitly deleted.

  2. Ephemeral znodes are automatically deleted when the client session that created them ends. This is the key mechanism for failure detection.

Either type can optionally be created as a sequential znode, which causes ZooKeeper to append a monotonically increasing integer to the name. This is essential for implementing locks without thundering-herd problems.

The thundering herd problem occurs when many clients are all waiting for the same condition and a single state change wakes them all simultaneously. They all rush to retry at once, creating a burst of load on the coordination service, while only one of them can make progress. In a ZooKeeper lock implementation, the fix is to have each waiting client watch only its immediate predecessor in the queue rather than the lock node itself. When the lock is released, exactly one client is notified instead of all of them.
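
The predecessor-watch idea can be sketched with a toy in-memory model. This is a simulation of the pattern, not the real ZooKeeper API; all class and method names are illustrative:

```python
# Toy model of a ZooKeeper-style lock queue: each contender creates a
# sequential node and watches ONLY its immediate predecessor, so a
# release wakes exactly one waiter instead of the whole herd.

class LockQueue:
    def __init__(self):
        self.counter = 0       # source of sequence numbers
        self.nodes = []        # live nodes, in creation (queue) order
        self.watches = {}      # node being watched -> the watcher's node
        self.notified = []     # order in which waiters were woken

    def contend(self):
        """Create a sequential node; watch the predecessor if not first."""
        seq = self.counter
        self.counter += 1
        self.nodes.append(seq)
        idx = self.nodes.index(seq)
        if idx > 0:
            self.watches[self.nodes[idx - 1]] = seq  # watch predecessor only
        return seq

    def holder(self):
        """The lowest-numbered live node holds the lock."""
        return self.nodes[0] if self.nodes else None

    def release(self, seq):
        """Delete the node; fire the single watch set on it, if any."""
        self.nodes.remove(seq)
        watcher = self.watches.pop(seq, None)
        if watcher is not None:
            self.notified.append(watcher)  # exactly one client is woken
```

With three contenders, releasing the lock notifies only the second in line; the third keeps waiting on its own predecessor.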

ZooKeeper uses Zab (ZooKeeper Atomic Broadcast), a consensus protocol similar to Raft, to replicate writes through a leader in a globally consistent order. Reads can be served by any replica and are sequentially consistent; clients can issue a sync to ensure a replica has caught up to recently committed writes before reading.

Watches are one-shot notifications. A client sets a watch when it reads a znode (checking existence, data, or children), and ZooKeeper delivers an event if the relevant state changes. The watch fires once and is then removed; the client must re-register if it wants continued monitoring. The usual pattern is to treat the event as “something changed,” re-read the current state from ZooKeeper, and re-register the watch. This keeps clients consistent even if multiple changes occur quickly or during a brief disconnection.
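
The re-read-and-re-register loop can be shown with a toy model (a simulation of one-shot watch semantics, not the ZooKeeper client API):

```python
# Toy model of one-shot watches: a notification carries no data, only
# "something changed"; the client re-reads the current state and
# re-registers, so rapid back-to-back changes are never missed.

class WatchedNode:
    def __init__(self, data):
        self.data = data
        self.watchers = []           # one-shot callbacks

    def watch(self, callback):
        self.watchers.append(callback)

    def set(self, data):
        self.data = data
        fired, self.watchers = self.watchers, []  # fire once, then remove
        for cb in fired:
            cb()

class WatchingClient:
    def __init__(self, node):
        self.node = node
        self.seen = None
        self._refresh()

    def _refresh(self):
        self.seen = self.node.data   # re-read the current state...
        self.node.watch(self._refresh)  # ...and re-register the watch
```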

etcd

etcd was created by CoreOS and is the authoritative store for Kubernetes cluster state. It uses Raft for consensus. Unlike ZooKeeper’s hierarchical namespace, etcd stores a flat key-value map: keys are arbitrary byte strings, and hierarchy is a naming convention, not an enforced structure. Prefix range queries and prefix watches provide directory-like behavior.

etcd routes reads through the leader by default, giving linearizable reads – reads that reflect the most recently committed write, as if the entire system had a single consistent view at that instant. Serializable reads (served locally by any replica) are available as an opt-in for workloads that can tolerate slight staleness. Leases with TTLs provide the same self-cleaning behavior as ZooKeeper’s ephemeral znodes. Transactions with compare-and-swap conditions enable atomic leader election and lock acquisition.
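
The compare-and-swap election can be sketched with a toy versioned store. This mimics the shape of an etcd transaction ("if the key is absent, put my name") but is not the etcd API; names are illustrative:

```python
# Toy model of an etcd-style conditional transaction used for leader
# election: the put succeeds only if no one else holds the key, and
# lease expiry (simulated by delete) frees it for the next contender.

class ToyKV:
    def __init__(self):
        self.store = {}       # key -> (value, create_revision)
        self.revision = 0

    def txn_put_if_absent(self, key, value):
        """Atomically claim key if unowned; return (success, owner)."""
        if key in self.store:             # compare failed: already owned
            return False, self.store[key][0]
        self.revision += 1
        self.store[key] = (value, self.revision)
        return True, value

    def delete(self, key):
        """What lease expiry would do when the holder's session dies."""
        self.store.pop(key, None)
```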

Common Coordination Patterns

These patterns apply to all three coordination services; only the specific primitives differ.

Leader election. Replicas contend for a well-known name in the coordination service. Exactly one wins. The winner’s claim disappears when its session expires, allowing others to contend again.

Distributed locks. Acquiring the lock is a write through consensus, giving global ordering. Locks built on ephemeral nodes or leases are self-cleaning: a crashed holder’s session expires and the lock releases automatically. Coordination services are suited to coarse-grained locks: locks held for long periods protecting large resources (a master election, a configuration update). They are not suited to fine-grained locks held for milliseconds to protect individual rows or records. High-frequency lock acquisitions and releases would overwhelm a system built around consensus.

Configuration management. Services store configuration in the coordination service. Updates go through consensus and are applied consistently. Clients watch configuration keys for changes.

Service discovery. A running instance registers its address under a known prefix using an ephemeral key. The list of active instances stays current because dead servers’ keys expire automatically.

Fencing tokens. A monotonically increasing number associated with each lock grant. The protected resource rejects any request carrying a token lower than the highest it has seen, preventing a stale lock holder that woke up after a pause from corrupting shared state.
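
A minimal sketch of fencing at the protected resource (illustrative, not tied to any particular service's API):

```python
# Fencing sketch: the resource remembers the highest token it has seen
# and rejects anything lower, so a stale lock holder that resumes after
# a long pause cannot overwrite state written under a newer grant.

class FencedResource:
    def __init__(self):
        self.highest_token = -1
        self.data = None

    def write(self, token, value):
        if token < self.highest_token:
            return False          # stale holder: request rejected
        self.highest_token = token
        self.data = value
        return True
```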

What Coordination Services Do Not Do

A coordination service stores small amounts of metadata. It is not a database, not a message queue, and not suitable for data-plane operations or high-throughput writes. A useful rule of thumb: if the data is on the critical path of every client request, it does not belong in a coordination service.

More fundamentally, electing a leader through a coordination service does not guarantee that the application as a whole behaves correctly. Coordination serializes decisions; application correctness is still the developer’s responsibility.


Network-Attached Storage

Access Transparency and VFS

The goal of networked file systems is access transparency: applications use standard file system calls (open, read, write, close) against remote files without any awareness that the storage is remote. This is achieved through the Virtual File System (VFS) layer, adopted by every major Unix-derived OS. VFS defines a standard interface that any file system driver must implement. The kernel always talks to this interface; whether the driver beneath it issues disk commands or sends network requests is invisible to applications.

Mount points attach different file systems into a single directory tree. A remote file system client is a VFS driver that translates standard file operations into network requests. When the response arrives, the result passes back through the VFS interface to the application.

Design Dimensions

Every networked file system must navigate three fundamental tradeoffs:

Consistency. Multiple clients may cache the same file. Keeping those caches consistent requires either frequent polling against the server or a protocol where the server pushes invalidations to clients.

State. A stateless server holds no information about client activity between requests. Every request is self-contained. Crash recovery is trivial because there is nothing to recover. But statelessness makes locks, open file tracking, and cache invalidation impossible. A stateful server enables richer semantics at the cost of recovery complexity: after a crash, open files, locks, and cached state must be rebuilt or cleaned up.

Caching. Options range from write-through (immediate server update), to write-behind (delayed batch), to write-on-close / session semantics (send changes only on close). Callbacks, where the server tracks which clients have cached a file and pushes invalidations on modification, require statefulness but eliminate polling.

NFS

NFS was designed to be simple, stateless, and interoperable across any networked system. It was built on openly published RPC and data encoding standards and was ported to many operating systems in both client and server roles.

Because the server is stateless, NFSv2 has no OPEN, CLOSE, LOCK, SEEK, or APPEND procedures. Clients identify files by file handles: opaque server-generated identifiers that persist across server restarts. The server uses UDP as the transport because the stateless, idempotent design makes retries safe.

The key limitation of stateless NFS is weak consistency. NFS caches data in blocks and validates cached data using timestamp comparison: the client checks the file's modification time on the server when a file is opened, and again after a short validity timeout. This gives close-to-open consistency: stale reads are possible between opens.
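
Close-to-open consistency can be sketched with a toy client and server (a simulation of the validation logic, not the NFS protocol):

```python
# Toy model of NFS-style cache validation: on open, the client compares
# its cached modification time against the server's and discards a stale
# cache. Between opens, reads come from the cache without revalidation,
# so they can be stale.

class Server:
    def __init__(self, data):
        self.data, self.mtime = data, 0

    def write(self, data):
        self.data, self.mtime = data, self.mtime + 1

class Client:
    def __init__(self, server):
        self.server = server
        self.cache = None            # (data, mtime) or None

    def open_and_read(self):
        """Open: validate by timestamp; refetch if stale."""
        if self.cache and self.cache[1] == self.server.mtime:
            return self.cache[0]     # cache still valid
        self.cache = (self.server.data, self.server.mtime)
        return self.cache[0]

    def read_cached(self):
        """Read between opens: no revalidation, possibly stale."""
        return self.cache[0] if self.cache else self.open_and_read()
```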

AFS

AFS was designed to fix the scalability problem of NFS. Workload measurements showed that most file accesses are reads, files are usually accessed by one user at a time, and most files are small enough to cache entirely. This motivated the upload/download model and whole-file caching: when a file is opened, the entire file is downloaded to the client’s local disk. Reads and writes operate on the local copy. On close, if modified, the file is uploaded back. This gives session semantics: changes are visible to other clients only after the file is closed.

The mechanism that makes aggressive caching safe is the callback promise: when the server delivers a file to a client, it promises to notify the client if the file is modified. When a client uploads a modified file, the server sends callback revocations to all other clients that hold the file. Those clients invalidate their cached copies. Because files are read far more than they are written, most accesses proceed from the local cache with no server interaction at all.
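
The callback mechanism can be sketched with a toy server and two clients (a simulation of the idea, not the AFS protocol):

```python
# Toy model of AFS callback promises: the server remembers which clients
# cache each file and, when a modified copy is uploaded, revokes the
# callbacks of everyone else so they drop their whole-file copies.

class AFSServer:
    def __init__(self):
        self.files = {}          # name -> contents
        self.callbacks = {}      # name -> set of clients with a promise

    def fetch(self, name, client):
        """Deliver the file and record a callback promise."""
        self.callbacks.setdefault(name, set()).add(client)
        return self.files.get(name)

    def store(self, name, contents, writer):
        """On upload, revoke every other client's callback."""
        self.files[name] = contents
        for client in self.callbacks.get(name, set()) - {writer}:
            client.invalidate(name)          # callback revocation
        self.callbacks[name] = {writer}

class AFSClient:
    def __init__(self, server):
        self.server, self.cache = server, {}

    def open(self, name):
        if name not in self.cache:           # whole-file download
            self.cache[name] = self.server.fetch(name, self)
        return self.cache[name]

    def close_modified(self, name, contents):
        self.cache[name] = contents
        self.server.store(name, contents, self)

    def invalidate(self, name):
        self.cache.pop(name, None)
```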

AFS enforces a uniform global namespace: all AFS content appears under /afs on every client machine, with the cell name (e.g., cs.rutgers.edu) as the second path component. The same path resolves to the same file regardless of which client machine the user is on. NFS has no such guarantee; administrators mount remote directories at arbitrary local paths. File system content is organized into volumes that administrators can move between servers transparently via referrals.

Coda

Coda extended AFS to support laptops and mobile workstations that might lose network connectivity. Its key concepts: disconnected operation (clients keep working against their whole-file caches while offline), hoarding (prefetching the files a user is expected to need before disconnection), reintegration (replaying logged updates to the server when connectivity returns), and conflict detection when a file was modified both locally and on the server during the disconnection.

AFS and Coda are no longer widely deployed. AFS survives at some universities and research institutions, but its operational complexity and aging authentication model have made it difficult to justify in new deployments. Coda remained a research prototype.

SMB

Microsoft’s Server Message Block protocol was designed with the opposite philosophy from NFS: stateful, connection-oriented, and built to enforce Windows file-sharing semantics. SMB tracks every open file, every lock, and every byte range under lock at the server. This enabled mandatory locking, byte-range locks, and the semantics Windows applications expected. The cost was that server crashes lost all session state.

Opportunistic locks (oplocks) give the server a way to grant clients caching rights. The server monitors file access and sends an oplock break to the caching client when a conflict arises, requiring it to flush writes before the server allows the competing open. This is the same idea as AFS callbacks, applied at finer granularity. Later versions of Windows generalized oplocks into leases with cleaner semantics that can also cover directory metadata.

SMB 2 dramatically modernized the protocol: it reduced the command set from over a hundred operations to about nineteen, added request pipelining and compounding so many operations can share a round trip, introduced credit-based flow control so clients can keep many requests in flight, and supported much larger read and write sizes.

SMB 3 added high-availability and datacenter features: Transparent Failover lets a client survive the failure of one node in a clustered file server without losing open files or locks, and SMB Multichannel allows a session to use multiple network interfaces simultaneously for throughput and redundancy. SMB 3 also added protocol-level encryption.

macOS adopted SMB 2 as its default file sharing protocol (replacing AFP). macOS also supports NFS for Unix-oriented environments, but SMB is the default.

NFSv4

NFSv4 abandoned statelessness. Clients now open and close files explicitly, and the server tracks state. Key improvements over NFSv2/v3: compound RPCs that batch several operations into one round trip, delegations that grant a client exclusive caching rights until the server recalls them, locking integrated into the core protocol, mandatory strong authentication, and a single TCP connection on one well-known port.

The Convergence

The key mechanisms that modern NFS and SMB both now provide, starting from very different origins:

Mechanism                      NFS v2/v3    NFSv4               SMB 1            SMB 2+
Stateful server                No           Yes                 Yes              Yes
Compound/pipelined requests    No           Yes                 No               Yes
Client caching grants          No           Yes (delegations)   Yes (oplocks)    Yes (oplocks + leases)
Server-to-client notification  No           Yes                 Yes              Yes
Referrals                      No           Yes                 Yes (via DFS)    Yes (via DFS)
Strong authentication          Optional     Mandatory           NTLM/Kerberos    Kerberos/NTLMv2
Transport                      UDP or TCP   TCP only            TCP              TCP

NFS is dominant in Linux, Unix, and HPC environments. SMB is dominant in Windows enterprise environments and is the default on macOS.

Microsoft’s referral support comes via DFS (Distributed File System), a separate namespace service that has worked alongside SMB since the late 1990s. DFS maps logical paths to physical server locations and issues referrals when clients access those paths. It is not specific to SMB 2 or later; it predates SMB 2 and works across SMB versions.

Consistency Semantics Summary

NFSv2/v3 provides close-to-open consistency: caches are validated at open, so stale reads are possible between opens. AFS provides session semantics: a writer's changes become visible to other clients only when the file is closed and callbacks revoke their cached copies. SMB (with oplocks and leases) and NFSv4 (with delegations) provide server-coordinated caching: a client caches aggressively only while it holds a grant, which the server revokes when a conflicting access arises.

What You Do Not Need to Memorize

