
Synthesis Over Innovation

Dynamo’s power comes not from inventing new techniques but from the thoughtful orchestration of existing ones into a coherent whole.

Dynamo’s architecture represents the synthesis of multiple distributed systems techniques, each addressing a specific challenge in building a highly available key-value store. The true innovation lies not in the individual components but in how they work together as a coherent system.

Architectural Components

| Problem | Technique | Key Advantage |
| --- | --- | --- |
| Partitioning | Consistent Hashing with Virtual Nodes | Incremental scalability, even load distribution |
| High availability for writes | Vector Clocks with read-time reconciliation | Version size decoupled from update rates |
| Temporary failures | Sloppy Quorums and Hinted Handoff | Availability maintained when replicas unavailable |
| Permanent failures | Anti-Entropy Protocols using Merkle Trees | Efficient background replica synchronization |
| Membership & failure detection | Gossip Protocols | No centralized registry, preserves symmetry |

The Data Flow

Write Path

sequenceDiagram
    participant C as Client
    participant Coord as Coordinator Node
    participant R1 as Replica 1
    participant R2 as Replica 2

    C->>Coord: put(key, value)
    Coord->>Coord: Generate new vector clock
    Coord->>Coord: Write locally
    par Parallel replication
        Coord->>R1: Replicate with vector clock
        Coord->>R2: Replicate with vector clock
    end
    R1-->>Coord: ACK
    R2-->>Coord: ACK
    Note over Coord: Wait for W responses
    Coord-->>C: Success (when W nodes confirm)

  1. Any node can receive a client request and act as coordinator
  2. Coordinator generates a new vector clock for the version
  3. Writes locally and sends to N-1 successors in the preference list
  4. Returns success when W nodes acknowledge (including itself)
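
A minimal sketch of this write path, assuming node objects with a `store()` method, a precomputed `preference_list`, and vector clocks represented as plain dicts mapping node name to counter; this is illustrative, not Dynamo's actual API:

```python
import concurrent.futures

def coordinator_put(key, value, context, local_node, preference_list, N=3, W=2):
    # 1. Advance this coordinator's counter in the vector clock from the client's context.
    clock = dict(context.get("vector_clock", {}))
    clock[local_node.name] = clock.get(local_node.name, 0) + 1

    # 2. Write locally; the coordinator counts toward W.
    local_node.store(key, value, clock)
    acks = 1

    # 3. Replicate to the other N-1 nodes in the preference list in parallel.
    replicas = [n for n in preference_list[:N] if n is not local_node]
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(n.store, key, value, clock) for n in replicas]
        # 4. Succeed as soon as W nodes (including this one) have confirmed.
        for done in concurrent.futures.as_completed(futures):
            if done.result():
                acks += 1
            if acks >= W:
                return {"ok": True, "vector_clock": clock}
    return {"ok": acks >= W, "vector_clock": clock}
```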

Read Path

sequenceDiagram
    participant C as Client
    participant Coord as Coordinator
    participant R1 as Replica 1
    participant R2 as Replica 2
    participant R3 as Replica 3

    C->>Coord: get(key)
    par Request all replicas
        Coord->>R1: Read request
        Coord->>R2: Read request
        Coord->>R3: Read request
    end
    R1-->>Coord: Version v1 [clock: (A,2)]
    R2-->>Coord: Version v2 [clock: (A,2)(B,1)]
    R3-->>Coord: Version v3 [clock: (A,2)(C,1)]
    Note over Coord: Wait for R responses
    Coord->>Coord: Detect v2 and v3 are conflicting
    Coord-->>C: Return all versions + context
    C->>C: Semantic reconciliation
    C->>Coord: put(key, merged_value, context)

  1. Coordinator requests all versions from N highest-ranked reachable nodes
  2. Waits for R responses before returning
  3. If multiple causally unrelated versions exist, returns all for application reconciliation
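
A sketch of the read-side pruning step, assuming vector clocks are plain dicts mapping node name to counter; `descends` and `reconcile_on_read` are illustrative helper names:

```python
def descends(a, b):
    """True if clock `a` is causally equal to or newer than clock `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def reconcile_on_read(versions):
    """Keep only versions no other version dominates; >1 survivor means a conflict.

    `versions` is a list of (value, vector_clock) pairs gathered from R replicas.
    """
    surviving = []
    for value, clock in versions:
        dominated = any(descends(other, clock) and other != clock
                        for _, other in versions)
        if not dominated:
            surviving.append((value, clock))
    return surviving

# The example from the diagram above: v1 is an ancestor, v2 and v3 conflict.
v1 = ("cart-a", {"A": 2})
v2 = ("cart-b", {"A": 2, "B": 1})
v3 = ("cart-c", {"A": 2, "C": 1})
print(reconcile_on_read([v1, v2, v3]))   # returns v2 and v3 for the client to merge
```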

Tunable Parameters: N, R, W

The three core parameters allow fine-grained control over system behavior:

N (Replication Factor): Total nodes storing each data item

  • Higher N = better durability, more storage cost
  • Typical: N=3

R (Read Quorum): Minimum nodes that must respond to reads

  • Higher R = stronger consistency, slower reads
  • Lower R = faster reads, potential stale data

W (Write Quorum): Minimum nodes that must acknowledge writes

  • Higher W = higher durability, slower writes
  • Lower W = faster writes, risk of data loss on failures

Configuration Patterns

Quorum guarantee (R + W > N): Read and write sets overlap, ensuring reads see latest writes

  • Common: (N=3, R=2, W=2) - balanced approach

High-performance reads (R=1, W=N): Read from any single replica

  • Use case: Read-heavy workloads with infrequent updates
  • Trade-off: Slower writes, faster reads

High-performance writes (R=N, W=1): Acknowledge once a single replica confirms the write, read from all replicas

  • Use case: Write-heavy workloads requiring strong read consistency
  • Trade-off: Faster writes, slower reads

graph TD
    A[Choose Configuration] --> B{Workload Pattern?}
    B -->|Balanced| C[N=3, R=2, W=2<br/>Quorum guarantee]
    B -->|Read-heavy| D[N=3, R=1, W=3<br/>Fast reads]
    B -->|Write-heavy| E[N=3, R=3, W=1<br/>Fast writes]
    B -->|Maximum availability| F[N=3, R=1, W=1<br/>Accept inconsistency]
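
The N/R/W relationship and the R + W > N overlap rule can be captured in a few lines; `QuorumConfig` is an illustrative name, not anything Dynamo exposes:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QuorumConfig:
    n: int  # replication factor: nodes storing each item
    r: int  # read quorum: responses required before a read returns
    w: int  # write quorum: acknowledgements required before a write succeeds

    @property
    def overlapping(self) -> bool:
        # R + W > N means every read quorum intersects every write quorum,
        # so a read contacts at least one replica holding the latest acknowledged write.
        return self.r + self.w > self.n

for cfg in (QuorumConfig(3, 2, 2),    # balanced
            QuorumConfig(3, 1, 3),    # fast reads
            QuorumConfig(3, 3, 1),    # fast writes
            QuorumConfig(3, 1, 1)):   # maximum availability, overlap not guaranteed
    print(cfg, "overlapping quorums:", cfg.overlapping)
```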

The Ring: Partitioning and Placement

Consistent Hashing with Virtual Nodes forms the foundation of Dynamo’s partitioning strategy:

graph LR
    subgraph Hash Ring
    A[Node A<br/>tokens: 40°, 120°, 280°]
    B[Node B<br/>tokens: 80°, 200°, 340°]
    C[Node C<br/>tokens: 20°, 160°, 320°]
    end

    K[Key 'cart-123'<br/>hash: 95°] --> A
    K2[Key 'user-456'<br/>hash: 145°] --> C

Preference list construction:

  • For key K, find coordinator via hash position
  • Walk clockwise to find next N-1 distinct physical nodes
  • Skip virtual nodes belonging to same physical server
  • Result: N distinct physical nodes responsible for key K
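
A compact sketch of the ring walk described above; the 360-position ring and token naming are illustrative simplifications:

```python
import bisect
import hashlib

def ring_position(value: str, ring_size: int = 360) -> int:
    # Degree-sized ring to match the diagram; real systems use the full hash space.
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % ring_size

class Ring:
    def __init__(self, nodes, tokens_per_node=3):
        # Each physical node places several virtual-node tokens on the ring.
        self.tokens = sorted(
            (ring_position(f"{node}-vn{i}"), node)
            for node in nodes
            for i in range(tokens_per_node)
        )

    def preference_list(self, key, n=3):
        """Walk clockwise from the key, collecting N distinct physical nodes."""
        start = bisect.bisect_left(self.tokens, (ring_position(key),))
        result = []
        for i in range(len(self.tokens)):
            _, node = self.tokens[(start + i) % len(self.tokens)]
            if node not in result:        # skip virtual nodes of servers already chosen
                result.append(node)
            if len(result) == n:
                break
        return result                     # coordinator first, then N-1 successors

ring = Ring(["A", "B", "C", "D"])
print(ring.preference_list("cart-123"))
```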

Handling Failures

Temporary Failures: Availability First

Sloppy Quorums and Hinted Handoff keep writes available even when some of a key's primary replicas are unreachable. The coordinator uses the first N healthy nodes encountered while walking the ring rather than strictly the first N nodes; a substitute node stores the write along with a hint naming the intended replica and hands the data back once that replica recovers.
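
A sketch of how a coordinator might choose sloppy-quorum targets and attach hints, assuming node health is already known locally (names and structure are illustrative):

```python
def sloppy_write_targets(preference_list, healthy, n=3):
    """Pick the first N healthy nodes walking the ring, not strictly the first N.

    Returns (node, hint) pairs; the hint names the down replica a substitute
    node is temporarily holding the write for.
    """
    intended = preference_list[:n]
    targets = []
    for node in preference_list:
        if len(targets) == n:
            break
        if node not in healthy:
            continue
        hint = None
        if node not in intended:
            # This node is a stand-in: remember which down replica it replaces.
            already_hinted = {h for _, h in targets if h is not None}
            hint = next(p for p in intended
                        if p not in healthy and p not in already_hinted)
        targets.append((node, hint))
    return targets

# Replica "B" is down, so "D" accepts the write with a hint for "B" and will
# hand the data back to "B" when it recovers (hinted handoff).
print(sloppy_write_targets(["A", "B", "C", "D", "E"], healthy={"A", "C", "D", "E"}))
```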

Permanent Failures: Consistency Eventually

Anti-Entropy Protocols using Merkle Trees run continuously in the background to detect and repair divergent replicas. Efficient tree-based comparison minimizes network and disk I/O by transferring only divergent keys.
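
A simplified sketch of the idea; a real implementation keeps one Merkle tree per key range and walks mismatching subtrees instead of comparing leaf lists directly:

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def merkle_root(leaf_hashes):
    """Fold a list of leaf hashes (one per bucket of keys) up to a single root."""
    level = list(leaf_hashes)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])                   # duplicate last leaf if odd
        level = [h((a + b).encode()) for a, b in zip(level[0::2], level[1::2])]
    return level[0] if level else h(b"")

def divergent_buckets(local_leaves, remote_leaves):
    """Cheap root comparison first; only on mismatch locate the differing buckets."""
    if merkle_root(local_leaves) == merkle_root(remote_leaves):
        return []                                     # in sync: nothing is transferred
    return [i for i, (a, b) in enumerate(zip(local_leaves, remote_leaves)) if a != b]

local  = [h(b"k1=v1"), h(b"k2=v2"), h(b"k3=v3"), h(b"k4=v4")]
remote = [h(b"k1=v1"), h(b"k2=STALE"), h(b"k3=v3"), h(b"k4=v4")]
print(divergent_buckets(local, remote))               # only bucket 1 needs its keys exchanged
```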

Membership and Failure Detection

Gossip Protocols provide decentralized coordination for membership propagation and failure detection. Administrators explicitly add/remove nodes, and changes propagate via periodic peer-to-peer communication, creating an eventually consistent view. Seed nodes prevent logical partitions, and local failure detection allows nodes to route around unresponsive peers without global consensus.

graph TB
    subgraph Gossip Round
        N1[Node 1] -->|gossip| N3[Node 3]
        N2[Node 2] -->|gossip| N4[Node 4]
        N3 -->|gossip| N2
        N4 -->|gossip| Seed[Seed Node]
    end

    Seed -.->|Eventually consistent<br/>membership view| N1
    Seed -.-> N2
    Seed -.-> N3
    Seed -.-> N4
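
A toy push-pull gossip round over membership views, where each view maps member name to a version counter and higher versions win on merge; the node names and seed handling are illustrative:

```python
import random

def merge(into, other):
    # Keep the highest version number seen for each member.
    for member, version in other.items():
        if version > into.get(member, -1):
            into[member] = version

def gossip_round(views, seed="seed"):
    """One round: every node push-pulls with one random peer, then with the seed.

    Contacting the seed prevents logical partitions in the membership view.
    """
    names = list(views)
    for name in names:
        peer = random.choice([n for n in names if n != name])
        merge(views[peer], views[name])   # push
        merge(views[name], views[peer])   # pull
        if seed in views:
            merge(views[name], views[seed])
            merge(views[seed], views[name])

views = {
    "seed": {"seed": 1, "N1": 1, "N2": 1},
    "N1": {"N1": 1},
    "N2": {"N2": 1},
    "N3": {"N3": 1},          # newly added node, only it knows about itself so far
}
for _ in range(3):
    gossip_round(views)
print(views["N1"])             # converges toward the full membership set
```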

Operational Simplicity Trumps Elegance

Dynamo abandoned its original elegant partitioning scheme for fixed-size partitions—sometimes the best design is the most operationally boring one.

Evolution: Partitioning Strategies

Dynamo’s partitioning approach evolved through three strategies:

Strategy 1 (Initial): T random tokens per node, partition by token value

  • Problem: Partitioning coupled with placement, slow bootstrapping

Strategy 2 (Interim): T random tokens per node, equal-sized partitions

  • Problem: Worst load balancing efficiency

Strategy 3 (Final): Q/S tokens per node, equal-sized partitions

  • Q = total partitions, S = number of nodes
  • Advantages: Best load balancing, faster bootstrapping/recovery, simpler archival
  • Fixed partition ranges transfer as single files
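
A toy illustration of Strategy 3's fixed, equal-sized partitions; the round-robin placement here is a simplification standing in for Dynamo's actual token assignment:

```python
def partition_of(key_hash: int, q: int) -> int:
    """With Q equal-sized partitions, a key maps to a partition by simple bucketing."""
    return key_hash % q

def partitions_owned_by(node_index: int, s: int, q: int):
    """Hand out the Q partitions so each of the S nodes owns about Q/S of them."""
    return [p for p in range(q) if p % s == node_index]

Q, S = 8, 4                               # 8 fixed partitions across 4 nodes
for node in range(S):
    print(f"node {node} owns partitions {partitions_owned_by(node, S, Q)}")
print("key with hash 1234 lives in partition", partition_of(1234, Q))
```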

Integration Patterns

The architecture supports multiple use patterns:

Business logic reconciliation: Applications perform semantic merging (shopping carts)

  • Configuration: (N=3, R=2, W=2)

Timestamp reconciliation: Datastore uses “last write wins”

  • Use case: Session state, caches
  • Configuration: (N=3, R=2, W=2) with timestamp-based conflict resolution

High-performance read engine: Replicated cache

  • Configuration: (N=3, R=1, W=3)
  • Trade-off: Fast reads, slower writes
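
Two toy reconciliation policies matching the first two patterns above, a semantic shopping-cart merge and last-write-wins by timestamp (function names are illustrative):

```python
def merge_carts(versions):
    """Business-logic reconciliation: union the items from all conflicting carts.

    Deleted items can resurface, which is exactly the anomaly the paper accepts.
    """
    merged = set()
    for cart in versions:
        merged |= set(cart)
    return sorted(merged)

def last_write_wins(versions):
    """Timestamp reconciliation: keep only the version with the newest timestamp."""
    return max(versions, key=lambda v: v["timestamp"])["value"]

print(merge_carts([{"book", "pen"}, {"book", "lamp"}]))       # ['book', 'lamp', 'pen']
print(last_write_wins([{"value": "s1", "timestamp": 10},
                       {"value": "s2", "timestamp": 12}]))    # 's2'
```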

Key Architectural Insights

  1. No single point of failure: Every component is decentralized—partitioning, replication, failure detection, membership
  2. Symmetry simplifies operations: All nodes have identical responsibilities
  3. Tunable trade-offs: Applications control consistency/availability/performance balance
  4. Optimistic replication: Accept writes immediately, reconcile conflicts later
  5. Push complexity to reads: Write path is fast and always available; read path handles reconciliation

The architecture demonstrates that eventual consistency can be a practical foundation for large-scale systems when availability is paramount.

See Dynamo - Highly Available Key-Value Store for the broader context and design philosophy.