The Google File System (GFS) is a scalable distributed file system designed for large data-intensive applications running on commodity hardware. Presented at SOSP 2003 by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, GFS challenged traditional file system assumptions by co-designing the storage system with Google’s specific workload characteristics.
Design Philosophy
Component failures aren't edge cases to handle—they're the steady-state condition to design for
At scale, treating failure as exceptional guarantees your system will spend most of its time in “exceptional” states.
GFS embraces component failures as the norm, not the exception. When building systems from hundreds or thousands of inexpensive commodity machines, failures of disks, memory, connectors, and entire servers are guaranteed.
The system makes four radical departures from traditional file system design:
Large Files Are Standard: Multi-GB files are common, each containing many application objects. This shifts design parameters—I/O operations, block sizes, and caching strategies must be revisited when managing terabytes comprising billions of objects.
Append-Centric Workloads: Most files are mutated by appending new data rather than overwriting existing data. Random writes within files are practically non-existent. Once written, files are typically read sequentially. This access pattern makes appending the focus of performance optimization while rendering client-side caching nearly useless.
Weakening guarantees can strengthen systems when applications co-evolve with infrastructure
GFS traded POSIX compliance for atomic appends—a relaxation that actually simplified both the file system and the applications using it.
Co-Designed API: By relaxing the consistency model and introducing atomic append operations, GFS simplifies the file system without imposing onerous burdens on applications. Multiple clients can append concurrently to a file without extra synchronization.
Bandwidth Over Latency: High sustained bandwidth matters more than low latency. Applications process data in bulk at high rates rather than requiring stringent response times for individual operations.
Architecture
```mermaid
graph TB
    subgraph Clients
        C1[Client 1]
        C2[Client 2]
        C3[Client 3]
    end
    M[Master - Single point for metadata]
    subgraph Chunkservers
        CS1["Chunkserver 1 - Chunks: 1, 4, 7"]
        CS2["Chunkserver 2 - Chunks: 1, 2, 5"]
        CS3["Chunkserver 3 - Chunks: 2, 4, 8"]
    end
    C1 -.->|"Metadata request"| M
    M -.->|"Chunk locations"| C1
    C1 -->|"Read/Write data"| CS1
    C1 -->|"Read/Write data"| CS2
    M -.->|"HeartBeat"| CS1
    M -.->|"HeartBeat"| CS2
    M -.->|"HeartBeat"| CS3
    style M fill:#e1f5ff
    style CS1 fill:#fff4e1
    style CS2 fill:#fff4e1
    style CS3 fill:#fff4e1
```
A GFS cluster consists of a single master and multiple chunkservers, accessed by multiple clients. Each typically runs as a user-level server process on a commodity Linux machine.
Single Master Design
Centralization becomes a bottleneck only if you route work through the center
GFS’s single master handles metadata while data flows peer-to-peer—turning an apparent single point of failure into a coordination advantage.
Having a single master vastly simplifies the design and enables sophisticated chunk placement and replication decisions using global knowledge. However, GFS minimizes the master's involvement in reads and writes so that it does not become a bottleneck.
Clients never read or write file data through the master. Instead:
```mermaid
sequenceDiagram
    participant Client
    participant Master
    participant Chunkserver
    Note over Client: Translate filename + byte offset<br/>to chunk index
    Client->>Master: Request metadata<br/>(filename, chunk index)
    Master->>Client: Chunk handle + replica locations
    Note over Client: Cache metadata
    Client->>Chunkserver: Read/Write data directly
    Chunkserver->>Client: Data response
```
The client interaction flow:
- Client translates filename and byte offset into chunk index using fixed chunk size
- Client sends master a request with filename and chunk index
- Master replies with chunk handle and replica locations
- Client caches this information
- Client sends requests directly to chunkservers
This metadata caching eliminates most client-master interactions. Clients can cache chunk locations for multi-TB working sets.
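The translation step is simple arithmetic over the fixed chunk size, and the cache sits entirely on the client. Below is a minimal Python sketch of that path; the names (Client, master.lookup, chunkserver_rpc) are assumptions for illustration, since the actual GFS client library was never published.

```python
# Minimal sketch of the client-side lookup path. The names below (Client,
# master.lookup, chunkserver_rpc) are assumptions for illustration; the real
# GFS client library is not public.
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunk size

class Client:
    def __init__(self, master):
        self.master = master      # assumed RPC stub: lookup(filename, chunk_index)
        self.cache = {}           # (filename, chunk_index) -> (chunk_handle, replicas)

    def locate(self, filename, offset):
        chunk_index = offset // CHUNK_SIZE
        key = (filename, chunk_index)
        if key not in self.cache:
            # One master round trip serves all later operations on this chunk.
            self.cache[key] = self.master.lookup(filename, chunk_index)
        return self.cache[key]

    def read(self, filename, offset, length, chunkserver_rpc):
        chunk_handle, replicas = self.locate(filename, offset)
        # Data never flows through the master: talk to a replica directly.
        return chunkserver_rpc(replicas[0], chunk_handle, offset % CHUNK_SIZE, length)
```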
Chunk Size
GFS uses 64 MB chunks, much larger than typical file system block sizes. Each chunk replica is stored as a plain Linux file on a chunkserver and extended only as needed through lazy space allocation.
The large chunk size offers significant advantages:
Reduced Master Interaction: Since applications mostly read and write large files sequentially, multiple operations on the same chunk require only one initial request for chunk location information.
Persistent Connections: Clients can reduce network overhead by maintaining persistent TCP connections to chunkservers over extended periods.
Smaller Metadata: Keeping metadata in memory becomes feasible when each 64 MB chunk requires less than 64 bytes of metadata. This enables the master to perform fast periodic scans for garbage collection, re-replication, and load balancing.
The primary disadvantage—hot spots when many clients access the same small file—proved minimal in practice. Google’s workloads consist primarily of large multi-chunk files read sequentially.
Metadata Management
The master stores three major types of metadata:
- File and chunk namespaces
- Mapping from files to chunks
- Locations of each chunk’s replicas
All metadata resides in the master's memory. The first two types are kept persistent by logging mutations to an operation log stored on the master's local disk and replicated on remote machines. Using a log lets the master update its state simply and reliably, without risking inconsistencies if the master crashes.
In-Memory Metadata
Storing metadata in memory makes master operations fast and enables efficient periodic scanning for chunk garbage collection, re-replication during chunkserver failures, and chunk migration for load balancing.
Memory requirements remain modest: less than 64 bytes per 64 MB chunk, and less than 64 bytes per file using prefix compression for filenames. Adding memory to the master is a small price for the simplicity, reliability, performance, and flexibility gained.
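A quick back-of-the-envelope check of those figures, sketched below. It assumes only the 64 MB / 64-byte ratios above plus the roughly 300 TB cluster size reported later in this article, and it ignores replication and per-file namespace entries.

```python
# Back-of-the-envelope metadata footprint: 64 MB chunks, <64 bytes of chunk
# metadata each. The 300 TB figure is the raw disk size of the largest 2003
# clusters, so this ignores replication and overstates the chunk count;
# either way the result fits easily in one machine's memory.
CHUNK_SIZE = 64 * 1024 * 1024
BYTES_PER_CHUNK = 64

disk_bytes = 300 * 1024**4                      # ~300 TB
chunks = disk_bytes // CHUNK_SIZE               # ~4.9 million chunks
metadata_mib = chunks * BYTES_PER_CHUNK / 1024**2
print(f"{chunks:,} chunks -> ~{metadata_mib:.0f} MiB of chunk metadata")
# 4,915,200 chunks -> ~300 MiB of chunk metadata
```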
Chunk Locations
The best way to keep distributed state consistent is to not distribute it at all
By never persisting chunk locations, GFS makes the source of truth (the chunkserver’s disk) the only truth—eliminating an entire class of synchronization bugs.
The master does not persistently store which chunkservers have replicas of each chunk. It polls chunkservers at startup and monitors status through regular HeartBeat messages thereafter.
This design decision eliminates synchronization problems as chunkservers join, leave, fail, and restart—events that happen frequently in clusters with hundreds of servers.
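A sketch of that bookkeeping on the master follows, with illustrative names rather than the real GFS data structures. The location map lives only in memory and is rebuilt from whatever the chunkservers report.

```python
# Sketch of the master's view of chunk locations (illustrative names, not the
# GFS code). The map lives only in memory: a startup poll or HeartBeat report
# simply overwrites whatever a chunkserver claimed before, and a server that
# stops reporting eventually stops appearing.
from collections import defaultdict

class LocationMap:
    def __init__(self):
        self.locations = defaultdict(set)   # chunk_handle -> {chunkserver_id}

    def handle_report(self, chunkserver_id, chunk_handles):
        # Drop this server's previous claims, then record its current ones.
        for holders in self.locations.values():
            holders.discard(chunkserver_id)
        for handle in chunk_handles:
            self.locations[handle].add(chunkserver_id)

    def replicas_of(self, chunk_handle):
        return set(self.locations[chunk_handle])
```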
Operation Log
The operation log contains a historical record of critical metadata changes and serves as the logical timeline defining the order of concurrent operations. Files, chunks, and their versions are uniquely identified by the logical times at which they were created.
Since the operation log is critical, GFS:
- Replicates it on multiple remote machines
- Responds to client operations only after flushing log records both locally and remotely
- Batches several log records together before flushing to reduce impact on throughput
The master checkpoints its state whenever the log grows beyond a certain size. Checkpoints use a compact B-tree form directly mappable into memory for namespace lookup without parsing. Recovery requires only loading the latest checkpoint and replaying subsequent log records.
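A rough sketch of that logging discipline is shown below, under assumed interfaces: the local log is an open file, and each remote replica is a stand-in object with append() and flush() methods. Checkpoint writing itself is omitted.

```python
# Sketch of the operation-log discipline, with stand-in interfaces. A mutation
# may be acknowledged to the client only after flush() returns, and records
# are batched so a single flush covers many mutations.
import json

class OperationLog:
    def __init__(self, local_file, remote_replicas, batch_size=64):
        self.local = local_file            # file object opened for append
        self.remotes = remote_replicas     # assumed objects with append()/flush()
        self.pending = []
        self.batch_size = batch_size

    def append(self, record):
        self.pending.append(json.dumps(record))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        batch = "\n".join(self.pending) + "\n"
        self.local.write(batch)
        self.local.flush()                 # stable locally
        for replica in self.remotes:
            replica.append(batch)
            replica.flush()                # stable on the remote copies
        self.pending.clear()
        # Only now may the master respond to the clients whose mutations were
        # in this batch; recovery loads the newest checkpoint and replays the
        # log records written after it.
```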
Consistency Model
GFS implements a relaxed consistency model that supports highly distributed applications while remaining simple to implement. Understanding these guarantees is essential for building applications on GFS.
File Namespace Mutations
File namespace mutations (creation, deletion) are atomic. The master’s namespace locking guarantees atomicity and correctness. The operation log defines a global total order of these operations.
File Region States
After a data mutation, a file region can be in one of several states:
Consistent: All clients always see the same data, regardless of which replicas they read from.
Defined: Region is consistent AND clients see what the mutation wrote in its entirety.
Inconsistent: Different clients may see different data at different times.
```mermaid
graph TD
    M[Mutation Type]
    M --> W[Write]
    M --> RA[Record Append]
    W --> WS{Success?}
    WS -->|"Yes, Serial"| D1[Defined]
    WS -->|"Yes, Concurrent"| U1[Undefined but Consistent]
    WS -->|"No"| I1[Inconsistent]
    RA --> RAS{Success?}
    RAS -->|"Yes"| D2[Defined Interspersed with Inconsistent]
    RAS -->|"No"| I2[Inconsistent]
    style D1 fill:#90EE90
    style D2 fill:#90EE90
    style U1 fill:#FFD700
    style I1 fill:#FFB6C1
    style I2 fill:#FFB6C1
```
Serial Write Success: Region is defined—all clients see what the mutation wrote.
Concurrent Writes: Region is undefined but consistent—all clients see the same data, but it may contain mingled fragments from multiple mutations.
Failed Mutation: Region is inconsistent—different clients may see different data.
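The same classification, written out as a plain function for readability. This merely restates the diagram; it is not an API that GFS exposes.

```python
# The decision table above as a plain function; purely a restatement of the
# diagram, not part of any GFS interface.
def region_state(mutation, succeeded, concurrent):
    if not succeeded:
        return "inconsistent"
    if mutation == "write":
        return "consistent but undefined" if concurrent else "defined"
    if mutation == "record append":
        # Successful appends are defined, but padding and duplicates in
        # between leave interspersed inconsistent regions.
        return "defined interspersed with inconsistent"
    raise ValueError(f"unknown mutation type: {mutation}")
```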
Record Append
Giving up control over where data lands can paradoxically give you better concurrency guarantees
By letting GFS choose write offsets, record append enables lock-free concurrent writes—trading positional determinism for parallelism.
GFS's record append operation guarantees that the data is appended atomically at least once, even in the presence of concurrent mutations. The offset at which the data is written is chosen by GFS (not the application) and returned to the client.
This enables multiple clients to append to the same file concurrently without additional locking, perfect for implementing producer-consumer queues and many-way merges. GFS may insert padding or record duplicates between records—regions occupied by these are considered inconsistent but typically dwarfed by user data.
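From the application's side, using record append is mostly a matter of retrying until an attempt succeeds, which is exactly where duplicates come from. A hedged sketch, with record_append standing in for the unpublished client call:

```python
# How an application might call record append (a sketch; record_append is an
# assumed stub that returns the offset GFS chose or raises IOError on
# failure). Retrying a failed attempt is what produces the occasional
# duplicate that readers must tolerate.
def append_record(gfs_file, payload, record_append, max_retries=5):
    for _ in range(max_retries):
        try:
            return record_append(gfs_file, payload)   # GFS picks the offset
        except IOError:
            continue   # an earlier attempt may still have landed; just retry
    raise IOError("record append failed after retries")
```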
Maintaining Guarantees
GFS achieves its consistency guarantees through:
- Ordered Application: Applying mutations to chunks in the same order on all replicas
- Version Numbers: Using chunk version numbers to detect stale replicas that missed mutations while their chunkserver was down
- Checksumming: Detecting data corruption through regular checksums
- Re-replication: Restoring data from valid replicas as soon as problems surface
Stale replicas are never involved in mutations or given to clients requesting chunk locations. They’re garbage collected at the earliest opportunity.
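A sketch of the version bookkeeping behind that rule, with assumed field and method names, is shown below: the master bumps a chunk's version when it grants a new lease, so any replica still reporting an older version is known to have missed mutations.

```python
# Sketch of per-chunk version bookkeeping (field and method names are
# assumptions). A replica reporting an older version missed mutations while
# its chunkserver was down, so it is excluded from reads and mutations and
# left for garbage collection.
class ChunkRecord:
    def __init__(self, handle):
        self.handle = handle
        self.version = 0
        self.reported = {}                 # chunkserver_id -> last reported version

    def grant_lease(self):
        self.version += 1                  # new mutations carry this version
        return self.version

    def note_report(self, chunkserver_id, reported_version):
        self.reported[chunkserver_id] = reported_version

    def live_replicas(self):
        return [cs for cs, v in self.reported.items() if v == self.version]
```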
Application Implications
GFS applications accommodate the relaxed consistency model through techniques already needed for other purposes:
Rely on Appends: Applications mutate files by appending rather than overwriting. Writers generate files from beginning to end, then atomically rename to a permanent name after writing all data. Checkpoints may mark how much has been successfully written. Readers verify and process only up to the last checkpoint.
Checkpointing: Allows writers to restart incrementally and prevents readers from processing incomplete data. Checkpoints may include application-level checksums.
Self-Validating Records: Records contain checksums so readers can verify validity. Readers identify and discard padding and fragments. If occasional duplicates are intolerable, applications filter them using unique identifiers in records.
This approach—relying on appends, checkpointing, and self-validating records—makes writes far more efficient and resilient to application failures than random writes.
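One plausible framing for such self-validating records is sketched below. The format (magic word, length, record id, CRC32) is an assumption for illustration, not the format Google used; it shows how a reader can skip padding and torn fragments and filter duplicates by record id.

```python
# Illustrative self-validating record framing (the layout is an assumption,
# not Google's format): a magic word, payload length, unique record id for
# duplicate filtering, and a CRC32 over the payload.
import struct, zlib

MAGIC = 0x5245435F                       # arbitrary marker; distinguishes records from padding
HEADER = struct.Struct("<IIQI")          # magic, payload length, record id, CRC32

def encode(record_id: int, payload: bytes) -> bytes:
    return HEADER.pack(MAGIC, len(payload), record_id, zlib.crc32(payload)) + payload

def decode_stream(data: bytes):
    seen, pos = set(), 0
    while pos + HEADER.size <= len(data):
        magic, length, record_id, crc = HEADER.unpack_from(data, pos)
        payload = data[pos + HEADER.size : pos + HEADER.size + length]
        if magic == MAGIC and len(payload) == length and zlib.crc32(payload) == crc:
            if record_id not in seen:    # drop duplicates left by append retries
                seen.add(record_id)
                yield record_id, payload
            pos += HEADER.size + length
        else:
            pos += 1                     # padding or a torn fragment: resync byte by byte

# Example: record 2 was appended twice (a retry); the reader sees it once.
blob = b"".join(encode(i, f"row {i}".encode()) for i in (1, 2, 2, 3))
print([rid for rid, _ in decode_stream(blob)])   # [1, 2, 3]
```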
Production Deployment
By 2003, multiple GFS clusters were deployed for different purposes at Google. The largest clusters had:
- Over 1000 storage nodes
- Over 300 TB of disk storage
- Hundreds of concurrent clients on distinct machines
The system successfully provided the storage platform for data generation and processing used by Google services, as well as research and development efforts requiring large datasets.
The lessons from GFS influenced a generation of distributed storage systems, including Hadoop’s HDFS, and demonstrated that matching system design to actual workload characteristics often matters more than adhering to traditional abstractions.