
Scalable Distributed Systems Design
Learn about the design assumptions, scalability principles, file system operations, and GFS architecture in building scalable distributed systems. Explore topics like fault tolerance, replication, and consistency protocols for reliable system design.
Presentation Transcript
The Google File System
Some materials are from Danyang Zhuo (Duke) and MIT EECS 6.824
Design assumptions
- Component failures are the norm rather than the exception
- Files are huge (multi-GB files are common)
- Most files are mutated by appending new data rather than overwriting existing data
- Files are typically written once, then read sequentially
  - MapReduce's input: sequential read
  - MapReduce's output: sequential write
- Co-designing applications and the file system API
  - GFS has a relaxed consistency model for simpler design
  - Atomic append allows multiple clients to append to the same file
- High sustained bandwidth is more important than low latency
How to design scalable distributed systems?
- Scalability, performance → sharding across many servers
- Many servers → failures are commonplace
- Fault tolerance → replication
- Many replicas → what consistency protocol to have?
File system API
- Standard operations: Create, Delete, Open, Close, Read, Write
- GFS-specific operations:
  - Snapshot: create a copy of a file or directory tree at low cost using copy-on-write
  - Record append: append a record atomically at least once, at an offset of GFS's choosing
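To make the API concrete, here is a minimal Go sketch of what a GFS-style client interface could look like; the package, type, and method names (Client, File, RecordAppend, and so on) are illustrative assumptions, not the interface from the GFS paper.

```go
// Hypothetical client-facing API for a GFS-like file system.
// Names and signatures are illustrative only.
package gfs

type ChunkHandle uint64

type Client interface {
	// Standard operations.
	Create(path string) error
	Delete(path string) error
	Open(path string) (File, error)

	// Snapshot copies a file or directory tree cheaply via copy-on-write.
	Snapshot(src, dst string) error
}

type File interface {
	ReadAt(p []byte, off int64) (n int, err error)
	WriteAt(p []byte, off int64) (n int, err error)
	Close() error

	// RecordAppend appends data atomically at least once and returns the
	// offset GFS chose for the record.
	RecordAppend(data []byte) (off int64, err error)
}
```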
GFS architecture
- A single coordinator and a set of chunkservers
- Files are divided into fixed-size, large (64 MB) chunks
  - Each chunk has an immutable and globally unique 64-bit chunk handle assigned by the coordinator
  - Each chunk has a version number
  - Each chunk is stored at 3 different chunkservers by default
- Coordinator maintains all file system metadata
  - Namespace, access control information, mapping from files to chunks, locations of chunks
Coordinator data
- File name -> array of chunk handles (non-volatile)
  - Like ZooKeeper, there are no per-directory data structures
- Chunk handle -> list of chunkservers (volatile), version # (non-volatile), primary (volatile), lease expiration (volatile)
- Log + checkpoints (non-volatile)
- We have already seen similar designs in Raft, ZooKeeper, primary-backup replication, and crash-safe file systems
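A rough sketch, in Go, of how this metadata might be laid out in the coordinator's memory; all struct and field names are assumptions for illustration, with comments marking which parts would be persisted via the log and checkpoints and which are volatile and rebuilt by polling chunkservers.

```go
// Illustrative in-memory layout of the coordinator's metadata.
package gfs

import "time"

type ChunkHandle uint64

type ChunkInfo struct {
	Version     uint64    // persisted: needed to detect stale replicas
	Replicas    []string  // volatile: chunkserver addresses, learned by polling at startup
	Primary     string    // volatile: current lease holder, if any
	LeaseExpiry time.Time // volatile: when the primary's lease ends
}

type CoordinatorState struct {
	// persisted: namespace, file name -> ordered list of chunk handles
	Files map[string][]ChunkHandle
	// mixed: see the per-field comments on ChunkInfo
	Chunks map[ChunkHandle]*ChunkInfo
}
```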
Fast coordinator restart
- When the log grows long, the coordinator stores a checkpoint of the filename-to-chunk-handle mapping and other file system data to disk
- Checkpoints and logs are replicated on shadow coordinators
- On restart, the coordinator reads the latest checkpoint and applies the log entries recorded after that checkpoint
- Shadow coordinators provide read access when the coordinator goes down and during restart
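A minimal sketch of checkpoint-plus-log recovery, assuming a simplified operation format; the Op and State types and the recoverState function are hypothetical, and only the replay idea mirrors the slide.

```go
// Rebuild coordinator state from the latest checkpoint plus the log suffix.
package main

import "fmt"

type Op struct {
	Kind string // e.g. "create", "append-chunk" (hypothetical operation kinds)
	File string
	Arg  uint64
}

type State struct {
	Files map[string][]uint64 // file name -> chunk handles
}

func apply(s *State, op Op) {
	switch op.Kind {
	case "create":
		s.Files[op.File] = nil
	case "append-chunk":
		s.Files[op.File] = append(s.Files[op.File], op.Arg)
	}
}

// recoverState replays the log entries written after the checkpoint.
func recoverState(checkpoint State, logSuffix []Op) State {
	for _, op := range logSuffix {
		apply(&checkpoint, op)
	}
	return checkpoint
}

func main() {
	cp := State{Files: map[string][]uint64{"/a": {1, 2}}}
	suffix := []Op{{Kind: "create", File: "/b"}, {Kind: "append-chunk", File: "/b", Arg: 3}}
	fmt.Println(recoverState(cp, suffix).Files) // map[/a:[1 2] /b:[3]]
}
```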
Read workflow
1. Client sends <file name, offset> to the coordinator
2. Coordinator sends <chunk handle, list of chunkservers, version #> to the client; the client caches this information
3. Client picks one chunkserver and sends <chunk handle, offset>
   - Pick the closest server to minimize network overheads
   - Check the version #
4. Chunkserver reads the data from disk and sends the content to the client
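The same workflow as client-side pseudocode in Go; askCoordinator and readFromChunkserver are hypothetical stand-ins for the real RPCs, and only the control flow (cache lookup, coordinator query, chunkserver read) follows the slide.

```go
// Sketch of the client read path; RPC stubs are placeholders.
package main

import "fmt"

const chunkSize = 64 << 20 // 64 MB chunks

type chunkLoc struct {
	Handle  uint64
	Version uint64
	Servers []string // replica addresses
}

// Client-side cache: "file/chunkIndex" -> chunk location, filled in step 2.
var cache = map[string]chunkLoc{}

func read(file string, offset int64, n int) []byte {
	key := fmt.Sprintf("%s/%d", file, offset/chunkSize)

	loc, ok := cache[key]
	if !ok {
		// Steps 1-2: ask the coordinator, then cache the answer.
		loc = askCoordinator(file, offset/chunkSize)
		cache[key] = loc
	}

	// Step 3: pick the closest chunkserver (here simply the first in the list).
	server := loc.Servers[0]

	// Step 4: the chunkserver checks the version # and returns the bytes.
	return readFromChunkserver(server, loc.Handle, loc.Version, offset%chunkSize, n)
}

// Hypothetical RPC stubs so the sketch compiles.
func askCoordinator(file string, chunkIndex int64) chunkLoc {
	return chunkLoc{Handle: 1, Version: 1, Servers: []string{"chunkserver-1:7000"}}
}
func readFromChunkserver(addr string, h, v uint64, off int64, n int) []byte { return nil }

func main() {
	fmt.Println(len(read("/logs/web", 0, 1024))) // 0 with the stub above
}
```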
What about write/append?
- Use primary-backup replication to create a 3-copy replicated system
  - This means clients need to know which chunkserver is the primary
  - The primary will order write requests from multiple clients
- The coordinator needs to pick a chunkserver as the primary. How?
  - Pick the one that is most up to date? How to tell which server is up to date?
  - Is it possible that two chunkservers believe that they are the primary?
  - When can the primary reply to the client?
- Use a lease!
Write control and data flow
- Data is pushed along a chain of chunkservers in a pipelined fashion
  - The goal is to keep each chunkserver's full-duplex link fully utilized
- Data is received by all replicas before the write/append request is sent by the client to the primary
- The primary forwards the request to the replicas and replies to the client after success
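A sketch of this two-step flow, assuming hypothetical helpers sendData and sendWrite; the point is that pushing data along the replica chain is decoupled from the commit request sent to the primary.

```go
// Sketch of GFS-style write flow: pipelined data push, then commit via primary.
package gfswrite

// pushData streams the data to the nearest replica, which forwards it to the
// next replica in the chain while still receiving (pipelining), and so on.
func pushData(chain []string, data []byte) error {
	if len(chain) == 0 {
		return nil
	}
	// Send to the first replica along with the rest of the chain so it can
	// keep forwarding; in a real system this would be a streaming RPC.
	return sendData(chain[0], chain[1:], data)
}

// write performs the two steps from the slide: push data to all replicas,
// then ask the primary to order and apply the mutation.
func write(primary string, secondaries []string, handle uint64, data []byte) error {
	chain := append([]string{primary}, secondaries...)
	if err := pushData(chain, data); err != nil {
		return err
	}
	// The primary assigns an order, applies the mutation, forwards the
	// request to the secondaries, and replies once all of them succeed.
	return sendWrite(primary, handle, data)
}

// Hypothetical RPC stubs so the sketch compiles.
func sendData(addr string, rest []string, data []byte) error     { return nil }
func sendWrite(primary string, handle uint64, data []byte) error { return nil }
```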
A single coordinator
- Can use global knowledge to make chunk placement and replication decisions
- Need to minimize its involvement in reads/writes so that the coordinator does not become a performance bottleneck
- The coordinator writes an operation log on its disk, and the log is replicated to remote machines
A single coordinator
- Holds metadata in memory
  - If a client's request is read-only for metadata, then latency is low
- Periodically scans through its entire state for chunk garbage collection, re-replication when chunkservers fail, and chunk migration to balance disk space and chunk accesses
- When the coordinator starts up, it polls chunkservers for chunk locations
Lease
- The coordinator selects the primary by granting it a 60-second lease
- When a chunkserver holds the lease, it knows that it is the primary for the chunk
- Why a lease?
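A small sketch of how a lease rules out two simultaneous primaries, assuming the coordinator never re-grants a lease before the previous one expires and that clocks are reasonably synchronized; the names here are illustrative, not from the paper.

```go
// Sketch of lease granting on the coordinator and the primary-side check.
package lease

import "time"

const leaseDuration = 60 * time.Second

type chunkLease struct {
	primary string
	expiry  time.Time
}

// grant hands the lease for a chunk to a chunkserver, but only if any
// previous lease has expired (or is being renewed by the same server);
// this is what rules out two servers both believing they are the primary.
func (l *chunkLease) grant(server string, now time.Time) bool {
	if now.Before(l.expiry) && l.primary != server {
		return false // the old primary may still be serving writes
	}
	l.primary = server
	l.expiry = now.Add(leaseDuration)
	return true
}

// isPrimary is what the chunkserver itself checks before ordering writes.
func isPrimary(myExpiry, now time.Time) bool {
	return now.Before(myExpiry)
}
```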
Record append
- No need to specify an offset; the offset is part of the return value of append
- GFS will append the record at least once atomically
- (Slide diagram: a record "ABC" appended on the primary and on backups 1 and 2; because failed appends are retried at a new offset, a replica can end up holding the record more than once)
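A sketch of the client library's retry loop that produces at-least-once semantics; appendOnce is a hypothetical single-attempt RPC, and the duplicates left behind by failed attempts are exactly why replicas can differ at some offsets.

```go
// Sketch of at-least-once record append from the client library's view.
package recordappend

import "errors"

var errRetry = errors.New("append failed at some replica after retries")

// RecordAppend keeps retrying until one attempt succeeds at all replicas.
// The offset returned is chosen by GFS, not by the caller.
func RecordAppend(handle uint64, record []byte) (int64, error) {
	for attempt := 0; attempt < 5; attempt++ {
		off, err := appendOnce(handle, record)
		if err == nil {
			// This offset is defined; earlier failed attempts may have left
			// duplicates or padding at other offsets on some replicas.
			return off, nil
		}
	}
	return 0, errRetry
}

// Hypothetical single-attempt append; fails if any replica fails to apply it.
func appendOnce(handle uint64, record []byte) (int64, error) { return 0, nil }
```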
Consistency model
- A file region is consistent if all clients will always see the same data, regardless of which replicas they read from
- A file region is defined after a file data mutation if it is consistent and clients will see what the mutation writes in its entirety
- Example: Client 1 writes AAA and GFS returns success; Client 2 writes BBB and GFS returns success. Can the file end up as ABA?
Snapshot: make a copy of a file
- Coordinator revokes outstanding leases on the file's chunks to pause writes
- Coordinator records the operation in the log and updates in-memory state
- The new snapshot file points to the same chunks as the old file (copy-on-write)
- On the first write to a chunk C, the coordinator creates a new chunk handle C' and asks each chunkserver holding C to make a copy
- Coordinator grants a lease on the new chunk to one of the replicas
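A sketch of snapshot via copy-on-write reference counting, roughly following the slide: snapshotting only bumps reference counts, and a shared chunk gets a new handle lazily on its first write. The coordinator struct and its methods are illustrative.

```go
// Sketch of copy-on-write snapshots using per-chunk reference counts.
package snapshot

type handle uint64

type coordinator struct {
	files    map[string][]handle
	refCount map[handle]int
	next     handle
}

// snapshot makes dst point at exactly the same chunks as src and bumps the
// reference counts; no chunk data is copied yet.
func (c *coordinator) snapshot(src, dst string) {
	chunks := append([]handle(nil), c.files[src]...)
	for _, h := range chunks {
		c.refCount[h]++
	}
	c.files[dst] = chunks
}

// onWrite is called before the first mutation of a chunk; if the chunk is
// shared it gets a fresh handle, and the chunkservers holding the old chunk
// would be asked to clone the data locally.
func (c *coordinator) onWrite(file string, i int) handle {
	old := c.files[file][i]
	if c.refCount[old] <= 1 {
		return old // not shared, write in place
	}
	c.refCount[old]--
	c.next++
	newH := c.next
	c.refCount[newH] = 1
	c.files[file][i] = newH
	return newH
}
```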
Garbage collection
- Coordinator has the authoritative list of files and file-to-chunk-handle mappings
- On file deletion, the coordinator renames the file and keeps the old copy for (by default) 3 days
- Coordinator periodically scans files for all chunk handles currently in use and passes this information to chunkservers
- Chunkservers delete orphaned chunks
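A sketch of the garbage-collection scan, assuming the coordinator can enumerate live handles and each chunkserver reports the handles it stores; the function names are illustrative.

```go
// Sketch of lazy garbage collection: live set from the coordinator,
// orphan detection on the chunkserver side.
package gc

// liveHandles collects every chunk handle still referenced by some file
// (including recently "deleted" files kept under a hidden name for 3 days).
func liveHandles(files map[string][]uint64) map[uint64]bool {
	live := map[uint64]bool{}
	for _, chunks := range files {
		for _, h := range chunks {
			live[h] = true
		}
	}
	return live
}

// orphans is what a chunkserver would compute after the coordinator tells it
// which of its reported handles are still live; these chunks are safe to delete.
func orphans(reported []uint64, live map[uint64]bool) []uint64 {
	var dead []uint64
	for _, h := range reported {
		if !live[h] {
			dead = append(dead, h)
		}
	}
	return dead
}
```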
Questions to think about
- Why is a single coordinator fine?
  - It is a single point of failure: coordinator replication can solve this issue
  - Load on the single coordinator can be too high: this may require other techniques
- Why is weak consistency fine?
  - It depends on application requirements
  - With library support, applications can create their own checkpoints and checksums
  - Readers can discard padding and fragmented records