Understanding Distributed File Systems
A distributed file system manages files across multiple machines on a network, providing users with a seamless experience as if they were using a local file system. This system abstracts details such as file locations, replicas, and system failures from the user, ensuring efficient and reliable file management.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
14-736: DISTRIBUTED SYSTEMS LECTURE 13 * SPRING 2020 * KESDEN
WHAT ARE DISTRIBUTED FILE SYSTEMS? Maybe we d better start with what are file systems? Maybe we d better start with what are files? A file is a unit of data organized by the user. The data within a file isn't necessarily meaningful to the operating system. Instead, a file is created by the user and meaningful to the user. It is the job of the file system to maintain this unit, without understanding or caring why. Contrast this with a record in a database, which is described by a schema and upon which the database enforces constraints
WHAT ARE DISTRIBUTED FILE SYSTEMS? Okay. We got files. So, what are file systems? A file system is a service responsible for managing files. File systems typically implement persistent storage, although volatile file systems are also possible (/proc is such an example).
WHAT DOES IT MEAN TO MANAGE FILES? Name files in meaningful ways. The file system should allow a user to locate a file using a human-friendly name. In many cases, this is done using a hierarchical naming scheme like those we know and love in Windows, UNIX, Linux, &c. Access files. Create, destroy, read, write, append, truncate, keep track of position within, &c Physical allocation. Decide where to store things. Reduce fragmentation. Keep related things close together, &c. Security and protection. Ensure privacy, prevent accidental (or malicious) damage, &c. Resource administration. Enforce quotas, implement priorities, &c.
WHAT ARE DISTRIBUTED FILE SYSTEMS? Got files. Got file systems. Well, it is a type of file system, which means that it is a file system. The distinction isn't what it is or what it does but the environment in which it does it. A traditional file system typically has all of the users and all of the storage resident on the same machine. A distributed file system typically operates in an environment where the data may be spread out across many, many hosts on a network -- and the users of the system may be equally distributed.
GOAL OF A GENERAL PURPOSE DISTRIBUTED FILE SYSTEM (DFS) A general purpose distributed file system should appear to the user as if a local file system Users shouldn t have to worry about details such as Where files are located How to identify the locations How many replicas there are If the replicas are up to date What happens if part of the system fails, e.g. partitioning, host failure, etc.
WHY MIGHT WE WANT A GENERAL PURPOSE DISTRIBUTED FILE SYSTEM? More storage More throughput More robustness from host and storage failure More robust from environmental hazards Faster access, e.g. closer to users Etc.
RECALL, FOR EXAMPLE, UNIX- STYLE FILE SYSTEMS
UNIX-LIKE FILE SYSTEMS, CONTINUED Virtual File System (VFS) File system, inode are base classes Known as Virtual file system (VFS) and vnode Specific implementation is derived type Implementation may be legit OO, e.g. C++ Or, implementation may, for example, ugly C with unions, etc. Implementation abstracted from interface Distributed File Systems can be just an alternative implementation for storage
GENERAL PURPOSE DISTRIBUTED FILE SYSTEM Maintain same naming system Mount in UNIX Map in Windows, etc Just change storage layer to use network and manage errors
NETWORK FILE SYSTEM (NFS) EARLY VERSIONS Not a DFS Simply central storage accessed via network Client made requests on per-block basis Stateless server No client caching Turned out to be too painful Clients cached anyway Just accepted staleness for a while since no way to validate Evolved over time into DFS with support for caching
ANDREW FILE SYSTEM (AFS) OUR FRIEND Stateful server Callback mechanism Invalidates client caches upon change, but in use copies unaffected Whole-file semantics Modified to move blocks in version 2.x (when it left CMU and went to IBM)
CODA Extension of AFS Project Added replication Added weakly connected mode: One server replicates on behalf of client Added disconnected mode: Uploaded and resolved by client later Hoard demon: Pull whole files needed for later Spy: Figure out which files to hoard. Client-side resolution Insert CVV discussion again here
ASIDE: RAIDS Redundant Array of Inexpensive Disks Original goal: Use cheap disks for robust bulk storage Redundant Array of Independent Disks Disks aren t really differentiated by reliability anymore Goal: More robust Goal: Larger volume Goal: Higher throughput Big idea: Spread data over multiple drives in parallel to get higher throughput while using parity for robustness.
ASIDE: RAID LEVELS Raid 0: Stripe data across drives for improved throughput Raid 4: Blockwise parity improves performance No extra redundancy, elevated risk Raid 5: Rotating parity block among disks relieves bottleneck Raid 1: Mirroring, parallel data to multiple devices for robustness Raid 6: Raid5 + dual parity. No extra throughput Supports up to 2 HDD failures Raid 2: Use Hamming codes for parity Slow rebuilds Requires log2 parity bits Really expensive Raid 10: Raid 1 + 0 Raid 3: Bitwise parity on parity disk. Raid 50: Raid 5 + 0 Requires only one parity disk for N storage disks. Bitwise parity is slow, dedicated parity disk is bottleneck
LUSTRE: LINUX CLUSTER Developed by Peter Braam, at the time a researcher at CMU Cluster File Systems, Inc. Sun Oracle Open Used for high-performance clusters Divides and conquers like RAIDs for throughput Pairings for reliability
LUSTRE: LINUX CLUSTER http://wiki.lustre.org/File:Lustre_File_System_Overview_(DNE)_lowres_v1.png
MOGILEFS Think about sharing sites, e.g. video, photo, etc Uploads No editing Whole file delivery, no random access Managed by software, not humans No need for hierarchical namespace Protections enforced by application, not FS Goal: Fast delivery to clients
MOGILEFS Replicated storage Class determines number of replicas HTTP + MySQL for portability Namespaces vs directory tree Think albums, etc. Portable: User-level Perl implementation Any lower-level file system
HDFS ARCHITECTURE BASED UPON: HTTP://HADOOP.APACHE.ORG/DOCS/R3.0.0-ALPHA1/HADOOP- PROJECT-DIST/HADOOP-HDFS/HDFSDESIGN.HTML
HDFS ARCHITECTURE GREGORY KESDEN, CSE-291 (STORAGE SYSTEMS) FALL 2017 BASED UPON: HTTP://HADOOP.APACHE.ORG/DOCS/R3.0.0-ALPHA1/HADOOP-PROJECT-DIST/HADOOP-HDFS/HDFSDESIGN.HTML
ASSUMPTIONS At scale, hardware failure is the norm, not the exception Continued availability via quick detection and work-around, and eventual automatic rull recovery is key Applications stream data for batch processing Not designed for random access, editing, interactive use, etc Emphasis is on throughput, not latency Large data sets Tens of millions of files many terabytes per instance
ASSUMPTIONS, CONTINUED Simple Coherency Model = Lower overhead, higher throughput Write Once, Read Many (WORM) Gets rid of most concurrency control and resulting need for slow, blocking coordination Moving computation is cheaper than moving data The data is huge, the network is relatively slow, and the computation per unit of data is small. Moving (Migration) may not be necessary mostly just placement of computation Portability, even across heterogeneous infrastructure At scale, things can be different, fundamentally, or as updates roll-out
NAMENODE Master-slave architecture 1x NameNode (coordinator) Manages name space, coordinates for clients Directory lookups and changes Block to DataNode mappings Files are composed of blocks Blocks are stored by DataNodes Note: User data never comes to or from a NameNode. The NameNode just coordinates
DATANODE Many DataNodes (participants) One per node in the cluster. Represent the node to the NameNode Manage storage attached to node Handles read(), write() requests, etc for clients Store blocks as per NameNode Create and Delete blocks, Replicate Blocks
NAMESPACE Hierarchical name space Directories, subdirectories, and files Managed by NameNode Maybe not needed, but low overhead Files are huge and processed in entirety Name to block lookups are rare Remember, model is streaming of large files for processing Throughput, not latency, is optimized
ACCESS MODEL (Just to be really clear) Read anywhere Streaming is in parallel across blocks across DataNodes Write only at end (append) Delete whole file (rare) No edit/random write, etc
REPLICATION Blocks are replicated by default Blocks are all same size (except tail) Fault tolerance Opportunities for parallelism NameNode managed replication Based upon heartbeats, block reports (per dataNode report of available blocks), and replication factor for file (per file metadata)
LOCATION AWARENESS Site + 3-Tier Model is default
REPLICA PLACEMENT AND SELECTION Assume bandwidth within rack greater than outside of rack Default placement 2 nodes on same rack, one different rack (Beyond 3? Random, below replicas/rack limit) Fault tolerance, parallelism, lower network overhead than spreading farther Read from closest replica (rack, site, global)
FILESYSTEM METADATA PERSISTENCE EditLog keeps all metadata changes. Stored in local host FS FSImage keeps all FS metadata Also stored in local host FS FSImage kept in memory for use Periodically (time interval, operation count), merges in changes and checkpoints Can truncate EditLog via checkpoint Multiple copies of files can be kept for robustness Kept in sync Slows down, but okay given infrequency of metadata changes.
FAILURE OF DATANODES Disk Failure, Node Failure, Partitioning Detect via heartbeats (long delay, by default), blockmaps, etc Re-Replicate Corruption Detectable by client via checksums Client can determine what to do (nothing is an option) Metadata
DATABLOCKS, STAGING Data blocks are large to minimize overhead for large files Staging Initial creation and writes are cached locally and delayed, request goes to NameNode when 1st chunk is full. Local caching is intended to support use of memory hierarchy and throughput needed for streaming. Don t want to block for remote end. Replication is from replica to replica, Replication pipeline Maximizes client s ability to stream