Overview of Ceph: A Scalable Distributed File System
Ceph is a high-performance distributed file system known for its excellent performance, reliability, and scalability. It decouples metadata operations from data operations, distributes complexity by leveraging the intelligence present in OSDs, and uses a highly adaptive metadata cluster architecture. Ceph combines separated metadata management and data storage, dynamic distributed metadata management, and reliable distributed object storage. Clients perform file I/O by communicating directly with OSDs, and Ceph provides a range of striping strategies for mapping file data onto objects.
Presentation Transcript
Ceph: A Scalable, High-Performance Distributed File System
2020. 04. 27.
Young-Yun Baek, Dept. of Computer Engineering, Dankook University
Contents
- System Overview
- Client Operation
- Dynamically Distributed Metadata
- Distributed Object Storage
- Performance and Scalability Evaluation
System Overview
A distributed file system that provides excellent performance and reliability while promising unparalleled scalability.
- Performance: decouples metadata and data operations.
- Reliability: leverages the intelligence present in OSDs to distribute complexity.
- Scalability: utilizes a highly adaptive distributed metadata cluster architecture that dramatically improves scalability.
Overview diagram: Performance, Reliability, Scalability
Clients, Metadata Cluster, Object Storage Devices (OSD Cluster)
Figure 1: System architecture. Clients perform file I/O by communicating directly with OSDs. Each process can either link directly to a client instance, or interact with a mounted file system.
System Overview
- Decoupled Data and Metadata: Ceph maximizes the separation of file system metadata management from the storage of file data.
- Dynamic Distributed Metadata Management: Ceph utilizes a novel metadata cluster architecture based on Dynamic Subtree Partitioning.
- Reliable Autonomic Distributed Object Storage: Ceph delegates responsibility for data migration, replication, failure detection, and failure recovery to the cluster of OSDs.
Client Operation
- File I/O and Capabilities: Ceph generalizes a range of striping strategies to map file data onto a sequence of objects. Object replicas are then assigned to OSDs using CRUSH, a globally known mapping function (see the striping sketch after this slide).
- Client Synchronization: the synchronization model retains its simplicity by providing correct read-write and shared-write semantics between clients via synchronous I/O.
- Namespace Operations: both read operations and updates (e.g., readdir, stat, unlink, chmod) are synchronously applied by the MDS.
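Striping in the paper is governed by per-file layout parameters; the following is a minimal sketch, assuming illustrative parameter values and a hypothetical helper name, of how a file byte offset could be mapped to an object index and an offset within that object.

```python
# Sketch of RAID-0-style striping of file data onto a sequence of objects.
# STRIPE_UNIT, STRIPE_COUNT, and OBJECT_SIZE are illustrative values, not
# Ceph defaults; map_file_offset is a hypothetical helper.

STRIPE_UNIT = 64 * 1024          # bytes per stripe unit
STRIPE_COUNT = 4                 # stripe units per stripe (objects per set)
OBJECT_SIZE = 4 * 1024 * 1024    # bytes per object (multiple of STRIPE_UNIT)

def map_file_offset(offset):
    """Map a file byte offset to (object index, offset inside that object)."""
    su_per_object = OBJECT_SIZE // STRIPE_UNIT
    stripe_no = offset // (STRIPE_UNIT * STRIPE_COUNT)   # which full stripe
    stripe_pos = offset % (STRIPE_UNIT * STRIPE_COUNT)   # position within it
    su_index = stripe_pos // STRIPE_UNIT                 # which unit in stripe
    su_offset = stripe_pos % STRIPE_UNIT                 # offset in that unit
    object_set = stripe_no // su_per_object              # which set of objects
    object_no = object_set * STRIPE_COUNT + su_index     # global object index
    obj_offset = (stripe_no % su_per_object) * STRIPE_UNIT + su_offset
    return object_no, obj_offset

print(map_file_offset(5 * 1024 * 1024))   # -> (0, 1310720) with these values
```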
Dynamically Distributed Metadata
- File and directory metadata in Ceph is very small, consisting almost entirely of directory entries and inodes.
- Unlike conventional file systems, no file allocation metadata is necessary: object names are constructed from the inode number, and objects are distributed to OSDs using CRUSH (see the naming sketch below).
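Because there is no allocation table, a client can name any object of a file directly from the inode number and the object index. A minimal sketch, assuming a hexadecimal "inode.objectno" format (the exact naming format is an assumption here, not taken from the slides):

```python
# Sketch: object names derived from the inode number plus the object index,
# so no per-file allocation table is needed. The "%x.%08x" format is an
# illustrative assumption.

def object_name(ino, object_no):
    return "%x.%08x" % (ino, object_no)

print(object_name(0x10000003456, 20))   # -> "10000003456.00000014"
```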
Dynamically Distributed Metadata
- Metadata Storage: a set of large, bounded, lazily flushed journals allows each MDS to quickly stream its updated metadata to disk in an efficient and distributed manner.
- Dynamic Subtree Partitioning: the MDS cluster is based on a dynamic subtree partitioning strategy. Each MDS measures the popularity of metadata within the directory hierarchy using counters with an exponential time decay. Any operation increments the counter on the affected inode and all of its ancestors up to the root directory, giving each MDS a weighted tree describing the recent load distribution (a decay-counter sketch follows this slide).
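A minimal sketch of the popularity counters described above: each counter decays exponentially with a configurable half-life, and an operation on an inode also bumps every ancestor up to the root. The half-life value and class names are illustrative assumptions.

```python
import math, time

class DecayCounter:
    """Popularity counter with exponential time decay (illustrative half-life)."""
    def __init__(self, half_life=5.0):
        self.rate = math.log(2) / half_life   # decay constant
        self.value = 0.0
        self.last = time.monotonic()

    def _decay(self):
        now = time.monotonic()
        self.value *= math.exp(-self.rate * (now - self.last))
        self.last = now

    def hit(self, weight=1.0):
        self._decay()
        self.value += weight

    def get(self):
        self._decay()
        return self.value

class Inode:
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
        self.popularity = DecayCounter()

def record_op(inode):
    """Any metadata operation bumps the inode and all ancestors up to the root."""
    node = inode
    while node is not None:
        node.popularity.hit()
        node = node.parent

# usage: one operation on /home/alice bumps alice, home, and the root
root = Inode("/"); home = Inode("home", root); alice = Inode("alice", home)
record_op(alice)
print(root.popularity.get(), home.popularity.get(), alice.popularity.get())
```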
MDS load values are periodically compared, and appropriately sized subtrees of the directory hierarchy are seamlessly migrated to keep the workload evenly distributed (see the balancing sketch below). The resulting subtree-based partition is kept coarse to minimize prefix replication overhead and to preserve locality.
Figure 2: Ceph dynamically maps subtrees of the directory hierarchy to metadata servers based on the current workload. Individual directories are hashed across multiple nodes only when they become hot spots.
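How the periodic load comparison could drive subtree migration is sketched below; the 20% imbalance threshold and the greedy choice of subtree are assumptions for illustration, not the paper's actual balancing heuristic.

```python
# Sketch: periodic MDS load balancing. 'loads' maps an MDS id to its decayed
# load; 'subtrees' maps an MDS id to (subtree path, subtree load) pairs.

def rebalance(loads, subtrees):
    busiest = max(loads, key=loads.get)
    idlest = min(loads, key=loads.get)
    target = sum(loads.values()) / len(loads)
    if loads[busiest] <= 1.2 * target:     # assumed imbalance threshold
        return None                        # cluster already roughly balanced
    excess = loads[busiest] - target       # how much load to shed
    # pick the largest exportable subtree whose load fits within the excess
    candidates = [s for s in subtrees[busiest] if s[1] <= excess]
    if not candidates:
        return None
    path, load = max(candidates, key=lambda s: s[1])
    return (path, busiest, idlest)         # migrate this subtree busiest -> idlest

print(rebalance({0: 10.0, 1: 2.0},
                {0: [("/home/alice", 4.0), ("/tmp", 1.0)], 1: [("/var", 0.5)]}))
# -> ('/home/alice', 0, 1)
```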
Dynamically Distributed Metadata
- Traffic Control: Ceph uses its knowledge of metadata popularity to provide a wide distribution for hot spots only when needed, without incurring the associated overhead and loss of directory locality in the general case. The contents of heavily read directories are selectively replicated across multiple nodes to distribute load.
Distributed Object Storage
- Data Distribution with CRUSH: Ceph's Reliable Autonomic Distributed Object Store (RADOS) achieves linear scaling in both capacity and aggregate performance by delegating replication, failure detection, and recovery to OSDs in a distributed fashion.
- Ceph first maps objects into placement groups (PGs) using a simple hash function, with an adjustable bit mask to control the number of PGs. Placement groups are then assigned to OSDs using CRUSH (Controlled Replication Under Scalable Hashing); a mapping sketch follows this slide.
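A minimal sketch of the two-step mapping described above; the hash choice and the stand-in for CRUSH are assumptions for illustration (CRUSH itself walks a weighted storage hierarchy rather than taking a simple modulus).

```python
import hashlib

PG_BITS = 7   # adjustable bit mask -> 2**7 = 128 placement groups (illustrative)

def object_to_pg(object_name):
    """Map an object to a placement group with a simple hash plus a bit mask."""
    h = int.from_bytes(hashlib.sha1(object_name.encode()).digest()[:8], "big")
    return h & ((1 << PG_BITS) - 1)

def pg_to_osds(pg_id, cluster_map, replicas=3):
    """Stand-in for CRUSH: deterministically pick an ordered list of OSDs for
    a PG from the cluster map (real CRUSH uses a hierarchical, weighted map)."""
    osds = sorted(cluster_map)
    start = pg_id % len(osds)
    return [osds[(start + i) % len(osds)] for i in range(replicas)]

pg = object_to_pg("10000003456.00000014")
print(pg, pg_to_osds(pg, cluster_map=range(10)))
```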
- File to objects: uses the inode number (from the MDS)
- Objects to PGs: uses the object ID and the number of PGs
- PGs to OSDs: uses the PG ID and the cluster map (from the OSDs)
Figure 3: Files are striped across many objects, grouped into placement groups (PGs), and distributed to OSDs via CRUSH, a specialized replica placement function.
Distributed Object Storage
- Replication: data is replicated in terms of placement groups, each of which is mapped to an ordered list of n OSDs.
- Data Safety: RADOS allows Ceph to realize both low-latency updates for efficient application synchronization and well-defined data safety semantics.
- Data is replicated to the other nodes in the placement group: the client writes to the PG primary (determined by the CRUSH function), and the primary OSD replicates the write to the other OSDs in the PG.
- The update is committed only after the slowest replica has finished (a replication sketch follows this slide).
Figure 4: RADOS responds with an ack after the write has been applied to the buffer caches on all OSDs replicating the object. Only after it has been safely committed to disk is a final commit notification sent to the client.
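A minimal sketch of the primary-copy flow in Figure 4, using hypothetical class and method names: an ack is sent once every replica has applied the write to its in-memory cache, and the final commit only once every replica reports it durable on disk.

```python
# Sketch of primary-copy replication with separate ack/commit notifications.
# ReplicaOSD, apply_to_cache, and commit_to_disk are hypothetical stand-ins.

class ReplicaOSD:
    def __init__(self, osd_id):
        self.osd_id = osd_id
        self.cache, self.disk = {}, {}

    def apply_to_cache(self, obj, data):
        self.cache[obj] = data            # fast, in-memory apply
        return True

    def commit_to_disk(self, obj):
        self.disk[obj] = self.cache[obj]  # slow, durable write
        return True

def primary_write(primary, replicas, obj, data):
    """Primary forwards the write to every replica in the PG (itself included)."""
    group = [primary] + replicas
    # Phase 1: ack once the write is applied to the buffer cache on every OSD.
    if all(osd.apply_to_cache(obj, data) for osd in group):
        print("ack -> client (applied everywhere, not yet durable)")
    # Phase 2: final commit only after the slowest OSD has committed to disk.
    if all(osd.commit_to_disk(obj) for osd in group):
        print("commit -> client (durable on all replicas)")

primary_write(ReplicaOSD(0), [ReplicaOSD(1), ReplicaOSD(2)],
              "10000003456.00000014", b"hello")
```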
Distributed Object Storage
- Failure Detection: Ceph distributes liveness verification by having each OSD monitor a pseudo-random subset of its peers (a heartbeat sketch follows this slide).
- Recovery and Cluster Updates: the OSD cluster map changes due to OSD failures, recoveries, and explicit cluster changes such as the deployment of new storage.
- Object Storage with EBOFS: each Ceph OSD manages its local object storage with EBOFS, an Extent and B-tree based Object File System.
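A minimal sketch of distributed liveness checking: each OSD monitors a small, pseudo-randomly chosen subset of peers and reports any whose heartbeat has gone stale. The peer count, timeout, and helper names are assumptions; in practice a suspected failure would be reported to the cluster-map authority rather than printed.

```python
import random, time

PEERS_PER_OSD = 3         # illustrative size of the monitored peer subset
HEARTBEAT_TIMEOUT = 10.0  # assumed seconds of silence before suspecting a peer

def monitored_peers(osd_id, all_osds):
    """Each OSD monitors a pseudo-random subset of peers (seeded for determinism)."""
    peers = [o for o in all_osds if o != osd_id]
    return random.Random(osd_id).sample(peers, min(PEERS_PER_OSD, len(peers)))

def check_peers(osd_id, all_osds, last_heartbeat):
    """Return monitored peers whose most recent heartbeat is older than the timeout."""
    now = time.monotonic()
    return [p for p in monitored_peers(osd_id, all_osds)
            if now - last_heartbeat.get(p, 0.0) > HEARTBEAT_TIMEOUT]

# usage: OSD 0 checks its peers; OSD 5 has never sent a heartbeat
osds = list(range(10))
heartbeats = {p: time.monotonic() for p in osds if p != 5}
print(check_peers(0, osds, heartbeats))   # lists any suspected-down monitored peers
```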
Performance and Scalability Evaluation
- Data Performance: OSD throughput [Figure 5, Figure 6], write latency [Figure 7], data distribution and scalability [Figure 8]
- Metadata Performance: metadata update latency [Figure 9a], metadata read latency [Figure 9b], metadata scaling [Figure 10, Figure 11]
Figure 5: Per-OSD write performance. The horizontal line indicates the upper limit imposed by the physical disk. Data replication has minimal impact on OSD throughput, although if the number of OSDs is fixed, n-way replication reduces total effective throughput by a factor of n because replicated data must be written to n OSDs.
Figure 6: Performance of EBOFS compared to general-purpose file systems. Although small writes suffer from coarse locking in our prototype, EBOFS nearly saturates the disk for writes larger than 32 KB. Since EBOFS lays out data in large extents when it is written in large increments, it has significantly better read performance.
Figure 7: Write latency for varying write sizes and replication. More than two replicas incurs minimal additional cost for small writes because replicated updates occur concurrently. For large synchronous writes, transmission times dominate. Clients partially mask that latency for writes over 128 KB by acquiring exclusive locks and asynchronously flushing the data.
Figure 8: OSD write performance scales linearly with the size of the OSD cluster until the switch is saturated at 24 OSDs. CRUSH and hash performance improves when more PGs lower variance in OSD utilization.
Figure 9: The use of a local disk lowers the write latency by avoiding the initial network round-trip. Reads benefit from caching as well, while readdirplus or relaxed consistency eliminates MDS interaction for stats following readdir.
Figure 10: Per-MDS throughput under a variety of workloads and cluster sizes. As the cluster grows to 128 nodes, efficiency drops no more than 50% below perfectly linear (horizontal) scaling for most workloads, allowing vastly improved performance over existing systems.
Figure 11: Average latency versus average per-MDS throughput for different cluster sizes (makedirs workload).
Reference
- https://www.slideserve.com/desmond/ceph-a-scalable-high-performance-distributed-file-system-powerpoint-ppt-presentation
- https://www.kdata.or.kr/info/info_04_view.html?field=&keyword=&type=techreport&page=116&dbnum=146422&mode=detail&type=techreport
Q & A