Evolution of Ceph's Storage Backends: Lessons Learned

This presentation from the Embedded System Lab examines why building storage backends on top of local file systems is hard, following the evolution of Ceph's storage backend over a decade. It covers the essentials of distributed storage backends, Ceph's distributed storage architecture, the difficulty of getting efficient transactions and fast metadata operations from a local file system, and the design of BlueStore.


Presentation Transcript


  1. Embedded System Lab. File Systems Unfit as Distributed Storage Backends: Lessons from 10 Years of Ceph Evolution. A. Aghayev et al., SOSP 2019. Presented 2020.04.27 by Sopanhapich CHUM (sopanhapich.chum@gmail.com)

  2. Contents: 1. Introduction; 2. Background; 3. Building a Storage Backend on Local File Systems is Hard; 4. BlueStore; 5. Evaluation; 6. Conclusion; 7. References

  3. Introduction. Ceph is a distributed object store and file system designed to provide excellent performance, reliability, and scalability.

  4. Introduction (figure only; no transcript text).

  5. Background: Essentials of Distributed Storage Backends. Distributed file systems aggregate storage space from multiple physical machines into a single unified data store, offering high bandwidth and parallel I/O, horizontal scalability, fault tolerance, and strong consistency.

  6. Background: Ceph Distributed Storage System Architecture. Ceph clients hash objects into placement groups, and the CRUSH function maps each placement group onto a set of OSDs, as sketched below.
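
As a rough illustration of that mapping, here is a minimal sketch in Python. It is not the real CRUSH algorithm (which is a hierarchical, weighted, pseudo-random placement function); the placement-group count, OSD count, and replica count below are made-up parameters for illustration only.

```python
import hashlib

# Hypothetical cluster parameters, for illustration only.
NUM_PGS = 128           # placement groups in the pool
OSDS = list(range(12))  # OSD ids in the cluster
REPLICAS = 3

def object_to_pg(object_name: str) -> int:
    """Hash an object name to a placement group (done on the Ceph client)."""
    h = int.from_bytes(hashlib.md5(object_name.encode()).digest()[:4], "little")
    return h % NUM_PGS

def pg_to_osds(pg: int) -> list[int]:
    """Map a placement group to REPLICAS OSDs.

    Stand-in for CRUSH: deterministic and pseudo-random, but without CRUSH's
    failure domains, weights, or stable remapping when the cluster changes.
    """
    ranked = sorted(OSDS, key=lambda osd: hashlib.md5(f"{pg}-{osd}".encode()).hexdigest())
    return ranked[:REPLICAS]

pg = object_to_pg("rbd_data.1234.0000000000000000")
print(pg, pg_to_osds(pg))   # e.g. which OSDs hold the object's replicas
```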

  7. Background: Evolution of Ceph's Storage Backend (figure only; no transcript text).

  8. Building a Storage Backend on Local File Systems is Hard: Why Is It Hard?

  9. Challenge 1: Efficient Transactions - Leveraging File System Internal Transactions. If a Ceph OSD encountered a fatal event in the middle of a Btrfs internal transaction, the file system was left partially updated, since the transaction system calls offered no rollback. Proposals to fix this (introducing a single system call encapsulating the whole transaction, and implementing rollback through snapshots) went nowhere, and the Btrfs authors deprecated the transaction system calls. Leveraging NTFS's in-kernel transaction framework was no alternative either: it too was deprecated, due to its high barrier to entry.

  10. Challenge 1: Efficient Transactions - Implementing the WAL in User Space (1): Slow Read-Modify-Write. Ceph performs many read-modify-write operations on objects, but the user-space WAL performs three steps for every transaction, and an object's new contents cannot be read until the third step (applying the transaction to the file system) completes; each read-modify-write therefore waits out the full latency of a WAL commit.
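
To make the three steps and the read dependency concrete, here is a toy user-space WAL in Python. It is illustrative only, not FileStore's implementation; the paths and JSON record format are invented. Reads go to the backing files, so a read-modify-write cannot complete until step 3 has applied the previous write.

```python
import json, os

class UserSpaceWAL:
    """Toy write-ahead log: every transaction goes through three steps."""

    def __init__(self, log_path: str, data_dir: str):
        self.log_path, self.data_dir = log_path, data_dir
        os.makedirs(data_dir, exist_ok=True)

    def submit(self, txn: dict):
        with open(self.log_path, "a") as log:
            log.write(json.dumps(txn) + "\n")   # Step 1: serialize and append to the log.
            log.flush()
            os.fsync(log.fileno())              # Step 2: make the log entry durable.
        self._apply(txn)                        # Step 3: apply to the file system.

    def _apply(self, txn: dict):
        path = os.path.join(self.data_dir, txn["object"])
        with open(path, "w") as f:
            f.write(txn["data"])
            os.fsync(f.fileno())

    def read(self, obj: str) -> str:
        # Reads hit the backing files, so they only see a write after step 3.
        with open(os.path.join(self.data_dir, obj)) as f:
            return f.read()

wal = UserSpaceWAL("/tmp/toy-wal.log", "/tmp/toy-objects")
wal.submit({"object": "obj1", "data": "v1"})   # a read-modify-write waits for all three steps
print(wal.read("obj1"))
```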

  11. Challenge 1: Efficient Transactions - Implementing the WAL in User Space (2): Non-Idempotent Operations. Consider the logged sequence 1) update A; 2) rename B to C; 3) rename A to B; 4) update D. If a crash occurs and the WAL is replayed over a state in which these operations were already partially applied, the replay corrupts object A. FileStore on Btrfs avoided this by periodically taking persistent snapshots of the file system and marking the WAL position at snapshot time, so replay always starts from a consistent point. On XFS, the sync system call is the only option for synchronizing file system state to storage, and there is no way to restore the file system to a specific state, so that approach is unavailable.
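
The hazard is easy to reproduce with an in-memory model (a toy example, not Ceph code): replaying the whole log over a state in which some operations have already been applied yields a different, corrupted result, because rename is not idempotent.

```python
def apply(ops, state):
    """Apply a list of WAL operations to an in-memory object store."""
    for op, *args in ops:
        if op == "update":
            name, data = args
            state[name] = data
        elif op == "rename":
            src, dst = args
            state[dst] = state.pop(src)

wal = [("update", "A", "new-A"),
       ("rename", "B", "C"),
       ("rename", "A", "B"),
       ("update", "D", "new-D")]

# Intended result when the WAL runs exactly once.
intended = {"A": "old-A", "B": "old-B"}
apply(wal, intended)
print(intended)   # {'C': 'old-B', 'B': 'new-A', 'D': 'new-D'}

# Crash after op 3, then a naive replay of the whole WAL from the start.
crashed = {"A": "old-A", "B": "old-B"}
apply(wal[:3], crashed)   # the state already reflects ops 1-3
apply(wal, crashed)       # replay: 'rename B C' now moves A's data and old-B is lost
print(crashed)            # {'C': 'new-A', 'B': 'new-A', 'D': 'new-D'} -- corrupted
```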

  12. Challenge 1: Efficient Transactions - Implementing the WAL in User Space (3): Double Writes. Data is written twice, first to the WAL and then to the file system, wasting disk bandwidth. File systems typically sidestep this by writing data to its final location first and logging only the respective metadata; doing the same on top of a user-space WAL requires an in-memory cache of data and metadata for any updates still waiting on the WAL, so that reads can be served before the commit is applied.
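
A short sketch of that alternative (illustrative only; the class and record format are invented for this example): write object data once, directly to its final file, log only the metadata, and keep the metadata in memory so lookups see it while the WAL entry is still pending.

```python
import json, os

class MetadataOnlyWAL:
    """Toy backend that logs metadata only; object data is written exactly once."""

    def __init__(self, data_dir: str, log_path: str):
        self.data_dir, self.log_path = data_dir, log_path
        self.pending = {}   # in-memory metadata for updates still waiting on the WAL
        os.makedirs(data_dir, exist_ok=True)

    def write_object(self, name: str, data: bytes, meta: dict):
        path = os.path.join(self.data_dir, name)
        with open(path, "wb") as f:      # 1. data goes straight to its final location
            f.write(data)
            os.fsync(f.fileno())
        with open(self.log_path, "a") as log:   # 2. only the metadata enters the WAL
            log.write(json.dumps({"object": name, "meta": meta}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.pending[name] = meta        # 3. cache so reads need not wait on WAL apply

    def get_meta(self, name: str):
        return self.pending.get(name)
```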

  13. Challenge 1: Efficient Transactions - Using a Key-Value Store as the WAL. In NewStore, the metadata was stored in RocksDB while object data stayed in files. First, slow read-modify-write operations are avoided; second, the problem of non-idempotent operation replay is avoided; finally, the problem of double writes is avoided. Even so, creating an object in NewStore results in four expensive flush commands to disk, because the object file fsync and the RocksDB commit each issue two flushes on a journaling file system.
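
The property NewStore leans on is that metadata updates commit as one atomic key-value batch. Sketched below with a plain dict standing in for RocksDB (the point is the atomic batch, not the specific library; the key names are invented):

```python
class ToyKVStore:
    """Dict-backed stand-in for RocksDB: a batch applies atomically, all or nothing."""

    def __init__(self):
        self.kv = {}

    def write_batch(self, puts: dict):
        staged = dict(self.kv)
        staged.update(puts)   # a real store would append to its WAL, then apply
        self.kv = staged

def create_object(db: ToyKVStore, name: str, size: int):
    # Object metadata and the small WAL-style record land in a single atomic
    # batch, so there is no non-idempotent replay and no separate journal
    # commit to wait on for the metadata.
    db.write_batch({
        f"meta/{name}": {"size": size},
        f"wal/create/{name}": {"op": "create"},
    })

db = ToyKVStore()
create_object(db, "obj1", 4096)
print(db.kv["meta/obj1"])
```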

  14. Challenge 2: Fast Metadata Operations. In FileStore, enumerating objects is slow because readdir on large directories is slow and returns entries in no particular order. Directories are therefore kept small by splitting them when the number of entries in them grows past a threshold. But as the number of objects grows, directory contents spread out across the disk and split operations take longer due to seeks; as a result, when all Ceph OSDs start splitting in unison, performance suffers.
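
The splitting scheme can be pictured as hash-prefix directories (a simplified sketch; FileStore's real layout, fan-out, and split threshold differ): objects live in directories named by the leading hex digits of their hash, and a directory that grows past a threshold pushes its entries one digit deeper. On a real OSD that means renaming many files scattered across the disk, which is why simultaneous splits hurt.

```python
import hashlib, os

SPLIT_THRESHOLD = 320   # hypothetical maximum entries per directory

def obj_hash(name: str) -> str:
    return hashlib.md5(name.encode()).hexdigest()

def object_path(root: str, name: str, depth: int) -> str:
    """Path for an object: one nested directory per leading hex digit of its hash."""
    h = obj_hash(name)
    return os.path.join(root, *h[:depth], h + "_" + name)

def maybe_split(dir_path: str, depth: int):
    """If a directory has too many entries, move them one hash digit deeper."""
    entries = os.listdir(dir_path)
    if len(entries) <= SPLIT_THRESHOLD:
        return
    for entry in entries:
        full = os.path.join(dir_path, entry)
        if not os.path.isfile(full):
            continue                                # skip subdirectories from earlier splits
        h = entry.split("_", 1)[0]
        target = os.path.join(dir_path, h[depth])   # next hex digit picks the subdirectory
        os.makedirs(target, exist_ok=True)
        os.rename(full, os.path.join(target, entry))
```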

  15. Challenge 2: Fast Metadata Operations (figure only; no transcript text).

  16. BlueStore. The main goals of BlueStore were: fast metadata operations; no consistency overhead for object writes; a copy-on-write clone operation; no journaling double-writes; and optimized I/O patterns for HDD and SSD.

  17. BlueStore: BlueFS and RocksDB. Fast metadata operations: BlueStore stores its metadata in RocksDB, which runs on BlueFS, a minimal user-space file system. No consistency overhead for object writes: first, BlueStore writes data directly to the raw disk; second, it changes RocksDB to reuse WAL files as a circular buffer, keeping the overhead of a metadata commit to a minimum.
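
The WAL-file change amounts to treating the log as a fixed, preallocated region that is overwritten in place instead of repeatedly creating, growing, and deleting log files. A simplified sketch follows (real RocksDB/BlueFS recycling also tracks which records are still live before space is reused; the wrap condition here ignores that):

```python
import os

class CircularLog:
    """Preallocated log region reused as a circular buffer (simplified)."""

    def __init__(self, path: str, capacity: int):
        self.capacity, self.offset = capacity, 0
        self.f = open(path, "w+b")
        self.f.truncate(capacity)   # allocate once; never grow or recreate the file

    def append(self, record: bytes):
        if self.offset + len(record) > self.capacity:
            self.offset = 0         # wrap around and reuse the space of old records
        self.f.seek(self.offset)
        self.f.write(record)
        self.f.flush()
        os.fsync(self.f.fileno())   # one flush; no file-creation metadata churn
        self.offset += len(record)

log = CircularLog("/tmp/toy-circular.log", 1 << 20)
log.append(b"metadata-commit-1\n")
```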

  18. BlueStore: Data Path and Space Allocation. For writes larger than the minimum allocation size, the data is written to a newly allocated extent; this lets BlueStore provide an efficient copy-on-write clone operation and avoids the journal double-write. For writes smaller than the minimum allocation size, data and metadata are first inserted into RocksDB and then asynchronously written to disk, which optimizes I/O for small writes.
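
The size-based decision can be summarized in a few lines. This is a sketch under the assumption of a single min_alloc_size threshold, with toy stand-ins for the allocator, the raw device, and RocksDB; the real BlueStore write path handles many more cases (overwrites, checksums, compression).

```python
MIN_ALLOC_SIZE = 64 * 1024   # hypothetical threshold; device-dependent in practice

class ToyStore:
    """Minimal stand-ins for the allocator, the raw device, and RocksDB."""

    def __init__(self):
        self.next_off, self.device, self.db = 0, {}, {}

    def allocate(self, length: int):
        """Allocator: hand out a fresh extent (offset, length)."""
        off, self.next_off = self.next_off, self.next_off + length
        return (off, length)

    def write_object(self, name: str, data: bytes):
        if len(data) >= MIN_ALLOC_SIZE:
            # Large write: data goes to a newly allocated extent on the raw device;
            # only the extent map is committed through the key-value store.
            extent = self.allocate(len(data))
            self.device[extent] = data
            self.db[f"meta/{name}"] = {"extent": extent}
        else:
            # Small write: data and metadata are inserted into the key-value store
            # first (deferred write) and copied to their final place asynchronously.
            self.db[f"meta/{name}"] = {"deferred": True}
            self.db[f"deferred/{name}"] = data

store = ToyStore()
store.write_object("big-object", b"x" * (1 << 20))
store.write_object("small-object", b"y" * 512)
```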

  19. BlueStore: Data Path and Space Allocation (continued). Space allocation is handled by the FreeList manager, which persistently tracks free space on the device, and the Allocator, which assigns extents to new writes, with a cache in front of the device. Because BlueStore accesses the disk using direct I/O, it cannot leverage the OS page cache, so it implements its own write-through cache in user space based on the 2Q algorithm.
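
A compact sketch of a 2Q-style cache (simplified; BlueStore's actual cache is more elaborate): first-time accesses sit in a FIFO probation queue, and only keys touched a second time are promoted into the long-lived LRU, which keeps one-shot scans from flushing hot data. The queue sizes and key names below are arbitrary.

```python
from collections import OrderedDict

class TwoQCache:
    """Simplified 2Q: FIFO 'in' queue for new keys, LRU 'main' queue for reused keys."""

    def __init__(self, in_size: int = 128, main_size: int = 512):
        self.in_q = OrderedDict()    # recent first-time accesses (FIFO)
        self.main = OrderedDict()    # entries that have been reused (LRU)
        self.in_size, self.main_size = in_size, main_size

    def get(self, key):
        if key in self.main:
            self.main.move_to_end(key)            # refresh LRU position
            return self.main[key]
        if key in self.in_q:
            self.main[key] = self.in_q.pop(key)   # second touch: promote to main
            self._evict(self.main, self.main_size)
            return self.main[key]
        return None

    def put(self, key, value):
        if key in self.main:
            self.main[key] = value
            self.main.move_to_end(key)
        else:
            self.in_q[key] = value                # new key: probation queue
            self._evict(self.in_q, self.in_size)

    @staticmethod
    def _evict(q: OrderedDict, limit: int):
        while len(q) > limit:
            q.popitem(last=False)                 # drop the oldest entry

cache = TwoQCache()
cache.put("blob:0x1000", b"data")
cache.get("blob:0x1000")   # reuse promotes the entry into the main LRU
```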

  20.-24. Evaluation (slides 20-24 are figures only; no transcript text).

  25. Conclusion. Distributed storage systems have uniformly adopted local file systems as their backend and worked around their limitations, paying in accidental complexity and lost performance. Ceph's experience: 8 years wrestling with file-system-based backends, then 2 years to build a new backend, BlueStore. The lesson: build a custom backend and win the performance back.

  26. References:
  https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/1.3/html/storage_strategies_guide/erasure_code_pools
  https://events.static.linuxfound.org/sites/events/files/slides/20170323%20bluestore.pdf
  https://www.ssrc.ucsc.edu/Papers/weil-osdi06.pdf
  https://docs.ceph.com/docs/master/

  27. Ceph: A Scalable, High-Performance Distributed File System. Sage A. Weil et al., OSDI 2006. Thank You! Presentation by Sopanhapich CHUM (sopanhapich.chum@gmail.com)
