Understanding Software Overheads in Ceph Storage

Exploring the impact of software layers on storage performance in Ceph, focusing on the BlueStore backend in the era of much faster storage devices, and analyzing the write data flow and experimental findings for write operations.



Presentation Transcript


  1. Software Overheads of Storage in Ceph
     Akila Nagamani, Aravind Soundararajan, Krishnan Rajagopalan

  2. Motivation
     - I/O has been the main bottleneck for applications and a topic of research for decades.
     - We are now in the era of faster storage devices, roughly 100x faster than traditional spinning disks.
     - Where should the next focus be? What is the dominant cost now? The software, i.e. the kernel storage stack.

  3. What overheads exist?
     - Processing in the software layers of the storage stack introduces overheads.
     - A write issued by an application is not sent directly to the device; a software layer, the file system, processes it first.
     - File system operations include metadata management, journaling/logging, compaction/compression, caching, and distributed data management, as the toy sketch after this slide illustrates.
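     A minimal toy model (not Ceph or any real file system code; all sizes are arbitrary assumptions) of how a single application write expands into several device operations once a journaling file system processes it:

        # Toy model: one application write becomes several device operations.
        class ToyJournalingFS:
            def __init__(self):
                self.device_ops = []  # operations that actually reach the block device

            def write(self, inode, offset, data):
                # 1. Journal a record describing the update (logging overhead; assumed 512 B).
                self.device_ops.append(("journal", 512))
                # 2. Update metadata: inode size, mtime, block allocation (assumed 256 B).
                self.device_ops.append(("metadata", 256))
                # 3. Finally write the user data itself.
                self.device_ops.append(("data", len(data)))

        fs = ToyJournalingFS()
        fs.write(inode=1, offset=0, data=b"x" * 4096)
        print(fs.device_ops)  # a single 4 KiB write turned into three device operations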

  4. Why BlueStore? Why not FileStore?
     - FileStore suffers from double writes: every write is journaled by the Ceph backend for consistency, and the underlying file system performs additional journaling of its own.
     - An overhead analysis of FileStore would therefore mostly point to the already well-known "journal of journals" problem.
     - BlueStore does not have this problem: object data is written directly onto raw block storage, and Ceph metadata is written to RocksDB via the lightweight BlueFS.
     - BlueStore has fewer intermediate software layers between Ceph and the storage device, so studying its overheads has the potential for more interesting inferences. A rough back-of-the-envelope comparison follows.
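     A back-of-the-envelope sketch of the double-write problem; the journal and metadata sizes below are assumed for illustration, not measured, and the FileStore case shows a worst-case where the underlying file system journals the data again:

        write = 4096                 # client write size in bytes
        fs_journal_record = 4096     # worst case: the file system journals the data once more

        # FileStore: full data into the Ceph journal, then into the file system,
        # which adds its own journaling ("journal of journals").
        filestore_bytes = write + write + fs_journal_record

        # BlueStore: data goes straight to the raw block device; only small
        # metadata is journaled through RocksDB/BlueFS (512 B assumed).
        metadata = 512
        bluestore_bytes = write + metadata

        print(filestore_bytes / write)   # ~3x amplification in this toy case
        print(bluestore_bytes / write)   # ~1.1x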

  5. Write data flow in BlueStore
     [Flow diagram. Components shown: Client, OSD::ShardedOpWQ, BlueStore.kv_queue, BlueStore.kv_sync_thread (RocksDB metadata write), BlueStore.bdev_aio_thread (write data to disk, async), BlueStore::WALWQ (apply WAL to disk, async), BlueStore.finishers, with branches on whether a WAL/deferred write is needed. A paraphrased sketch follows.]
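     A simplified, runnable paraphrase of the flow in the diagram; the component names come from the slide, but the control flow is our reading of it and the stubs below only log the steps, they are not Ceph code:

        steps = []

        def submit_aio(data):            # stands in for BlueStore.bdev_aio_thread
            steps.append(f"aio write of {len(data)} B to raw device")

        def rocksdb_commit(what):        # stands in for kv_queue -> kv_sync_thread
            steps.append(f"RocksDB commit: {what}")

        def finish(op):                  # stands in for BlueStore.finishers
            steps.append(f"completion for {op} delivered to client")

        def apply_wal(data):             # stands in for BlueStore::WALWQ
            steps.append(f"apply {len(data)} B from WAL to disk, async")

        def bluestore_write(op, data, use_wal):
            if not use_wal:
                submit_aio(data)                 # data straight to the block device, async
                rocksdb_commit("metadata only")  # only metadata journaled via RocksDB
                finish(op)
            else:
                rocksdb_commit("metadata + small data (WAL)")  # small writes journaled whole
                finish(op)                                     # ack before data hits the device
                apply_wal(data)

        bluestore_write("op1", b"x" * (256 * 1024), use_wal=False)  # large write
        bluestore_write("op2", b"x" * 4096, use_wal=True)           # small write
        print("\n".join(steps))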

  6. Experiment
     - We used the perf tool to collect traces of our write operations and FlameGraph to analyze the traces (sketched after this list).
     - The write size was varied from 4 KiB up to 512 KiB.
     - A ramdisk was used to simulate infinitely fast storage.
     - A single cluster configuration was used; network processing overheads are not considered.
     - No replication for objects: we do not aim to study reliability.
     - Each major function encountered in the traces was categorized into two sets, motivated by the paper "The Importance of Non-Data Touching Processing Overheads in TCP/IP":
       - Data touching: methods whose cost depends on the input data size.
       - Non-data touching: methods whose cost is independent of the input data size.
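     A rough sketch of how the categorization could be scripted from folded stacks (the output of FlameGraph's stackcollapse-perf.pl); the substrings below are illustrative guesses, not the exact function lists used in the study:

        from collections import Counter

        DATA_TOUCHING = ("crc32", "memcpy", "memset", "copy_user")   # costs scale with write size

        def classify(folded_path):
            totals = Counter()
            with open(folded_path) as f:
                for line in f:
                    stack, _, count = line.rpartition(" ")
                    if not stack:
                        continue
                    bucket = ("data_touching"
                              if any(s in stack.lower() for s in DATA_TOUCHING)
                              else "non_data_touching")
                    totals[bucket] += int(count)
            return totals

        # Usage (paths and PID are hypothetical):
        #   perf record -g -p <osd_pid>; perf script | stackcollapse-perf.pl > out.folded
        #   print(classify("out.folded"))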

  7. Flame graph for a write size of 4 KiB
     [Figure: flame graph of the 4 KiB write trace.]

  8. Major functions identified by the flame graph
     - The time spent in the swapper (the kernel idle task) is very high in the HDD case. This shows that Ceph performs synchronous I/O, which makes sense as a way to achieve reliability.
     - The time spent in the swapper can therefore be viewed as the I/O wait time.

  9. Non-data-touching overhead: journaling
     - Assumption: Ceph journals metadata, so journaling time should not depend on data size.
     - Expectation: roughly constant time for any write size.
     - Observation: two regions of constant time; writes below 64 KiB spend longer in journaling.
     - Reason: data journaling happens for small writes below min_alloc_size (see the sketch after this list).
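     A minimal sketch of the behaviour described above (our reading of the slide, not Ceph code); the 64 KiB threshold matches the knee observed in the experiment and the metadata size is an assumed constant:

        MIN_ALLOC_SIZE = 64 * 1024   # 64 KiB

        def journal_payload(write_size, metadata_size=512):
            if write_size < MIN_ALLOC_SIZE:
                return metadata_size + write_size   # data + metadata journaled together
            return metadata_size                    # only metadata journaled

        for size in (4096, 16384, 65536, 262144):
            print(f"{size:>7} B write -> {journal_payload(size)} B journaled")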

  10. Other non-data-touching overheads
     - Ceph performs other non-data-touching operations, such as RocksDB compaction and socket handling.
     - These overheads were too small to be analyzed.

  11. Data-touching overheads
     - Ceph performs the following operations on the data itself: CRC calculation and zero-filling (illustrated below).
     - These overheads were too small (about 0.01% of the time) to be analyzed further.
     - Writes larger than 5 MB were a problem on the 1 GB ramdisk; Sage Weil's view is that the backend is saturated at that point.
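     A quick illustration of why checksumming is data touching: its cost grows with the write size. zlib.crc32 is used here as a stand-in for the crc32c checksum BlueStore computes:

        import time, zlib

        for size in (4 * 1024, 64 * 1024, 512 * 1024):
            buf = b"\x00" * size
            start = time.perf_counter()
            for _ in range(1000):
                zlib.crc32(buf)
            elapsed = time.perf_counter() - start
            print(f"{size // 1024:>4} KiB: {elapsed * 1e3:.2f} ms for 1000 checksums")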

  12. Conclusion
     - Ceph BlueStore is better tuned for fast storage than FileStore.
     - Journaling is the major software overhead added by the storage layer; the only way to avoid it is to trade away consistency, which might not be suitable for Ceph.
     - Data-touching overheads in Ceph are very small: the storage layer has more non-data-touching overhead than data-touching overhead, in contrast to the network layer.
     - The extra overhead introduced by the software storage layer buys additional consistency and reliability guarantees, and could be avoided where those guarantees are unnecessary.

  13. Thank you! Questions?
