Software Overheads of Storage in Ceph
Akila Nagamani
Aravind Soundararajan
Krishnan Rajagopalan
Motivation
I/O was the main bottleneck for applications
Has been the topic of research for decades
Now is the era of faster storage 
100X faster than the traditional spinning disks
Where should the next focus be? What is the dominant part now?
Software, a.k.a. the kernel
What overheads exist?
Processing in the software layers of storage introduces overheads.
A “write” issued by an application is not sent directly to the device
A software storage layer called the “file system” processes it first
The filesystem operations include
Metadata management
Journaling / Logging
Compaction / Compression
Caching
Distributed data management
Why BlueStore? Why not FileStore?
FileStore has the problem of double writes.
Every write is journaled by the Ceph backend for consistency.
The underlying filesystem performs additional journaling.
Hence, an analysis of software overheads would point to the “journal of journals” problem, which is already evident.
BlueStore does not have this problem
The objects/data are directly written onto raw block storage.
Ceph metadata is written to RocksDB using a lightweight BlueFS.
BlueStore has fewer intermediate software layers between Ceph and the storage device, so studying its overheads has the potential for interesting inferences.
Write data flow in BlueStore
[Flow diagram: a client write enters OSD::ShardedOpWQ; data is written to disk asynchronously by BlueStore.bdev_aio_thread (or a WAL record is prepared instead); the transaction is queued on BlueStore.kv_queue and committed by BlueStore.kv_sync_thread as a RocksDB metadata write; if a WAL record was committed, BlueStore::WALWQ applies the WAL to disk asynchronously; BlueStore.finishers then complete the operation back to the client. Yes/No branches in the diagram select whether the WAL path is taken.]
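As a reading aid, here is a minimal Python sketch (not Ceph source code) that models the stages in the diagram above, including the branch that decides whether a write takes the WAL path; the 64K threshold is an assumption borrowed from the min_alloc_size discussion later in this deck.

```python
# Simplified model of the BlueStore write path shown in the diagram.
# Illustrative sketch only, not Ceph source code.

def bluestore_write(data: bytes, min_alloc_size: int = 64 * 1024) -> list[str]:
    """Return the list of stages a write passes through in this model."""
    stages = ["OSD::ShardedOpWQ"]            # op dequeued by an OSD worker shard

    use_wal = len(data) < min_alloc_size      # small writes take the WAL (deferred) path
    if not use_wal:
        # Large writes: data goes straight to the raw block device, asynchronously.
        stages.append("BlueStore.bdev_aio_thread: write data to disk, async")

    # All writes commit metadata (and a WAL record, if any) through RocksDB.
    stages.append("BlueStore.kv_queue")
    stages.append("BlueStore.kv_sync_thread: RocksDB metadata write"
                  + (" + WAL record" if use_wal else ""))

    if use_wal:
        # Deferred data is applied to the device after the KV commit.
        stages.append("BlueStore::WALWQ: apply WAL to disk, async")

    stages.append("BlueStore.finishers: completion back to client")
    return stages


if __name__ == "__main__":
    for size in (4 * 1024, 256 * 1024):
        print(f"{size // 1024}K write:")
        for stage in bluestore_write(b"x" * size):
            print("   ", stage)
```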
Experiment
We used the ‘perf’ tool to collect traces of our write operations and ‘FlameGraph’ to analyze the traces.
The write workload size was varied from 4K up to 512K (a workload sketch follows below).
We used a ramdisk to simulate infinitely fast storage.
A single-cluster configuration was used
Network processing overheads are not considered.
No replication for objects
We don’t aim to study reliability
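For illustration, below is a minimal sketch of such a write workload using the python-rados bindings; the pool name, object names, and configuration path are assumptions, and the original experiment may have used a different client entirely.

```python
import rados  # python-rados bindings for librados

# Hypothetical pool name and conf path; adjust for your cluster.
POOL = "testpool"
CONF = "/etc/ceph/ceph.conf"

def run_writes():
    cluster = rados.Rados(conffile=CONF)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(POOL)
        try:
            # Vary the write size from 4K up to 512K, as in the experiment.
            size = 4 * 1024
            while size <= 512 * 1024:
                payload = b"a" * size
                ioctx.write_full(f"obj_{size}", payload)  # single full-object write
                size *= 2
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()

if __name__ == "__main__":
    run_writes()
```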
We categorize each of the major functions encountered in our traces into two sets, as sketched below. (Motivated by the paper “The Importance of Non-Data Touching Processing Overheads in TCP/IP”.)
Data touching: methods whose cost depends on the input data size.
Non-data touching: methods whose cost is independent of the input data size.
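A minimal sketch of this classification: it assumes the perf traces have already been collapsed into FlameGraph's folded-stack format (one line per stack with a trailing sample count), and the keyword list used to mark frames as data touching is an illustrative guess rather than the exact function list used in the study.

```python
import sys
from collections import Counter

# Illustrative keyword hints; the real study classified specific Ceph functions.
DATA_TOUCHING_HINTS = ("crc32", "memcpy", "memset", "zero", "copy")

def classify(frame: str) -> str:
    """Label a stack frame as data touching or non-data touching."""
    name = frame.lower()
    return "data" if any(h in name for h in DATA_TOUCHING_HINTS) else "non-data"

def main(folded_path: str) -> None:
    totals = Counter()
    with open(folded_path) as f:
        for line in f:
            # Folded format: "frame1;frame2;...;frameN <sample_count>"
            stack, _, count = line.rstrip("\n").rpartition(" ")
            if not stack:
                continue
            leaf = stack.split(";")[-1]        # attribute samples to the leaf frame
            totals[classify(leaf)] += int(count)

    total = sum(totals.values()) or 1
    for kind, samples in totals.items():
        print(f"{kind:>8} touching: {samples} samples ({100.0 * samples / total:.1f}%)")

if __name__ == "__main__":
    main(sys.argv[1])
```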
Flame Graph for a write size of 4K
Major functions identified by the Flamegraph
The time spent in the “Swapper” is very high in the case of an HDD.
This shows that Ceph performs synchronous I/O, which makes sense for achieving “reliability”.
The time spent in the “Swapper” can be viewed as the I/O time.
Non-data touching overhead - Journaling
Assumption: Ceph journals metadata, and journaling should not depend on data size
Expectation: constant time for any data size
Observation: two regions of constant time
< 64K: longer time spent in journaling
Data journaling happens for small writes (< min_alloc_size)
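This matches BlueStore's deferred-write behavior, where small writes carry their data through the RocksDB WAL while large writes journal only metadata. Below is a minimal sketch of that decision; the 64K min_alloc_size is an assumed value consistent with the observed knee, not necessarily the configured one.

```python
# Model of the two journaling regimes observed in the experiment:
# below min_alloc_size the data itself is journaled (deferred write),
# at or above it only metadata is journaled.
# 64K is an assumed min_alloc_size; real clusters configure this per device type.
MIN_ALLOC_SIZE = 64 * 1024

def journaling_mode(write_size: int) -> str:
    """Return which journaling regime a write of this size falls into."""
    if write_size < MIN_ALLOC_SIZE:
        return "metadata + data journaled (deferred write, longer journaling time)"
    return "metadata only journaled (shorter journaling time)"

if __name__ == "__main__":
    for kb in (4, 16, 64, 256, 512):
        print(f"{kb:>3}K -> {journaling_mode(kb * 1024)}")
```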
Non-data touching overhead
Ceph performs other non-data touching operations
RocksDB compaction
Socket overhead
These overheads were too small to be analyzed.
Data touching overhead
Ceph performs the following operations on data
CRC calculation
Zero-filling
These overheads were too small (0.01% of total time) to be analyzed
Writes of data > 5 MB were a problem on the 1 GB ramdisk
Sage Weil believes the backend is saturated in that case.
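To illustrate how a data touching overhead scales with payload size, here is a minimal sketch that times a checksum over buffers of increasing size; it uses Python's zlib.crc32 as a stand-in for the crc32c checksum Ceph actually computes, so the numbers are only indicative.

```python
import time
import zlib

# Data touching work scales with payload size; zlib.crc32 stands in for
# Ceph's crc32c to show the trend, not the real cost.
def time_crc(size: int, repeats: int = 100) -> float:
    payload = b"\x00" * size          # zero-filled buffer, like Ceph's zero-fill path
    start = time.perf_counter()
    for _ in range(repeats):
        zlib.crc32(payload)
    return (time.perf_counter() - start) / repeats

if __name__ == "__main__":
    for kb in (4, 64, 512):
        print(f"{kb:>3}K payload: {time_crc(kb * 1024) * 1e6:.1f} microseconds per checksum")
```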
Conclusion
Ceph BlueStore is better tuned for fast storage than FileStore.
Journaling is the major software overhead added by the storage layer.
The only way to avoid it is to trade away consistency, which might not be suitable for Ceph.
Data touching overheads in Ceph are very small.
The storage layer has more non-data touching overheads than data touching overheads.
This is in contrast to the network layer.
The extra overhead added by the software storage layer buys additional consistency and reliability guarantees.
It could be avoided if those guarantees are unnecessary.
THANK YOU!
QUESTIONS?