Software Overheads of Storage in Ceph
Akila Nagamani
Aravind Soundararajan
Krishnan Rajagopalan
Motivation
I/O was the main bottleneck for applications
Has been the topic of research for decades
Now is the era of faster storage 
100X faster than the traditional spinning disks
Where should the next focus be? What is the dominant part now?
Software, a.k.a. the kernel
What overheads exist?
Processing in the software layers of storage introduces overheads.
A “write” issued by an application is not sent directly to the device
A software storage layer called the “file system” processes it first
The filesystem operations include
Metadata management
Journaling / Logging
Compaction / Compression
Caching
Distributed data management
Why BlueStore? Why not FileStore?
FileStore has the problem of double writes.
Every write is journaled by the Ceph backend for consistency.
The underlying filesystem performs additional journaling.
Hence, an analysis of software overheads would point to the “journal of journals” problem, which is already evident.
BlueStore does not have this problem
The objects/data are directly written onto raw block storage.
Ceph metadata is written to RocksDB using a lightweight BlueFS.
BlueStore has fewer intermediate software layers between Ceph and the storage device, so studying its overheads has the potential for interesting inferences.
Write data flow in BlueStore
[Flow diagram: a client write enters OSD::ShardedOpWQ; data is written to disk asynchronously by BlueStore.bdev_aio_thread (or a WAL record is prepared instead); the transaction is queued on BlueStore.kv_queue and committed by BlueStore.kv_sync_thread as a RocksDB metadata write; if a WAL record was committed, BlueStore::WALWQ applies the WAL to disk asynchronously; BlueStore.finishers then complete the operation back to the client. Yes/No branches in the diagram select whether the WAL path is taken.]
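As a reading aid, here is a minimal Python sketch (not Ceph source code) that models the stages in the diagram above, including the branch that decides whether a write takes the WAL path; the 64K threshold is an assumption borrowed from the min_alloc_size discussion later in this deck.

```python
# Simplified model of the BlueStore write path shown in the diagram.
# Illustrative sketch only, not Ceph source code.

def bluestore_write(data: bytes, min_alloc_size: int = 64 * 1024) -> list[str]:
    """Return the list of stages a write passes through in this model."""
    stages = ["OSD::ShardedOpWQ"]            # op dequeued by an OSD worker shard

    use_wal = len(data) < min_alloc_size      # small writes take the WAL (deferred) path
    if not use_wal:
        # Large writes: data goes straight to the raw block device, asynchronously.
        stages.append("BlueStore.bdev_aio_thread: write data to disk, async")

    # All writes commit metadata (and a WAL record, if any) through RocksDB.
    stages.append("BlueStore.kv_queue")
    stages.append("BlueStore.kv_sync_thread: RocksDB metadata write"
                  + (" + WAL record" if use_wal else ""))

    if use_wal:
        # Deferred data is applied to the device after the KV commit.
        stages.append("BlueStore::WALWQ: apply WAL to disk, async")

    stages.append("BlueStore.finishers: completion back to client")
    return stages


if __name__ == "__main__":
    for size in (4 * 1024, 256 * 1024):
        print(f"{size // 1024}K write:")
        for stage in bluestore_write(b"x" * size):
            print("   ", stage)
```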
Experiment
We used the ‘perf’ tool to collect traces of our write operations and ‘FlameGraph’ to analyze the traces.
The write workload size was varied from 4K up to 512K (a workload sketch follows below).
We used a ramdisk to simulate infinitely fast storage.
A single-cluster configuration was used
Network processing overheads are not considered.
No replication for objects
We don’t aim to study reliability
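For illustration, below is a minimal sketch of such a write workload using the python-rados bindings; the pool name, object names, and configuration path are assumptions, and the original experiment may have used a different client entirely.

```python
import rados  # python-rados bindings for librados

# Hypothetical pool name and conf path; adjust for your cluster.
POOL = "testpool"
CONF = "/etc/ceph/ceph.conf"

def run_writes():
    cluster = rados.Rados(conffile=CONF)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(POOL)
        try:
            # Vary the write size from 4K up to 512K, as in the experiment.
            size = 4 * 1024
            while size <= 512 * 1024:
                payload = b"a" * size
                ioctx.write_full(f"obj_{size}", payload)  # single full-object write
                size *= 2
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()

if __name__ == "__main__":
    run_writes()
```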
We categorize each of the major functions encountered in our traces into two sets, as sketched below. (Motivated by the paper “The Importance of Non-Data Touching Processing Overheads in TCP/IP”.)
Data touching: methods whose cost depends on the input data size.
Non-data touching: methods whose cost is independent of the input data size.
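A minimal sketch of this classification: it assumes the perf traces have already been collapsed into FlameGraph's folded-stack format (one line per stack with a trailing sample count), and the keyword list used to mark frames as data touching is an illustrative guess rather than the exact function list used in the study.

```python
import sys
from collections import Counter

# Illustrative keyword hints; the real study classified specific Ceph functions.
DATA_TOUCHING_HINTS = ("crc32", "memcpy", "memset", "zero", "copy")

def classify(frame: str) -> str:
    """Label a stack frame as data touching or non-data touching."""
    name = frame.lower()
    return "data" if any(h in name for h in DATA_TOUCHING_HINTS) else "non-data"

def main(folded_path: str) -> None:
    totals = Counter()
    with open(folded_path) as f:
        for line in f:
            # Folded format: "frame1;frame2;...;frameN <sample_count>"
            stack, _, count = line.rstrip("\n").rpartition(" ")
            if not stack:
                continue
            leaf = stack.split(";")[-1]        # attribute samples to the leaf frame
            totals[classify(leaf)] += int(count)

    total = sum(totals.values()) or 1
    for kind, samples in totals.items():
        print(f"{kind:>8} touching: {samples} samples ({100.0 * samples / total:.1f}%)")

if __name__ == "__main__":
    main(sys.argv[1])
```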
Flame Graph for a write size of 4K
Major functions identified by the Flamegraph
The time spent in the “Swapper” is very high in the case of an HDD.
This shows that Ceph performs synchronous I/O, which makes sense for achieving “reliability”.
The time spent in the “Swapper” can be viewed as the I/O time.
Non-data touching overhead - Journaling
Assumption: Ceph journals metadata, and journaling should not depend on data size
Expectation: constant time for any data size
Observation: two regions of constant time
< 64K: longer time spent in journaling
Data journaling happens for small writes (< min_alloc_size)
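This matches BlueStore's deferred-write behavior, where small writes carry their data through the RocksDB WAL while large writes journal only metadata. Below is a minimal sketch of that decision; the 64K min_alloc_size is an assumed value consistent with the observed knee, not necessarily the configured one.

```python
# Model of the two journaling regimes observed in the experiment:
# below min_alloc_size the data itself is journaled (deferred write),
# at or above it only metadata is journaled.
# 64K is an assumed min_alloc_size; real clusters configure this per device type.
MIN_ALLOC_SIZE = 64 * 1024

def journaling_mode(write_size: int) -> str:
    """Return which journaling regime a write of this size falls into."""
    if write_size < MIN_ALLOC_SIZE:
        return "metadata + data journaled (deferred write, longer journaling time)"
    return "metadata only journaled (shorter journaling time)"

if __name__ == "__main__":
    for kb in (4, 16, 64, 256, 512):
        print(f"{kb:>3}K -> {journaling_mode(kb * 1024)}")
```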
Non-data touching overhead
Ceph performs other non-data touching operations
RocksDB compaction
Socket overhead
These overheads were too small to be analyzed.
Data touching overhead
Ceph performs the following operations on data
CRC calculation
Zero-filling
These overheads were too small (0.01% of total time) to be analyzed
Writes of data > 5 MB were a problem on the 1 GB ramdisk
Sage Weil believes the backend is saturated in that case.
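To illustrate how a data touching overhead scales with payload size, here is a minimal sketch that times a checksum over buffers of increasing size; it uses Python's zlib.crc32 as a stand-in for the crc32c checksum Ceph actually computes, so the numbers are only indicative.

```python
import time
import zlib

# Data touching work scales with payload size; zlib.crc32 stands in for
# Ceph's crc32c to show the trend, not the real cost.
def time_crc(size: int, repeats: int = 100) -> float:
    payload = b"\x00" * size          # zero-filled buffer, like Ceph's zero-fill path
    start = time.perf_counter()
    for _ in range(repeats):
        zlib.crc32(payload)
    return (time.perf_counter() - start) / repeats

if __name__ == "__main__":
    for kb in (4, 64, 512):
        print(f"{kb:>3}K payload: {time_crc(kb * 1024) * 1e6:.1f} microseconds per checksum")
```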
Conclusion
Ceph BlueStore is better tuned for fast storage than FileStore.
Journaling is the major software overhead added by the storage layer.
The only way to avoid it is to trade away consistency, which might not be suitable for Ceph.
Data touching overheads in Ceph are very small.
The storage layer has more non-data touching overheads than data touching overheads.
This is in contrast to the network layer.
The extra overhead added by the software storage layer buys additional consistency and reliability guarantees.
It could be avoided if those guarantees are unnecessary.
THANK YOU!
QUESTIONS?