Overview of Ceph Distributed File System

Ceph is a scalable, high-performance distributed file system designed for excellent performance, reliability, and scalability in very large systems. It employs innovative strategies like distributed dynamic metadata management, pseudo-random data distribution, and decoupling data and metadata tasks for efficient operations. The system architecture involves decoupling data and metadata management, autonomic distributed object storage, and dynamic partitioning of metadata tasks within the metadata cluster. Ceph aims to handle petabyte-scale storage with security, consistency, and coherence.
CEPH: A SCALABLE, HIGH-PERFORMANCE DISTRIBUTED FILE SYSTEM
S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, C. Maltzahn
U. C. Santa Cruz
OSDI 2006
Paper highlights
Yet another distributed file system using
object storage devices
Designed for scalability
Main contributions
1. Distributed dynamic metadata management through hashing
2. Pseudo-random data distribution function replaces object lists
System objectives
Excellent performance and reliability
Unparalleled scalability thanks to
Distribution of metadata workload inside metadata cluster
Use of object storage devices (OSDs)
Designed for very large systems
Petabyte scale (10^6 gigabytes)
Characteristics of very large systems
Built incrementally
Node failures are the norm
Quality and character of workload changes over time
System overview
System architecture
Key ideas
Decoupling data and metadata
Metadata management
Autonomic distributed object storage
System Architecture (I)
System Architecture (II)
Clients
Export a near-POSIX file system interface
Cluster of OSDs
Store all data and metadata
Communicate directly with clients
Metadata server cluster
Manages the namespace (files and directories)
Security, consistency and coherence
Key ideas
Separate data and metadata management tasks
- Metadata cluster does not have object lists
Dynamic partitioning of metadata tasks inside the
metadata cluster
Avoids hot spots
Let OSDs handle file migration and replication tasks
Decoupling data and metadata
Metadata cluster handles metadata operations
Clients interact directly with OSDs for all file I/O
Low-level block allocation is delegated to OSDs
Other object-based file systems still require the metadata cluster to hold object lists
Ceph uses a special pseudo-random data distribution function (CRUSH)
Old School
Client asks the metadata server cluster: "File xyz? Where to find the container objects?"
The metadata server keeps track of the locations of all containers
Ceph with CRUSH
Client asks the metadata server cluster: "File xyz? How to find the container objects?"
Client uses CRUSH and data provided by the MDS cluster to find the file
Ceph with CRUSH
The MDS cluster replies: "Here is how to find these container objects"
Client uses CRUSH and data provided by the MDS cluster to find the file
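The difference between the two designs can be made concrete with a small sketch. This is illustrative only: the lookup table, the SHA-1-based placement function, and names such as lookup_old_school are assumptions made for the example, not Ceph's actual CRUSH algorithm.

```python
import hashlib

# Toy sketch of the two lookup styles (not Ceph's real CRUSH algorithm).

# "Old school": the metadata server keeps an explicit table mapping every
# object to the servers that store it, and clients must ask it each time.
object_location_table = {
    ("xyz", 0): ["osd3", "osd7"],
    ("xyz", 1): ["osd1", "osd4"],
}

def lookup_old_school(file_name, stripe_no):
    """Ask the (simulated) metadata server for the object's locations."""
    return object_location_table[(file_name, stripe_no)]

# Ceph-style: the client computes the locations itself from a small,
# rarely-changing cluster map plus a deterministic placement function.
cluster_osds = ["osd%d" % i for i in range(8)]   # hypothetical cluster map

def lookup_with_placement_function(file_name, stripe_no, replicas=2):
    """Derive the object's locations by hashing; no per-object table needed."""
    key = ("%s:%d" % (file_name, stripe_no)).encode()
    start = int(hashlib.sha1(key).hexdigest(), 16) % len(cluster_osds)
    return [cluster_osds[(start + r) % len(cluster_osds)] for r in range(replicas)]

print(lookup_old_school("xyz", 0))
print(lookup_with_placement_function("xyz", 0))
```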
Metadata management
Dynamic Subtree Partitioning
Lets Ceph dynamically share metadata  workload among tens
or hundreds of metadata servers (MDSs)
Sharing is dynamic and based on current access patterns
Results in near-linear performance scaling in the number of MDSs
Autonomic distributed object storage
Distributed storage handles data migration and data replication
Leverages the computational resources of OSDs
Achieves reliable highly-available scalable object storage
Reliable implies no data losses
Highly available implies being accessible almost all the time
THE CLIENT
Performing an I/O
Client synchronization
Namespace operations
Performing an I/O
When client opens a file
Sends a request to the MDS cluster
Receives an i-node number, information about file size and
striping strategy and a capability
Capability specifies authorized operations on file
Not yet encrypted
Client uses CRUSH to locate object replicas
Client releases capability at close time
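A minimal sketch of this open/I-O/close flow follows. All names (mds_open, locate_replicas, Capability) and values are hypothetical stand-ins, not Ceph's client API; the CRUSH step is simulated with a simple arithmetic hash.

```python
from dataclasses import dataclass

# Minimal sketch of the open/IO/close flow described above.
# All class and function names here are hypothetical, not Ceph's actual API.

@dataclass
class Capability:
    inode: int
    allowed_ops: set          # e.g. {"read"} or {"read", "write"}

@dataclass
class OpenReply:
    inode: int
    file_size: int
    stripe_unit: int          # striping strategy: bytes of file data per object
    capability: Capability

def mds_open(path, mode):
    """Simulated MDS reply: i-node number, size, striping info, and a capability."""
    ops = {"read"} if mode == "r" else {"read", "write"}
    return OpenReply(inode=1042, file_size=3 * 2**20, stripe_unit=2**20,
                     capability=Capability(1042, ops))

def locate_replicas(inode, object_no):
    """Stand-in for CRUSH: the client computes replica locations itself."""
    return [f"osd{(inode * 31 + object_no * 17 + r) % 8}" for r in range(2)]

reply = mds_open("/data/xyz", "r")
object_no = (2 * 2**20) // reply.stripe_unit      # which object holds byte offset 2 MB
print("read object", object_no, "from", locate_replicas(reply.inode, object_no))
# At close, the client releases the capability back to the MDS (not shown).
```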
 
Client synchronization (I)
POSIX requires
One-copy serializability
Atomicity of writes
When MDS detects conflicting accesses by different clients
to the same file
Revokes all caching and buffering permissions for that file
Requires synchronous I/O to the file
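The revocation rule can be sketched as a simple policy decision. The function below is illustrative only, not the MDS's actual capability protocol: it merely encodes "a writer plus any other client on the same file forces synchronous I/O".

```python
# Toy sketch of the rule described above: when the (simulated) MDS sees both a
# writer and other clients on the same file, it revokes caching/buffering and
# forces synchronous I/O.  Illustrative logic only, not Ceph's code.

def io_mode_for_file(openers):
    """openers: list of (client_id, mode) pairs, mode 'r' or 'w'."""
    writers = [c for c, m in openers if m == "w"]
    if writers and len(openers) > 1:
        return "synchronous"   # conflicting access: no client-side caching
    return "cached"            # single writer or readers only: caching allowed

print(io_mode_for_file([("a", "r"), ("b", "r")]))   # cached
print(io_mode_for_file([("a", "w")]))               # cached
print(io_mode_for_file([("a", "w"), ("b", "r")]))   # synchronous
```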
Client synchronization (II)
Synchronization handled by OSDs
Locks can be used for writes spanning object boundaries
Synchronous I/O operations have huge latencies
Many scientific workloads do significant amount of read-write
sharing
POSIX extension lets applications synchronize their concurrent accesses to a file
Namespace operations
Managed by the MDSs
Read and update operations are all synchronously applied to
the metadata
Optimized for common case
readdir returns contents of whole directory (as NFS readdirplus does)
Guarantees serializability of all operations
Can be relaxed by application
The MDS cluster
Storing metadata
Dynamic subtree partitioning
Mapping subdirectories to MDSs
Storing metadata
Most requests likely to be satisfied from MDS in-memory cache
Each MDS logs its update operations in a lazily-flushed journal
Facilitates recovery
Directories
Include i-nodes
Stored on the OSD cluster
Dynamic subtree partitioning
Ceph uses primary copy approach to cached metadata
management
Ceph adaptively distributes cached metadata across MDS nodes
 Each MDS measures popularity of data within a directory
Ceph migrates and/or replicates hot spots
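A toy sketch of that idea: each MDS tracks per-subtree popularity and the hottest subtree moves to a less loaded server. The counters, threshold, and dictionary layout below are invented for illustration; real Ceph uses exponentially decayed popularity counters and richer policies.

```python
# Illustrative load balancer: move the hottest directory subtree from the
# busiest MDS to the least loaded one.  Not Ceph's actual balancing code.

mds_load = {"mds0": {"/home": 900, "/proj": 50},
            "mds1": {"/tmp": 20}}

def rebalance(load):
    totals = {mds: sum(subtrees.values()) for mds, subtrees in load.items()}
    busiest = max(totals, key=totals.get)
    idlest = min(totals, key=totals.get)
    if totals[busiest] - totals[idlest] < 100:      # arbitrary imbalance threshold
        return
    hot_subtree = max(load[busiest], key=load[busiest].get)
    load[idlest][hot_subtree] = load[busiest].pop(hot_subtree)
    print(f"migrated {hot_subtree} from {busiest} to {idlest}")

rebalance(mds_load)
print(mds_load)
```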
Mapping subdirectories to MDSs
Replicating “hot” directories
Heavily read directories
Many  file accesses
Selectively replicated over different nodes
Heavily written directories
Many file creations
 Hashed across the cluster
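The two policies can be sketched as a single dispatch function. The MDS names, the read/write threshold, and the MD5 hash below are assumptions made for the example, not Ceph's actual mechanism.

```python
import hashlib

# Sketch of the two policies above: read-hot directories are replicated so any
# replica can serve lookups; write-hot directories have their entries hashed
# across the MDS cluster.  Illustrative only.

mds_nodes = ["mds0", "mds1", "mds2", "mds3"]

def mds_for_lookup(directory, filename, stats):
    if stats["writes"] > stats["reads"]:
        # Write-hot: spread file creations by hashing the name across the cluster.
        h = int(hashlib.md5(f"{directory}/{filename}".encode()).hexdigest(), 16)
        return [mds_nodes[h % len(mds_nodes)]]
    # Read-hot: the directory is replicated on several MDSs; any replica serves reads.
    return mds_nodes[:2]

print(mds_for_lookup("/spool", "job42", {"reads": 10, "writes": 5000}))
print(mds_for_lookup("/usr/lib", "libc.so", {"reads": 9000, "writes": 1}))
```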
Distributed object storage
Data distribution with CRUSH
Replication
Data safety
Recovery and cluster updates
EBOFS
Data distribution with CRUSH (I)
Wanted to avoid storing object addresses in MDS cluster
Ceph first maps objects into placement groups (PGs) using a hash function
Placement groups are then assigned to OSDs using a
pseudo-random function (CRUSH)
Clients know that function
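A minimal sketch of this two-step mapping, assuming 64 placement groups, 8 OSDs, and 2 replicas (all made up): a seeded pseudo-random choice stands in for the real CRUSH function, which walks a weighted hierarchy of failure domains.

```python
import hashlib
import random

# Minimal sketch of the two-step mapping described above.  A seeded PRNG
# stands in for CRUSH; the parameters (64 PGs, 8 OSDs, 2 replicas) are made up.

NUM_PGS = 64
OSDS = [f"osd{i}" for i in range(8)]     # the (simplified) cluster map

def object_to_pg(object_name):
    """Step 1: hash the object name into a placement group."""
    return int(hashlib.sha1(object_name.encode()).hexdigest(), 16) % NUM_PGS

def pg_to_osds(pg, cluster_map, replicas=2):
    """Step 2: a deterministic pseudo-random choice of OSDs for this PG."""
    rng = random.Random(pg)              # same PG + same map -> same answer
    return rng.sample(cluster_map, replicas)

obj = "inode1042.00000002"               # hypothetical object name
pg = object_to_pg(obj)
print(obj, "-> PG", pg, "->", pg_to_osds(pg, OSDS))
```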
Data distribution with CRUSH (II)
To access an object, client needs to know
Its placement group
The OSD cluster map
The object placement rules used by CRUSH
Replication level
Placement constraints
How files are striped
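The "How files are striped" slide shows this as a figure; the arithmetic behind it can be sketched as follows. The 1 MB stripe unit and the inode-plus-index object naming are simplifying assumptions, not Ceph's exact scheme.

```python
# Sketch of mapping a byte range of a file onto its objects.
# Stripe unit and object naming are assumed for illustration.

STRIPE_UNIT = 1 << 20     # bytes of file data per object (assumed)

def byte_range_to_objects(inode, offset, length):
    """Return (object_name, offset_in_object, byte_count) pieces covering the range."""
    pieces = []
    end = offset + length
    while offset < end:
        obj_index = offset // STRIPE_UNIT
        obj_off = offset % STRIPE_UNIT
        n = min(STRIPE_UNIT - obj_off, end - offset)
        pieces.append((f"{inode:x}.{obj_index:08x}", obj_off, n))
        offset += n
    return pieces

# A 2.5 MB read starting at 0.5 MB touches three consecutive objects:
for piece in byte_range_to_objects(0x1042, 512 * 1024, 2560 * 1024):
    print(piece)
```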
Replication
Ceph’s Reliable Autonomic Distributed Object Store (RADOS) autonomously
manages object replication
First non-failed OSD in object’s replication list acts as a primary
copy
Applies each update locally
Increments object’s version number
Propagates the update
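A sketch of that primary-copy path (class and object names are hypothetical): the first non-failed OSD in the replica list applies the write locally, increments the object version, then forwards the update to the remaining replicas.

```python
# Illustrative primary-copy replication, not Ceph's implementation.

class OSD:
    def __init__(self, name, failed=False):
        self.name, self.failed = name, failed
        self.store = {}                       # object -> (version, data)

    def apply(self, obj, data, version):
        self.store[obj] = (version, data)

def replicated_write(replica_list, obj, data):
    live = [osd for osd in replica_list if not osd.failed]
    primary, others = live[0], live[1:]       # first non-failed OSD is primary
    version = primary.store.get(obj, (0, b""))[0] + 1
    primary.apply(obj, data, version)         # apply locally first
    for osd in others:                        # then propagate the update
        osd.apply(obj, data, version)
    return version

osds = [OSD("osd3", failed=True), OSD("osd7"), OSD("osd1")]
print("committed version", replicated_write(osds, "inode1042.00000000", b"hello"))
```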
Data safety
Achieved by update process
1.
Primary forwards updates to other replicas
2.
Sends ACK to client once all replicas have received the
update
Slower but safer
3.
Replicas send final commit once they have committed update
to disk
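The two acknowledgements can be sketched as a tiny state machine: an "ack" once every replica holds the update in memory, and a later "commit" once every replica has written it to disk. The dictionary-based replicas and event names below are illustrative only.

```python
# Illustrative two-phase acknowledgement for a replicated write.

def write_with_two_acks(replicas, update):
    # Phase 1: the primary forwards the update; replicas buffer it in memory.
    for r in replicas:
        r["memory"].append(update)
    yield "ack"            # update received everywhere, but not yet on disk

    # Phase 2: replicas flush to disk, then the final commit is sent.
    for r in replicas:
        r["disk"].extend(r["memory"])
        r["memory"].clear()
    yield "commit"         # slower, but the update now survives crashes

replicas = [{"memory": [], "disk": []} for _ in range(3)]
for event in write_with_two_acks(replicas, b"data"):
    print("client received:", event)
```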
Committing writes
Recovery and cluster updates
RADOS (Reliable Autonomic Distributed Object Store)
monitors OSDs to detect failures
Recovery handled by same mechanism as deployment of new
storage
Entirely driven by individual OSDs
Low-level storage management
Most DFS use an existing local file system to manage
low-level storage
Hard to understand when object updates are safely committed
on disk
Could use journaling or synchronous writes
Big performance penalty
Low-level storage management
Each Ceph OSD manages its local object storage with EBOFS
(Extent and B-Tree based Object File System)
B-Tree service locates objects on disk
Block allocation is conducted in terms of extents to keep data
compact
Well-defined update semantics
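A toy sketch of extent-based allocation in the spirit of EBOFS, assuming a first-fit policy and an invented free list; EBOFS's actual data structures differ.

```python
# Free space tracked as (start_block, length) extents rather than single blocks;
# a write is placed in one contiguous run when possible to keep data compact.
# Simplified illustration, not EBOFS's real allocator.

free_extents = [(0, 64), (100, 256), (400, 32)]   # (start block, length) pairs

def allocate(blocks_needed):
    """First-fit allocation of a single contiguous extent."""
    for i, (start, length) in enumerate(free_extents):
        if length >= blocks_needed:
            if length == blocks_needed:
                free_extents.pop(i)
            else:
                free_extents[i] = (start + blocks_needed, length - blocks_needed)
            return (start, blocks_needed)          # one extent keeps the data compact
    raise RuntimeError("no contiguous extent large enough")

print(allocate(48))        # -> (0, 48): carved out of the 64-block extent
print(free_extents)
```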
Performance and scalability
Want to measure
Cost of updating replicated data
Throughput  and latency
Overall system performance
Scalability
Impact of MDS cluster size on latency
Impact of replication (I)
Impact of replication (II)
Transmission times dominate for large synchronized writes
File system performance
Scalability
Switch is saturated at 24 OSDs
Impact of MDS cluster size on latency
Conclusion
Ceph addresses three critical challenges of modern DFS
Scalability
Performance
Reliability
Achieved through reducing the workload of the MDS
CRUSH
Autonomous repairs of OSDs
Why this strange name?
Ceph stands for cephalopod
Reminds us of
The multitasking of the octopus
The banana slug, the official UCSC mascot