Chain Replication for High Throughput Object Storage

 
Object Storage on CRAQ
High-throughput chain replication for read-mostly workloads

Jeff Terrace
Michael J. Freedman
Data Storage Revolution

Relational databases are giving way to object storage (put/get) systems:
Dynamo, PNUTS, CouchDB, MemcacheDB, Cassandra
Design goals: speed, scalability, availability, throughput, no complexity
Eventual Consistency

[Diagram: a manager forwards a write request to some replicas while concurrent read requests at other replicas may return either the old value (A) or the new value (B)]
 
Writes are ordered only after commit
Reads can be out-of-order or stale
Pro: easy to scale, high throughput
Con: difficult application programming model
Traditional Solution to Consistency

Two-phase commit:
1. Prepare
2. Vote: Yes
3. Commit
4. Ack

[Diagram: a manager runs the protocol between the writer and four replicas]
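The four-message exchange above can be sketched as a toy coordinator and replicas. This is a minimal illustration with hypothetical class names; a real 2PC implementation also needs durable logging and timeout handling:

```python
# Minimal two-phase commit sketch: a manager coordinates one write
# across several replicas. Names are illustrative; real 2PC requires
# durable logs and timeouts for crashed participants.

class Replica:
    def __init__(self):
        self.value = None
        self.staged = None

    def prepare(self, value):        # 1. Prepare
        self.staged = value
        return True                  # 2. Vote: Yes

    def commit(self):                # 3. Commit
        self.value = self.staged
        return True                  # 4. Ack

def two_phase_commit(replicas, value):
    # Phase 1: every replica must vote yes before anything commits.
    if not all(r.prepare(value) for r in replicas):
        return False
    # Phase 2: commit everywhere and collect acks.
    return all(r.commit() for r in replicas)

replicas = [Replica() for _ in range(4)]
ok = two_phase_commit(replicas, "B")
```

Note the cost visible even in this sketch: the writer blocks on two full rounds of messages to every replica, which is why the next slide calls the implementation expensive.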
Strong Consistency

Reads and writes strictly ordered
Pro: easy programming
Con: expensive implementation; doesn't scale well
Our Goal

Easy programming
Easy to scale, high throughput
Chain Replication
van Renesse & Schneider (OSDI 2004)

[Diagram: writes (W1, W2) enter at the HEAD and propagate down the chain; reads (R1, R2, R3) are served only by the TAIL]
Chain Replication

Strong consistency
Simple replication
Increases write throughput
But read throughput is low, since only the tail serves reads

Can we increase read throughput?
Insight: most applications are read-heavy (often 100:1 reads to writes)
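The basic chain scheme can be sketched in a few lines. This is a hypothetical in-memory model: synchronous propagation stands in for the chain's real asynchronous message passing, and the tail's copy defines what is committed:

```python
# Chain replication sketch (after van Renesse & Schneider): writes
# enter at the head and propagate node by node toward the tail; reads
# go only to the tail, so a reader never observes an uncommitted write.

class ChainNode:
    def __init__(self):
        self.store = {}
        self.next = None             # successor toward the tail

    def write(self, key, value):
        self.store[key] = value
        if self.next:                # propagate down the chain
            self.next.write(key, value)

def make_chain(n):
    nodes = [ChainNode() for _ in range(n)]
    for a, b in zip(nodes, nodes[1:]):
        a.next = b
    return nodes

chain = make_chain(4)
head, tail = chain[0], chain[-1]
head.write("x", 1)      # W1 enters at the head
print(tail.store["x"])  # R1 at the tail sees the committed value: 1
```

The sketch also makes the bottleneck obvious: every read lands on the single tail node, which is exactly what CRAQ relaxes.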
CRAQ

Two states per object: clean and dirty

[Diagram: every node in the chain, HEAD through TAIL, holds a clean copy of version V1]
CRAQ

Two states per object: clean and dirty
Any node can serve a read:
If its latest version is clean, it returns that value immediately
If dirty, it contacts the tail for the latest committed version number

[Diagram: as a write of V2 propagates from HEAD toward TAIL, nodes holding both V1 and V2 are dirty; once the tail receives V2, reads anywhere in the chain return V2]
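The clean/dirty read rule can be sketched as follows. Class and method names are illustrative and there is no networking; the tail's copy is treated as committed by definition:

```python
# CRAQ read sketch: each node keeps a small version history per key.
# A node whose latest version is clean answers locally; a dirty node
# asks the tail which version number has committed and returns that
# version from its own history.

class CraqNode:
    def __init__(self, tail=None):
        self.versions = {}            # key -> {version: value}
        self.clean = {}               # key -> latest committed version number
        self.dirty = {}               # key -> latest received version number
        self.tail = tail if tail is not None else self

    def apply_write(self, key, ver, value):
        self.versions.setdefault(key, {})[ver] = value
        self.dirty[key] = ver
        if self is self.tail:         # the tail's copy commits by definition
            self.ack(key, ver)

    def ack(self, key, ver):          # the tail's ACK marks a version clean
        self.clean[key] = ver

    def read(self, key):
        if self.dirty[key] == self.clean.get(key):
            return self.versions[key][self.clean[key]]   # clean: answer locally
        ver = self.tail.clean[key]    # dirty: ask the tail for the committed version
        return self.versions[key][ver]

tail = CraqNode()
mid = CraqNode(tail)

for node in (mid, tail):
    node.apply_write("x", 1, "old")   # V1 reaches every node
mid.ack("x", 1)                       # tail's ACK makes mid clean

mid.apply_write("x", 2, "new")        # V2 in flight: mid is now dirty
guarded = mid.read("x")               # dirty, so mid consults the tail: "old"

tail.apply_write("x", 2, "new")       # V2 reaches the tail: committed
mid.ack("x", 2)                       # tail's ACK propagates
committed = mid.read("x")             # clean again: "new"
print(guarded, committed)             # old new
```

The key property shown: a dirty node only sends a small version-number query to the tail, not the value itself, yet the reader still never observes an uncommitted write.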
Multicast Optimizations

Each chain forms a multicast group
The tail multicasts ACKs rather than passing them back up the chain node by node

[Diagram: after the tail receives V2, its multicast ACK lets every node mark V2 clean and drop V1]
Multicast Optimizations

Each chain forms a multicast group
The tail multicasts ACKs
The head multicasts write data, so the full value need not travel hop by hop

[Diagram: a write of V3 is multicast by the head; each node holds V2 plus a dirty V3 until the tail's ACK arrives]
CRAQ Benefits

From chain replication:
Strong consistency
Simple replication
Increased write throughput

Additional contributions:
Read throughput scales: Chain Replication with Apportioned Queries
Supports eventual consistency
High Diversity

Many data storage systems assume locality: well-connected, low-latency nodes
Real large applications are geo-replicated:
to provide low latency
for fault tolerance

(source: Data Center Knowledge)
Multi-Datacenter CRAQ

[Diagram: a single chain spans DC1 (HEAD), DC2, and DC3 (TAIL), with several replicas per datacenter]

[Diagram: with CRAQ, clients in each datacenter read from nearby replicas instead of crossing the wide area to the tail]
Chain Configuration

Motivation                        Solution
1. Popular vs. scarce objects     1. Specify chain size
2. Subset relevance               2. List datacenters: dc1, dc2, … dcN
3. Datacenter diversity           3. Separate sizes: dc1, chain_size1, …
4. Write locality                 4. Specify master
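The four options above might be expressed as per-object configuration records along these lines. The field names are illustrative, not CRAQ's actual format:

```python
# Hypothetical chain-configuration records, one per motivation listed
# above. These are sketches of intent, not CRAQ's real wire format.

# 1. Popular object: simply request a longer chain for more read capacity.
cfg_popular = {"chain_size": 7}

# 2. Subset relevance: pin the chain to the datacenters that use the object.
cfg_subset = {"chain_size": 2, "datacenters": ["dc1", "dc3"]}

# 3. Datacenter diversity: give each datacenter its own chain size.
cfg_sizes = {"datacenters": [("dc1", 3), ("dc3", 2)]}

# 4. Write locality: name a master datacenter to host the head and tail,
#    so writes commit without crossing the wide area.
cfg_master = {"datacenters": [("dc1", 3), ("dc2", 2)], "master": "dc1"}
```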
Master Datacenter

[Diagram: DC1, the master datacenter, holds the chain's HEAD and TAIL and receives the writer's requests; replicas in DC2 and DC3 extend the chain]
 
Implementation

Approximately 3,000 lines of C++
Uses Tame extensions to SFS asynchronous I/O and RPC libraries
Network operations use Sun RPC interfaces
Uses Yahoo's ZooKeeper for coordination
Coordination Using ZooKeeper

Stores chain metadata
Monitors/notifies about node membership

[Diagram: CRAQ nodes in DC1, DC2, and DC3 coordinate through a ZooKeeper instance in each datacenter]
Evaluation

Does CRAQ scale vs. CR?
How does write rate impact performance?
Can CRAQ recover from failures?
How does the WAN affect CRAQ?

Tests use the Emulab network emulation testbed
Read Throughput as Writes Increase
Failure Recovery (Read Throughput)
Failure Recovery (Latency)
Geo-replicated Read Latency

[Figures: throughput and latency plots, with time in seconds on the x-axis]
If Single-Object Put/Get Is Insufficient

Test-and-set, append, increment:
trivial to implement; the head alone can evaluate them
Multi-object transactions within a single chain:
can still be performed easily; the head alone can evaluate them
Transactions across multiple chains:
an agreement protocol (2PC) can be used, and only the heads of the chains need to participate, though it degrades performance (use carefully!)
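A head-evaluated operation such as increment can be sketched like this: since the head already serializes all writes, it can read-modify-write locally and emit the result as an ordinary chain write. The names are hypothetical and chain propagation is represented only by a log:

```python
# Sketch of a head-evaluated atomic operation. The chain's head is the
# single serialization point for writes, so it can evaluate increment
# without coordinating with any other node: read its local copy, modify,
# and propagate the new value as a normal write.

class Head:
    def __init__(self):
        self.store = {}               # head's local copy of the objects
        self.log = []                 # writes to propagate down the chain

    def increment(self, key, delta=1):
        # No locking against other writers is needed: all writes for
        # this chain pass through this one node in order.
        new = self.store.get(key, 0) + delta
        self.store[key] = new
        self.log.append((key, new))   # flows down the chain like any write
        return new

head = Head()
head.increment("counter")
head.increment("counter", 5)
print(head.store["counter"])   # 6
```

Test-and-set and append follow the same pattern: only the read-modify step changes, and the rest of the chain sees nothing but ordinary versioned writes.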
Summary

CRAQ's contributions:
Challenges the trade-off of consistency vs. throughput
Provides strong consistency
Throughput scales linearly for read-mostly workloads
Supports wide-area deployments of chains
Provides atomic operations and transactions

Thank you. Questions?

