PELICAN: A Building Block for Exascale Cold Data Storage

Shobana Balakrishnan, Richard Black, Austin Donnelly, Paul England, Adam Glass,
Dave Harper, Sergey Legtchenko, Aaron Ogus, Eric Peterson, and Antony Rowstron

Microsoft Research

Data in the cloud
 
The Problem
 
A significant portion of the data in the cloud is rarely accessed; this is known as cold data.

Pelican addresses the need for cost-effective storage of cold data and serves as a basic building block for exabyte-scale cold data storage.
Price versus Latency
 
Right Provisioning
 
Provision resources just for the cold data workload:

Disks:
o Archival and SMR drives instead of commodity drives

Power, cooling, bandwidth:
o Enough for the required workload
o Not enough to keep all disks spinning

Servers:
o Enough for data management instead of 1 server per 40 disks
 
Advantages
 
Benefits of removing unnecessary resources:
 
High density of storage
Low hardware cost
Low operating cost (capped performance)
 
 
Pelican
 
The Pelican Rack
 
The mechanical design, hardware, and storage software stack are co-designed.

Right-provisioned for the cold data workload:

52U rack with 1152 archival-class 3.5" SATA disks.
An average disk size of 4.55 TB provides a total of 5 PB of storage.

Uses 2 servers and no top-of-rack switch.

Only 8% of the disks can be spinning concurrently.

Designed to store blobs that are infrequently accessed.
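
A quick arithmetic check of the capacity figure above (a minimal sketch; decimal units assumed):

```python
# Sanity check of the rack capacity quoted above, assuming decimal TB/PB.
disks = 1152
avg_tb = 4.55
total_pb = disks * avg_tb / 1000   # 1152 * 4.55 TB = 5241.6 TB
print(f"{total_pb:.2f} PB")        # ≈ 5.24 PB, i.e. "a total of 5 PB"
```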
 
Resource Domain
 
Each domain is provisioned to supply its resource to only a subset of the disks.
Each disk uses resources from a set of resource domains.

Domain-conflicting: disks that share a resource domain.
Domain-disjoint: disks that share no resource domains.

Pelican domains: cooling, power, vibration, and bandwidth.
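
As an illustration of these two relations, here is a minimal Python sketch (not from the paper; the disk representation and field names are assumptions): describe each disk by the set of resource domains it draws on and compare by set intersection.

```python
# Minimal sketch (not from the paper) of the two relations above.

def domains(disk):
    """Return the set of (kind, id) resource domains for a disk.
    The field names are illustrative assumptions."""
    return {
        ("power", disk["power_domain"]),
        ("cooling", disk["cooling_domain"]),
        ("vibration", disk["vibration_domain"]),
        ("bandwidth", disk["bandwidth_domain"]),
    }

def domain_conflicting(a, b):
    """Disks that share at least one resource domain."""
    return bool(domains(a) & domains(b))

def domain_disjoint(a, b):
    """Disks that share no resource domains."""
    return not domain_conflicting(a, b)

# Example: two disks in the same cooling domain conflict.
d1 = {"power_domain": 0, "cooling_domain": 3, "vibration_domain": 7, "bandwidth_domain": 1}
d2 = {"power_domain": 5, "cooling_domain": 3, "vibration_domain": 9, "bandwidth_domain": 2}
assert domain_conflicting(d1, d2) and not domain_disjoint(d1, d2)
```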
 
Handling Right-Provisioning
 
Constraints over sets of active disks:
Hard: power, cooling, failure domains
Soft: bandwidth, vibration

Constraints in Pelican:
At most 2 active out of 16 disks per power domain
At most 1 active out of 12 disks per cooling domain
At most 1 active out of 2 disks per vibration domain
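
A minimal sketch of how these caps could be checked for a proposed set of active disks (field names and structure are assumptions, not Pelican's code):

```python
# Count active disks per resource domain and flag any violation of the caps
# listed above. Field names are illustrative assumptions.
from collections import Counter

ACTIVE_LIMITS = {"power": 2, "cooling": 1, "vibration": 1}   # max active per domain

def violates_constraints(active_disks):
    counts = Counter()
    for d in active_disks:
        for kind in ACTIVE_LIMITS:
            counts[(kind, d[f"{kind}_domain"])] += 1
    return any(n > ACTIVE_LIMITS[kind] for (kind, _dom), n in counts.items())

# Side note: rack-wide, the cooling cap is the binding one:
# 1152 disks / 12 per cooling domain * 1 active = 96 disks, i.e. roughly 8% of
# the rack, which matches the "only 8% spinning" figure quoted earlier.
```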
 
Schematic representation

(Figure: schematic of the 1152 disks as a 12 × 6 × 16 arrangement.)
 
Data Layout
 
Objective: maximize the number of requests that can be serviced concurrently while operating within the constraints.

Each blob is stored over a set of disks.
A blob is split into a sequence of 128 kB fragments. For every k data fragments, r additional redundancy fragments are generated.
The k+r fragments form a stripe.

Pelican statically partitions disks into groups; disks within a group can be concurrently active. This concentrates all conflicts over a few sets of disks.
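
To make the fragment/stripe structure concrete, here is a minimal sketch assuming k = 15, r = 3 (the values given on the next slide) and a toy XOR parity standing in for the real erasure code:

```python
# Stripe-layout sketch. The real system uses a proper erasure code; a single
# XOR parity repeated r times is only a stand-in to show the k+r structure.
FRAGMENT_SIZE = 128 * 1024  # 128 kB fragments
K, R = 15, 3                # k data + r redundancy fragments per stripe

def split_fragments(blob: bytes):
    """Split a blob into fixed-size fragments, zero-padding the last one."""
    return [blob[i:i + FRAGMENT_SIZE].ljust(FRAGMENT_SIZE, b"\0")
            for i in range(0, max(len(blob), 1), FRAGMENT_SIZE)]

def xor_fragments(frags):
    """Toy parity: byte-wise XOR of a list of fragments (not the real code)."""
    out = bytearray(FRAGMENT_SIZE)
    for frag in frags:
        for i, b in enumerate(frag):
            out[i] ^= b
    return bytes(out)

def build_stripes(blob: bytes):
    """Yield stripes of K data fragments plus R redundancy fragments each."""
    frags = split_fragments(blob)
    for i in range(0, len(frags), K):
        data = frags[i:i + K]
        while len(data) < K:                      # pad a short final stripe
            data.append(bytes(FRAGMENT_SIZE))
        redundancy = [xor_fragments(data)] * R    # placeholder redundancy
        yield data + redundancy                   # k + r fragments = one stripe
```
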
Data Placement
 
 
Group property: any two groups are either fully conflicting or fully independent.

In Pelican: 48 groups of 24 disks
o 4 classes of 12 fully-conflicting groups

A blob is stored within a single group, spread over 18 of its disks (15 data + 3 redundancy fragments per stripe).

Off-rack metadata called the "catalog" holds the blob placement information.
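
A minimal sketch of what placement under this layout could look like (the hash-based group choice, random disk choice within the group, and catalog entry shape are illustrative assumptions, not the paper's algorithm):

```python
# Placement sketch: pick one of the 48 groups for a blob, then choose 18 of the
# group's 24 disks to hold a stripe's fragments; record the result off-rack.
import hashlib
import random

NUM_GROUPS, DISKS_PER_GROUP = 48, 24
DISKS_PER_STRIPE = 15 + 3            # k + r fragments -> 18 distinct disks

def place_blob(blob_id: str):
    """Return (group, disks) for a blob; choices here are illustrative."""
    h = int(hashlib.sha256(blob_id.encode()).hexdigest(), 16)
    group = h % NUM_GROUPS
    rng = random.Random(h)                          # deterministic per blob
    disks = rng.sample(range(DISKS_PER_GROUP), DISKS_PER_STRIPE)
    return group, disks

# Example "catalog" entry for one blob.
group, disks = place_blob("blob-42")
catalog_entry = {"blob": "blob-42", "group": group, "disks": sorted(disks)}
```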
 
Advantages
 
Groups encapsulate the constraints and also reduce the time to recover from a failed disk.

Since a blob is stored within a single group, rebuild operations use disks that can spin concurrently.

Simplifies the IO scheduler's task.

The probability of conflict grows as O(n), an improvement over the O(n²) of random placement.
 
IO Scheduler
 
Traditional disk schedulers reorder IOs to minimize seek latency.

Pelican reorders requests to minimize the impact of spin-up latency.

There are four independent schedulers; each services the requests for its class, and reordering happens at the class level.
Each scheduler uses two queues: one for rebuild operations and one for other operations.
Reordering amortizes the group spin-up latency over a set of operations.
Rate limiting manages the interference between rebuild and other operations.
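
The sketch below (my own illustration with simplified request dictionaries, not the paper's implementation) shows the shape of such a per-class scheduler: two queues, reordering toward the currently spun-up group, and rate limiting of rebuild traffic.

```python
# Minimal per-class scheduler sketch: reorder toward the active group to
# amortize spin-up latency, and cap rebuild IO relative to client IO.
from collections import deque

class ClassScheduler:
    def __init__(self, rebuild_every=10):
        self.client = deque()               # client (non-rebuild) requests
        self.rebuild = deque()              # rebuild requests
        self.rebuild_every = rebuild_every  # at most 1 rebuild per N client IOs
        self.since_rebuild = 0
        self.active_group = None            # group currently spun up for this class

    def submit(self, request, is_rebuild=False):
        """request is a dict with at least a 'group' field (assumed shape)."""
        (self.rebuild if is_rebuild else self.client).append(request)

    def _take_for_group(self, queue, group):
        """Prefer a request for the already-active group; else take the oldest."""
        for i, req in enumerate(queue):
            if req["group"] == group:
                del queue[i]
                return req
        return queue.popleft() if queue else None

    def next_request(self):
        """Pick the next IO, limiting rebuild traffic and batching by group."""
        want_rebuild = bool(self.rebuild) and (
            not self.client or self.since_rebuild >= self.rebuild_every)
        queue = self.rebuild if want_rebuild else self.client
        req = self._take_for_group(queue, self.active_group)
        if req is None:
            return None
        self.since_rebuild = 0 if want_rebuild else self.since_rebuild + 1
        if req["group"] != self.active_group:
            self.active_group = req["group"]   # group switch implies a spin-up
        return req
```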
 
 
Implementation Considerations
 
 
Power Up In Standby (PUIS) ensures disks do not spin up without the Pelican storage stack managing the constraints.

The group abstraction is exploited to parallelize the boot process:
Initialization is done at the group level.
If there are no user requests, four groups are initialized concurrently.
It takes around 3 minutes to initialize the entire rack.

Unexpected disk spin-ups are prevented by adding a special No Access flag to the Windows Server driver stack.

The BIOS on the server is modified to ensure it can handle all 72 HBAs.
 
 
Evaluation

Comparison against a system organized like Pelican but with full provisioning (FP) for power and cooling.
The FP system uses the same physical internal topology, but its disks are never spun down.
 
Simulator Cross-Validation
 
• Burst workload, varying burst intensity

The simulator accurately predicts real system behavior for all metrics.
Performance – Rack Throughput

Service Time

Time to first byte

Power Consumption

Scheduler Performance
 
(Figures: rebuild time, impact on client requests.)
Cost of Fairness

(Figures: impact on throughput and latency, impact on reject rate.)
 
Disk Lifetime
 
(Figure: disk statistics as a function of the workload.)
 
Pros
• Reduced cost
• Reduced power consumption
• Erasure codes for fault tolerance
• Hardware abstraction simplifies the IO scheduler's work

Cons
• Tight constraints: less flexible to change
• Sensitive to hardware changes
• No justification given for some of the configuration decisions
• Unclear whether the design is optimal
Discussion
 
They claim the values they chose are a sweet spot for the hardware. How is performance affected by changes to these values?

More importantly, how do we handle systems with different configurations and hardware?

Could the synthesis of the data layout be automated? Thoughts?
 