Cost/Performance in Modern Data Stores: How Data Caching Systems Succeed

David Lomet
Microsoft Research
Redmond, WA 98052
SoCal 2019
Traditional Databases are Challenged

Traditional database systems are challenged by:
Streaming systems
Column stores
Main memory row stores
And the threat is perceived to be growing
Traditional Data Caching Systems are Dead
MSFT SQL Server
IBM DB2
Oracle Database
Amazon Aurora

The Elephants Still Dance
WHY?
Here’s the story:
[Figure: the challengers (main memory row store, column store, streaming) became niche markets, partnerships, or were absorbed into the traditional database systems]
WHY?
Economic Factors

Hardware infrastructure needs to be
Cheap => commodity hardware
Software infrastructure needs to be
Efficient => high performance on this hardware
Elastic provisioning
Free up expensive resources when otherwise under-utilized
Give resources to those with higher value (willing to pay more)
Under-provision for SLAs and "pocket" the difference
Money Talks!
Traditional Database Systems
Are
Caching Data Stores
Traditional DBMSs have good Cost/Performance
 
We need to move away from being fixated only on performance
Costs vary enormously between systems
We describe why traditional systems
Have lower performance than recent research systems
Yet succeed by having better cost/performance
And suggest a change in focus for our field
 
 
Caching Data Store … Uses a Cache

Moves data from secondary storage into main memory when it is to be operated on
Removes data from main memory when it is no longer being operated on
Writes it back to secondary storage when it has been changed
Drops it from main memory when unchanged
Data lives permanently on secondary storage
Result: Two Forms of Operation

MM: main memory operation
VERY fast: N*10**6 ops/sec
SS: secondary storage operation
About 5.8X the execution time of an MM operation (our system)
Must do the MM operation + I/O to bring data into main memory

Caching data stores look like a BAD idea
Mixed Operation Performance

Weighted average of MM and SS operations: R = SS/MM ~ 5.8 +/- 30%
[Chart: relative performance vs percent of SS operations; expected performance (max/min) and actual performance (1-core, 4-core)]
A Performance Disaster?
 
Each added SS operation pushes performance lower
Toward SLOW SS performance
Away from FAST MM performance
Why Bother???
Costs matter in addition to performance
So let’s look at Cost/Performance
 
Two Costs for Operating on Data
 
Storage Cost
Always paid
Most of cost for cold data
Or when data becomes cold
Execution Cost
Paid only when data is operated on
Most of cost for hot data
Or when data becomes hot
It's All About Relative Costs

[Chart: relative storage cost per byte (main memory vs flash) and relative execution cost (main memory op vs flash I/O path); approximate Bw-tree relative numbers]
New SSDs support n*100K IOPS
Cost Analysis

We compute cost/sec and plot it against operations/sec executed
Cost is a linear function of storage and execution costs:
Cost/sec = (storage-rental/sec) + (cost/operation)*(operations/sec)
C = A + B*X, where we plot C vs X
A: Storage cost = (cost/byte)*(size of data), constant for an operation on a piece of data
DRAM cost/byte ~ 10X flash cost/byte
At ops/sec = 0, all cost is storage cost [storage cost is the y-intercept]
B: Execution cost = (cost/operation)*(operations/sec)
Cost/operation: SS op ~ 12X MM op
Cost/operation determines the slope of the cost "curve"
Some Details

C = A + B*X
$SS = cost of secondary storage ops/sec
    = {$SSstor + $MMindx} + {$MMexec + ($SSD IO) + ($Proc IO)}*Ops/sec
$MM = cost of main memory ops/sec
    = {$MMstor + $SSstor + $MMindx} + {$MMexec}*Ops/sec
We set $SS = $MM and solve for (Ops/sec)**(-1) = time between ops
$MMindx, $SSstor, and $MMexec all cancel out
Time (between ops) = {($SSD IO) + ($Proc IO)} / {($M/byte) * Msize}
Gray's Five Minute Rule: now ~ one minute
$SSD IO has dropped dramatically, so $Proc IO is now important for cost
Our full analysis also tells you about costs away from break-even
Gray "5-Min" Rule: derived from operation cost

Set ($MM ops/sec cost) = ($SS ops/sec cost) to determine how many ops/sec (how much time between ops) gives equal cost
$SSop = {$MMexec + ($SSD IO) + ($Proc IO)}*Ops/sec + $SSstor + $MMindx
$MMop = {$MMexec}*Ops/sec + $MMstor + $SSstor + $MMindx
$MMstor = ($M/byte)*Msize*Time (between ops)
$SSstor, $MMexec, and $MMindx all cancel out, yielding
Time (between ops) = {($SSD IO) + ($Proc IO)} / {($M/byte) * Msize}
Note that Time depends on Msize, the unit of caching, frequently a page
MM storage costs for B-tree index pages (about 1% of data page cost) are neglected; they cancel out
Cost/Performance

[Chart: RELATIVE COST vs PERFORMANCE (ops/unit time) for the MM op, flash op, and data caching op cost lines]
Caching system uses the lowest cost op: either SS or MM
Gray 5-min rule: now ~ 1 min for a 4K page
How About Other Kinds of Systems?
Main Memory Systems Perform Better

MassTree: high performance main memory data store
Deuteronomy: high performance caching data store
MassTree has higher performance than Deuteronomy's Bw-tree
Relative costs
Execution cost: Deuteronomy is 2.6X MassTree, because MassTree is 2.6X faster
Storage cost: MassTree is 2.3X Deuteronomy, because MassTree uses 2.3X the storage of Deuteronomy
Indeed, data caching systems perform better when fully cached, but with worse cost/performance
 
Cost/Performance: Deuteronomy vs MassTree [not to scale]

[Chart: relative cost vs performance; Deuteronomy has better cost/performance over most of the range, while MassTree is all or nothing, with no switching between systems]
Lowering Storage Cost

Facebook problem: huge data volume, almost all cold
But requiring high performance and decent latency
Buying processors to be able to attach more SSDs
To accommodate the volume of data
Solution: data caching system + data compression
Use SSD storage for decent latency
Scale out processors for good performance
Data compression to lower storage cost
Recall that storage cost is (cost/byte)*(size of data)
At the expense of increased execution cost
Cost/Performance with Compressed Data (hypothetical) [not to scale]

[Chart: MM op, SS op, and compressed-SS op cost lines; cold data (~80%) sits in the low-ops range served by compressed SS at about 1 minute between ops for a 4K page; hot data (~20%) sits in the high-ops range]
Caching system can use the lowest cost operation from a choice of three
NVM Cost/Performance

[Chart: relative cost vs performance with NVM storage added, used either as a memory extension or as an SSD, yielding a deeper memory hierarchy]
Caching system uses the lowest cost op: either SS or MM
Gray 5-min rule: now ~ 1 min for a 4K page
Making a High Performance, Low Cost Data Caching System
How do we provide good cost/perf?

A data caching system is part of the answer
But how do we optimize cost/performance within a data caching system?
Increase in-memory multi-core performance
Increased performance reduces cost, but must be relevant to data caching
Improve single core performance and multi-core scale-out
Reduce the number of I/Os
Blind writes, log structuring, record cache
Reduce data movement cost: SSD to/from main memory
Reduce execution cost of I/O: e.g., user-level I/O
Reduce SSD I/O cost: e.g., SSDs with more IOPS
Modify the SSD: summer project with Jae Young Do and Ivan Picoli
New NVM technology?
Reduce cost of secondary storage
Compress data for fewer bytes
Use flash, not NVRAM (NVRAM ~ 3X cost of flash)
CAVEATS
Two important ones
Our Numbers are APPROXIMATE

Performance based on
A set of point experiments
Executed on our machines
Using our Deuteronomy system
Costs from web prices
Variable at any one time
Changing over time

The general thrust of the results is solid
But your mileage may vary!
Best cost/performance is NOT always best

Low costs are a plus, but not the only thing
We want max[value - cost]
If value is high enough, low cost may be secondary
Value might depend on latency (or peak performance)
Some of the time, lowest latency wins the game and captures all the value
E.g., some stock trading applications
But this is not the common case
Most of the time, interactive system latency is adequate
Anything less than a millisecond per op should be fine
E.g., online shopping
Recap
 
Two costs
Storage: always present
Execution: only when operations execute
When data is:
Cold: storage cost dominates
Hot: execution cost dominates
Data Caching Systems “win” on cost/performance because they
Move hot data to (high cost) main memory cache for low execution cost
Move cold data to (low cost) secondary storage for low storage cost
We should be focusing on data caching improvements!
Talk brought to you by Bw-Tree in Production

SQL Server Hekaton:
Key-sequential index
Lock-free for high concurrency, consistent with Hekaton's non-blocking main memory architecture
In-memory Bw-tree
Azure DocumentDB:
Indexing engine
Rich query processing over a schema-free JSON model, with automatic indexing
Sustained document ingestion at high rates
Bw-tree + LLAMA
Bing ObjectStore:
Sorted key-value store
Supports range queries
Optimized for flash SSDs
Bw-tree + LLAMA
Some References

David B. Lomet: Cost/Performance in Modern Data Stores: How Data Caching Systems Succeed. DaMoN 2018: 9:1-9:10. https://dl.acm.org/citation.cfm?doid=3211922.3211927
Justin J. Levandoski, David B. Lomet, Sudipta Sengupta: The Bw-Tree: A B-tree for New Hardware Platforms. ICDE 2013: 302-313.
Justin J. Levandoski, David B. Lomet, Sudipta Sengupta: LLAMA: A Cache/Storage Subsystem for Modern Hardware. PVLDB 6(10): 877-888 (2013). http://www.vldb.org/pvldb/vol6/p877-levandoski.pdf
Justin J. Levandoski, David B. Lomet, Sudipta Sengupta, Ryan Stutsman, Rui Wang: High Performance Transactions in Deuteronomy. CIDR 2015. http://cidrdb.org/cidr2015/Papers/CIDR15_Paper15.pdf
QUESTIONS?