Cost/Performance in Modern Data Stores: How Data Caching Systems Succeed

David Lomet
Microsoft Research
Redmond, WA 98052
SoCal 2019
Traditional Databases are Challenged

Traditional database systems are challenged by:
Streaming systems
Column stores
Main memory row stores
And the threat is perceived to be growing
Traditional Data Caching Systems are Dead
MSFT SQL Server
IBM DB2
Oracle Database
Amazon Aurora

The Elephants Still Dance
WHY?
Here’s the story:
[Figure: the challengers (main memory row store, column store, streaming) became niche markets, partnerships, or were absorbed into the traditional database systems]
WHY?
Economic Factors

Hardware infrastructure needs to be
Cheap => commodity hardware
Software infrastructure needs to be
Efficient => high performance on this hardware
Elastic provisioning
Free up expensive resources when otherwise under-utilized
Give resources to those with higher value (willing to pay more)
Under-provision for SLAs and "pocket" the difference
Money Talks!
Traditional Database Systems
Are
Caching Data Stores
Traditional DBMSs have good Cost/Performance
 
We need to move away from being fixated only on performance
Costs vary enormously between systems
We describe why traditional systems
Have lower performance than recent research systems
Yet succeed by having better cost/performance
And suggest a change in focus for our field
 
 
Caching Data Store … Uses a Cache

Moves data from secondary storage into main memory when it is to be operated on
Removes data from main memory when it is no longer being operated on
Writes it back to secondary storage when it has been changed
Drops it from main memory when unchanged
Data lives permanently on secondary storage
Result: Two Forms of Operation

MM: main memory operation
VERY fast: N*10**6 ops/sec
SS: secondary storage operation
About 5.8X the execution time of an MM operation (our system)
Must do the MM operation + I/O to bring data into main memory

Caching data stores look like a BAD idea
Mixed Operation Performance

Weighted average of MM and SS operations: R = SS/MM ~ 5.8 +/- 30%
[Chart: relative performance vs percent of SS operations; expected performance (max/min) and actual performance (1-core, 4-core)]
A Performance Disaster?
 
Each added SS operation pushes performance lower
Toward SLOW SS performance
Away from FAST MM performance
Why Bother???
Costs matter in addition to performance
So let’s look at Cost/Performance
 
Two Costs for Operating on Data
 
Storage Cost
Always paid
Most of cost for cold data
Or when data becomes cold
Execution Cost
Paid only when data is operated on
Most of cost for hot data
Or when data becomes hot
It's All About Relative Costs

[Chart: relative storage cost per byte (main memory vs flash) and relative execution cost (main memory op vs flash I/O path); approximate Bw-tree relative numbers]
New SSDs support n*100K IOPS
Cost Analysis

We compute cost/sec and plot it against operations/sec executed
Cost is a linear function of storage and execution costs:
Cost/sec = (storage-rental/sec) + (cost/operation)*(operations/sec)
C = A + B*X, where we plot C vs X
A: Storage cost = (cost/byte)*(size of data), constant for an operation on a piece of data
DRAM cost/byte ~ 10X flash cost/byte
At ops/sec = 0, all cost is storage cost [storage cost is the y-intercept]
B: Execution cost = (cost/operation)*(operations/sec)
Cost/operation: SS op ~ 12X MM op
Cost/operation determines the slope of the cost "curve"
Some Details

C = A + B*X
$SS = cost of secondary storage ops/sec
    = {$SSstor + $MMindx} + {$MMexec + ($SSD IO) + ($Proc IO)}*Ops/sec
$MM = cost of main memory ops/sec
    = {$MMstor + $SSstor + $MMindx} + {$MMexec}*Ops/sec
We set $SS = $MM and solve for (Ops/sec)**(-1) = time between ops
$MMindx, $SSstor, and $MMexec all cancel out
Time (between ops) = {($SSD IO) + ($Proc IO)} / {($M/byte) * Msize}
Gray's Five Minute Rule: now ~ one minute
$SSD IO has dropped dramatically, so $Proc IO is now important for cost
Our full analysis also tells you about costs away from break-even
Gray "5-Min" Rule: derived from operation cost

Set ($MM ops/sec cost) = ($SS ops/sec cost) to determine how many ops/sec (how much time between ops) gives equal cost
$SSop = {$MMexec + ($SSD IO) + ($Proc IO)}*Ops/sec + $SSstor + $MMindx
$MMop = {$MMexec}*Ops/sec + $MMstor + $SSstor + $MMindx
$MMstor = ($M/byte)*Msize*Time (between ops)
$SSstor, $MMexec, and $MMindx all cancel out, yielding
Time (between ops) = {($SSD IO) + ($Proc IO)} / {($M/byte) * Msize}
Note that Time depends on Msize, the unit of caching, frequently a page
MM storage costs for B-tree index pages (about 1% of data page cost) are neglected; they cancel out
Cost/Performance

[Chart: RELATIVE COST vs PERFORMANCE (ops/unit time) for the MM op, flash op, and data caching op cost lines]
Caching system uses the lowest cost op: either SS or MM
Gray 5-min rule: now ~ 1 min for a 4K page
How About Other Kinds of Systems?
Main Memory Systems Perform Better

MassTree: high performance main memory data store
Deuteronomy: high performance caching data store
MassTree has higher performance than Deuteronomy's Bw-tree
Relative costs
Execution cost: Deuteronomy is 2.6X MassTree, because MassTree is 2.6X faster
Storage cost: MassTree is 2.3X Deuteronomy, because MassTree uses 2.3X the storage of Deuteronomy
Indeed, data caching systems perform better when fully cached, but with worse cost/performance
 
Cost/Performance: Deuteronomy vs MassTree [not to scale]

[Chart: relative cost vs performance; Deuteronomy has better cost/performance over most of the range, while MassTree is all or nothing, with no switching between systems]
Lowering Storage Cost

Facebook problem: huge data volume, almost all cold
But requiring high performance and decent latency
Buying processors to be able to attach more SSDs
To accommodate the volume of data
Solution: data caching system + data compression
Use SSD storage for decent latency
Scale out processors for good performance
Data compression to lower storage cost
Recall that storage cost is (cost/byte)*(size of data)
At the expense of increased execution cost
Cost/Performance with Compressed Data (hypothetical) [not to scale]

[Chart: MM op, SS op, and compressed-SS op cost lines; cold data (~80%) sits in the low-ops range served by compressed SS at about 1 minute between ops for a 4K page; hot data (~20%) sits in the high-ops range]
Caching system can use the lowest cost operation from a choice of three
NVM Cost/Performance

[Chart: relative cost vs performance with NVM storage added, used either as a memory extension or as an SSD, yielding a deeper memory hierarchy]
Caching system uses the lowest cost op: either SS or MM
Gray 5-min rule: now ~ 1 min for a 4K page
Making a High Performance, Low Cost Data Caching System
How do we provide good cost/perf?

A data caching system is part of the answer
But how do we optimize cost/performance within a data caching system?
Increase in-memory multi-core performance
Increased performance reduces cost, but must be relevant to data caching
Improve single core performance and multi-core scale-out
Reduce the number of I/Os
Blind writes, log structuring, record cache
Reduce data movement cost: SSD to/from main memory
Reduce execution cost of I/O: e.g., user-level I/O
Reduce SSD I/O cost: e.g., SSDs with more IOPS
Modify the SSD: summer project with Jae Young Do and Ivan Picoli
New NVM technology?
Reduce cost of secondary storage
Compress data for fewer bytes
Use flash, not NVRAM (NVRAM ~ 3X cost of flash)
CAVEATS
Two important ones
Our Numbers are APPROXIMATE

Performance based on
A set of point experiments
Executed on our machines
Using our Deuteronomy system
Costs from web prices
Variable at any one time
Changing over time

The general thrust of the results is solid
But your mileage may vary!
Best cost/performance is NOT always best

Low costs are a plus, but not the only thing
We want max[value - cost]
If value is high enough, low cost may be secondary
Value might depend on latency (or peak performance)
Some of the time, lowest latency wins the game and captures all the value
E.g., some stock trading applications
But this is not the common case
Most of the time, interactive system latency is adequate
Anything less than a millisecond per op should be fine
E.g., online shopping
Recap
 
Two costs
Storage: always present
Execution: only when operations execute
When data is:
Cold: storage cost dominates
Hot: execution cost dominates
Data Caching Systems “win” on cost/performance because they
Move hot data to (high cost) main memory cache for low execution cost
Move cold data to (low cost) secondary storage for low storage cost
We should be focusing on data caching improvements!
Talk brought to you by Bw-Tree in Production

SQL Server Hekaton:
Key-sequential index
Lock-free for high concurrency, consistent with Hekaton's non-blocking main memory architecture
In-memory Bw-tree
Azure DocumentDB:
Indexing engine
Rich query processing over a schema-free JSON model, with automatic indexing
Sustained document ingestion at high rates
Bw-tree + LLAMA
Bing ObjectStore:
Sorted key-value store
Supports range queries
Optimized for flash SSDs
Bw-tree + LLAMA
Some References

David B. Lomet: Cost/Performance in Modern Data Stores: How Data Caching Systems Succeed. DaMoN 2018: 9:1-9:10. https://dl.acm.org/citation.cfm?doid=3211922.3211927
Justin J. Levandoski, David B. Lomet, Sudipta Sengupta: The Bw-Tree: A B-tree for New Hardware Platforms. ICDE 2013: 302-313.
Justin J. Levandoski, David B. Lomet, Sudipta Sengupta: LLAMA: A Cache/Storage Subsystem for Modern Hardware. PVLDB 6(10): 877-888 (2013). http://www.vldb.org/pvldb/vol6/p877-levandoski.pdf
Justin J. Levandoski, David B. Lomet, Sudipta Sengupta, Ryan Stutsman, Rui Wang: High Performance Transactions in Deuteronomy. CIDR 2015. http://cidrdb.org/cidr2015/Papers/CIDR15_Paper15.pdf
QUESTIONS?