Erasure Coding for Small Objects in In-Memory Key-Value Storage
Matt M. T. Yiu, Helen H. W. Chan, Patrick P. C. Lee
The Chinese University of Hong Kong
SYSTOR 2017
1
Introduction
In-memory key-value (KV) stores are widely deployed for scalable, low-latency access
Examples: Memcached, Redis, VoltDB, RAMCloud
Failures are prevalent in distributed storage systems
Replication in DRAM?
High storage overheads
Replication in secondary storage (e.g., HDDs)?
High latency to replicas (especially for random I/Os)
Erasure coding
Minimum data redundancy
Redundant information is stored entirely in memory for low-latency access and fast recovery under stragglers and failures
2
Erasure Coding
Divide data into k data chunks
Encode the data chunks into n - k additional parity chunks
Each collection of n data/parity chunks is called a stripe
Distribute each stripe to n different nodes
Many stripes are stored in large-scale systems
Fault tolerance: any k out of n chunks can recover the data
Redundancy: n/k
3
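To make the coding step concrete, the sketch below encodes a stripe with a single XOR parity chunk (n = k + 1) and rebuilds a lost chunk. It is illustrative only; the deck does not fix a specific code, and a real deployment would typically use Reed-Solomon coding to tolerate more than one failure per stripe.

```python
# Minimal sketch: single-parity (XOR) erasure coding over a stripe.
# Illustrative only; Reed-Solomon would be used in practice for n - k > 1.
from functools import reduce

CHUNK_SIZE = 4096  # 4 KB chunks, matching the unit of erasure coding used later

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode_stripe(data_chunks: list[bytes]) -> bytes:
    """Compute the parity chunk of a stripe of k equal-size data chunks."""
    assert all(len(c) == CHUNK_SIZE for c in data_chunks)
    return reduce(xor_bytes, data_chunks)

def recover_chunk(surviving_chunks: list[bytes]) -> bytes:
    """Rebuild the single missing chunk (data or parity) from the survivors."""
    return reduce(xor_bytes, surviving_chunks)

# k = 4 data chunks; lose data chunk 2 and rebuild it from the rest plus parity.
stripe = [bytes([i]) * CHUNK_SIZE for i in range(4)]
parity = encode_stripe(stripe)
assert recover_chunk([stripe[0], stripe[1], stripe[3], parity]) == stripe[2]
```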
Challenges
Erasure coding is expensive in data updates and failure recovery
Many solutions in the literature
Real-life in-memory storage workloads are dominated by small-size objects
Keys and values can be as small as a few bytes (e.g., 2-3 bytes for values) [Atikoglu, Sigmetrics'12]
Erasure coding is often used for large objects
In-memory KV stores issue decentralized requests without centralized metadata lookup
Need to maintain data consistency when failures happen
4
Our Contributions
Build MemEC, a high-availability, erasure-coding-based in-memory KV store that aims for
Low-latency access
Fast recovery (under stragglers/failures)
Storage efficiency
Propose a new all-encoding data model
Ensure graceful transitions between normal mode and degraded mode
Evaluate MemEC prototype with YCSB workloads
5
Existing Data Models
All-replication
Store multiple replicas for each object in memory
Used by many KV stores (e.g., Redis)
[Figure: Node #1, Node #2, ..., Node #i each store a full copy of the object's key, value, metadata, and reference]
6
Existing Data Models
Hybrid-encoding
Assumption: Value size is sufficiently large
Erasure coding to values only
Replication for key, metadata, and reference to the object
Used by LH*RS [TODS'05], Cocytus [FAST'16]
[Figure: values are stored on Node #1 to Node #k and erasure coded into parity on Node #(k+1) to Node #n, while the key, metadata, and object reference are replicated on all n nodes]
7
Our data model: All-encoding
Apply erasure coding to objects in their entirety
Design specific index structures to limit storage overhead
8
All-encoding: Data Organization
Divide storage into fixed-size chunks (4 KB) as units of erasure coding
A unique fixed-size chunk ID (8 bytes) for chunk identification in a server
9
All-encoding: Data Organization
Each data chunk contains multiple objects
Each object starts with fixed-size metadata, followed by variable-size key and value
10
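As an illustration of this layout, the sketch below packs an object as a minimal fixed-size header (just the key and value lengths; MemEC's actual metadata fields are not specified on the slide) followed by the key and value, and parses it back.

```python
# Hypothetical object layout inside a data chunk: fixed-size metadata header
# (here only key length and value length; the real header has more fields),
# followed by the variable-size key and value.
import struct

HEADER = struct.Struct("<BI")  # assumed: 1-byte key size, 4-byte value size

def pack_object(key: bytes, value: bytes) -> bytes:
    return HEADER.pack(len(key), len(value)) + key + value

def unpack_object(buf: bytes, offset: int = 0):
    """Return (key, value, next_offset) for the object starting at offset."""
    key_len, value_len = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    key = buf[start:start + key_len]
    value = buf[start + key_len:start + key_len + value_len]
    return key, value, start + key_len + value_len

obj = pack_object(b"user4396", b"42")
key, value, _ = unpack_object(obj)
assert (key, value) == (b"user4396", b"42")
```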
All-encoding: Data Organization
Append new objects to a data chunk until the chunk size limit is reached, then seal the data chunk
Sealed data chunks are encoded to form parity chunks belonging to the same stripe
11
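A minimal sketch of this append-and-seal workflow, under the same simplifying assumptions as the earlier sketches (hypothetical object format, 4 KB chunks): objects are appended until the next one would overflow the chunk, then the chunk is sealed and zero-padded, ready to be encoded into parity.

```python
# Sketch of the append-until-full-then-seal workflow (assumed object format).
CHUNK_SIZE = 4096

class DataChunk:
    def __init__(self) -> None:
        self.buf = bytearray()
        self.sealed = False

    def append(self, serialized_object: bytes) -> bool:
        """Append an object; return False if it would exceed the 4 KB limit."""
        if self.sealed or len(self.buf) + len(serialized_object) > CHUNK_SIZE:
            return False
        self.buf += serialized_object
        return True

    def seal(self) -> bytes:
        """Zero-pad to the fixed chunk size; the result is ready for encoding."""
        self.sealed = True
        return bytes(self.buf) + b"\x00" * (CHUNK_SIZE - len(self.buf))

chunk = DataChunk()
# 15-byte object: assumed 5-byte header, 8-byte key "user4396", 2-byte value "42"
while chunk.append(b"\x08\x02\x00\x00\x00" + b"user4396" + b"42"):
    pass                       # keep appending until the chunk is full
sealed = chunk.seal()          # sealed chunks are then encoded into parity chunks
assert len(sealed) == CHUNK_SIZE
```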
All-encoding: Data Organization
Chunk index maps a chunk ID to a chunk reference
Object index maps a key to an object reference
Use cuckoo hashing
No need to keep redundancy for both indexes in memory
12
 
Key-to-chunk mappings are needed for failure recovery, but can be stored in secondary storage
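To make the index structures concrete, here is a toy cuckoo hash table (two tables, two hash functions, displacement on collision) used for both indexes; the table size, hash functions, and reference format are assumptions rather than MemEC's actual parameters.

```python
# Toy cuckoo hash table: two tables, two hash functions, displace on collision.
import hashlib

class CuckooIndex:
    def __init__(self, slots: int = 1024, max_kicks: int = 64) -> None:
        self.slots = slots
        self.max_kicks = max_kicks
        self.tables = [[None] * slots, [None] * slots]

    def _hash(self, key: bytes, which: int) -> int:
        digest = hashlib.blake2b(key, person=bytes([which]) * 16).digest()
        return int.from_bytes(digest[:8], "little") % self.slots

    def get(self, key: bytes):
        for which in (0, 1):
            entry = self.tables[which][self._hash(key, which)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def put(self, key: bytes, ref) -> None:
        # Updates of an existing key are omitted for brevity.
        entry, which = (key, ref), 0
        for _ in range(self.max_kicks):
            slot = self._hash(entry[0], which)
            self.tables[which][slot], entry = entry, self.tables[which][slot]
            if entry is None:
                return           # placed in an empty slot
            which ^= 1           # re-place the evicted entry in the other table
        raise RuntimeError("cuckoo cycle: table needs resizing/rehashing")

object_index = CuckooIndex()   # key -> (chunk ID, offset) object reference
chunk_index = CuckooIndex()    # chunk ID -> in-memory chunk reference
object_index.put(b"user4396", ("chunk-42", 128))
assert object_index.get(b"user4396") == ("chunk-42", 128)
```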
All-encoding: Chunk ID
Chunk ID has three fields:
Stripe list ID: identifying the set of n data and parity servers for the stripe
Determined by hashing a key
Stripe ID: identifying the stripe
Each server increments a local counter when a data chunk is sealed
Chunk position: from 0 to n - 1
Chunks of the same stripe have the same stripe list ID and the same stripe ID
[Figure: main memory lays out chunks back to back, each as an 8-byte chunk ID followed by a 4 KB chunk holding objects O1-O6]
13
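One way to pack the three fields into the 8-byte chunk ID is sketched below; the field widths (16-bit stripe list ID, 40-bit stripe ID, 8-bit chunk position) are an illustrative guess, not necessarily the split MemEC uses.

```python
# Hypothetical packing of the three chunk ID fields into 8 bytes
# (field widths are assumptions): 16-bit stripe list ID, 40-bit stripe ID,
# 8-bit chunk position.
def pack_chunk_id(stripe_list_id: int, stripe_id: int, position: int) -> bytes:
    assert stripe_list_id < 2**16 and stripe_id < 2**40 and position < 2**8
    return ((stripe_list_id << 48) | (stripe_id << 8) | position).to_bytes(8, "big")

def unpack_chunk_id(chunk_id: bytes):
    value = int.from_bytes(chunk_id, "big")
    return value >> 48, (value >> 8) & (2**40 - 1), value & 0xFF

cid = pack_chunk_id(stripe_list_id=7, stripe_id=123456, position=3)
assert len(cid) == 8
assert unpack_chunk_id(cid) == (7, 123456, 3)
```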
Analysis
All-encoding achieves much lower redundancy
14
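A back-of-the-envelope comparison, using assumed sizes and parameters ((n, k) = (10, 8), 3 replicas, 24-byte key, 8-byte value, a hypothetical 16-byte metadata record and 8-byte reference), illustrates why all-encoding wins for small objects; the paper's exact accounting differs, but the trend is the same.

```python
# Rough per-object redundancy (stored bytes / unique key+value bytes) under
# assumed sizes; not the paper's exact analysis, just an illustration.
KEY, VALUE, META, REF = 24, 8, 16, 8
n, k, replicas = 10, 8, 3          # example coding parameters and replication factor

unique = KEY + VALUE

# All-replication: every replica stores the key, value, metadata, and reference.
all_replication = replicas * (KEY + VALUE + META + REF) / unique

# Hybrid-encoding: only the value is erasure coded; key, metadata, and the
# object reference are replicated across the n nodes of the stripe.
hybrid = (VALUE * n / k + (KEY + META + REF) * n) / unique

# All-encoding: the whole object (metadata + key + value) is erasure coded;
# the per-server indexes are not replicated and are ignored here.
all_encoding = (META + KEY + VALUE) * n / k / unique

print(f"all-replication: {all_replication:.2f}x")
print(f"hybrid-encoding: {hybrid:.2f}x")
print(f"all-encoding:    {all_encoding:.2f}x")
```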
MemEC Architecture
[Figure: clients send SET/GET/UPDATE/DELETE requests for objects to proxies; proxies store objects in a unified memory space spanning the servers; a coordinator connects to the proxies and servers]
15
Coordinator on I/O path only in degraded mode
MemEC SET
[Figure: the proxy sends the SET request (key, value) to the responsible data server and, in parallel, the key and value to the parity servers; the coordinator is shown but not on the I/O path]
16
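A sketch of how a proxy might route a decentralized SET, per the figure: the key is hashed to a stripe list, which fixes the data and parity servers, and the request goes to the data server and parity servers in parallel. The hash functions, server names, and send() stub are assumptions for illustration.

```python
# Sketch of decentralized SET routing (hashing and server names are assumed).
import hashlib
from concurrent.futures import ThreadPoolExecutor

STRIPE_LISTS = [  # each stripe list fixes k data servers and n - k parity servers
    {"data": ["d0", "d1", "d2", "d3"], "parity": ["p0", "p1"]},
    {"data": ["d4", "d5", "d6", "d7"], "parity": ["p2", "p3"]},
]

def _h(key: bytes, salt: bytes) -> int:
    return int.from_bytes(hashlib.sha1(salt + key).digest()[:8], "little")

def route(key: bytes) -> tuple[str, list[str]]:
    """Map a key to its data server and the parity servers of its stripe list."""
    servers = STRIPE_LISTS[_h(key, b"list") % len(STRIPE_LISTS)]
    data_server = servers["data"][_h(key, b"data") % len(servers["data"])]
    return data_server, servers["parity"]

def send(server: str, message: dict) -> str:
    return f"{server} acked {message['op']}"   # stand-in for the network call

def set_object(key: bytes, value: bytes) -> list[str]:
    data_server, parity_servers = route(key)
    msg = {"op": "SET", "key": key, "value": value}
    with ThreadPoolExecutor() as pool:  # contact data and parity servers in parallel
        futs = [pool.submit(send, s, msg) for s in [data_server, *parity_servers]]
        return [f.result() for f in futs]

print(set_object(b"user4396", b"42"))
```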
MemEC GET
[Figure: the proxy sends the GET request (key) directly to the data server holding the object]
17
MemEC UPDATE/DELETE
[Figure: the proxy sends an UPDATE request (key, modified range, modified data) or a DELETE request (key) to the data server; the data server then sends UPDATE_CHUNK requests (chunk identifier, data delta) to the parity servers in parallel]
18
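For UPDATE, only a data delta travels to the parity servers, which patch their parity chunks in place. The sketch below shows this for the XOR parity assumed in the earlier sketches; a Reed-Solomon code would additionally scale the delta by the coding coefficient.

```python
# Parity-delta update for XOR parity: the data server sends (chunk ID, offset,
# delta) and each parity server patches its parity chunk in place; no other
# data chunks of the stripe are read.
def make_delta(old_range: bytes, new_range: bytes) -> bytes:
    return bytes(o ^ n for o, n in zip(old_range, new_range))

def apply_update_chunk(parity_chunk: bytearray, offset: int, delta: bytes) -> None:
    for i, d in enumerate(delta):
        parity_chunk[offset + i] ^= d

# Data server side: the modified range of the data chunk yields the delta ...
old, new = b"\x11\x22\x33\x44", b"\x11\xAA\x33\x44"
delta = make_delta(old, new)

# ... parity server side: XOR the delta into the same range of the parity chunk.
parity = bytearray(range(16))            # stand-in for the current parity chunk
apply_update_chunk(parity, offset=4, delta=delta)
assert parity[5] == 0x05 ^ 0x22 ^ 0xAA   # parity now reflects the new data byte
```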
Fault Tolerance
In normal mode, requests are decentralized
Coordinator is not on I/O path
When a server fails, proxies move from decentralized requests to degraded requests managed by the coordinator
Ensure data consistency by reverting any inconsistent changes or replaying incomplete requests
Coordinator redirects degraded requests from a failed server to another working server
Requests that do not involve the failed server remain decentralized
Rationale: normal mode is the common case; the coordinator is only involved in degraded mode
19
Server States
Coordinator maintains a state for each server and instructs all proxies how to communicate with a server
[Figure: state diagram - Normal -> (server failed) -> Intermediate -> (inconsistency resolved) -> Degraded -> (server restored) -> Coordinated Normal -> (migration completed) -> Normal]
20
Server States
All proxies and working servers share the same view of server states
Two-phase protocol (see the sketch after this slide):
When the coordinator detects a server failure, it notifies all proxies to finish all decentralized requests (intermediate state)
Each proxy notifies the coordinator when finished
The coordinator then notifies all proxies to issue degraded requests via the coordinator (degraded state)
Implemented via atomic broadcast
21
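A minimal sketch of the coordinator side of this two-phase transition, with hypothetical class and message names; the atomic broadcast layer is elided.

```python
# Minimal sketch of the coordinator-driven two-phase state transition
# (the Proxy stub and method names are hypothetical; atomic broadcast elided).
from enum import Enum, auto

class ServerState(Enum):
    NORMAL = auto()
    INTERMEDIATE = auto()        # proxies drain in-flight decentralized requests
    DEGRADED = auto()            # degraded requests are routed via the coordinator
    COORDINATED_NORMAL = auto()  # restored server receives migrated data

class Proxy:
    def on_state(self, failed_server: str, state: ServerState) -> bool:
        # Phase 1: finish decentralized requests touching failed_server, then ack.
        return True

class Coordinator:
    def __init__(self, proxies: list[Proxy]) -> None:
        self.proxies = proxies
        self.states: dict[str, ServerState] = {}

    def handle_failure(self, server: str) -> None:
        # Phase 1: announce INTERMEDIATE and wait for every proxy to ack.
        self.states[server] = ServerState.INTERMEDIATE
        assert all(p.on_state(server, ServerState.INTERMEDIATE) for p in self.proxies)
        # Phase 2: announce DEGRADED; degraded requests now go via the coordinator.
        self.states[server] = ServerState.DEGRADED
        for p in self.proxies:
            p.on_state(server, ServerState.DEGRADED)

coordinator = Coordinator([Proxy(), Proxy()])
coordinator.handle_failure("server-3")
assert coordinator.states["server-3"] is ServerState.DEGRADED
```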
Evaluation
Testbed under commodity settings:
16 servers
4 proxies
1 coordinator
1 Gbps Ethernet
YCSB benchmarking (4 instances, 64 threads each)
Key size: 24 bytes
Value size: 8 bytes and 32 bytes (large values also considered)
Do not consider range queries
22
Impact of Transient Failures
Failures occur before the load phase:
Latency of SET in the load phase increases by 11.5% with degraded request handling
For Workload A, latencies of UPDATE and GET increase by 53.3% and 38.2%, respectively
23
Impact of Transient Failures
Failures occur after the load phase:
Latencies of GET and UPDATE increase by 180.3% and 177.5%, respectively
Latency of GET in Workload C only increases by 6.69%
24
State Transition Overhead
Average elapsed times of state transitions, with 95% confidence intervals
The difference between the two elapsed times is mainly caused by reverting parity updates of incomplete requests
The elapsed time also includes data migration from the redirected server to the restored server, so it increases substantially
25
Conclusion
A case of applying erasure coding to build a highly available in-memory KV store: MemEC
Enable fast recovery by keeping redundancy entirely in memory
Two key designs:
Support of small objects
Graceful transition between decentralized requests in normal mode and coordinated degraded requests in degraded mode
Prototype and experiments
Source code: https://github.com/mtyiu/memec
26