Enhancing Scalability and Performance in Deep Recommendation Systems with EVStore

Daniar H. Kurniawan, Ruipu Wang, Kahfi Zulkifli, Fandi Wiranata, John Bent, Ymir Vigfusson, Haryadi S. Gunawi

EVStore presents a novel approach to scaling embedding tables in deep recommendation systems, offering a three-layer caching system that optimizes storage and caching capabilities. By leveraging groupability-aware caching, mixed-precision caching, and embedding approximation, EVStore achieves lighter memory usage, faster inference times, and enhanced model collocation. The solution addresses the challenges of scalability and performance in deploying recommendation systems on commodity backend infrastructure, providing a comprehensive strategy for efficient storage and retrieval of embedding data.

  • Scalability
  • Performance
  • Recommendation Systems
  • Caching
  • Embedding Tables

Uploaded on Apr 16, 2024



Presentation Transcript


  1. EVStore: Storage and Caching Capabilities for Scaling Embedding Tables in Deep Recommendation Systems Daniar H. Kurniawan, Ruipu Wang, Kahfi Zulkifli, Fandi Wiranata, John Bent, Ymir Vigfusson, Haryadi S. Gunawi

  2. VANCOUVER

  3. Deep Recommendation Systems (DRS). Usage: personalized advertisements, movies, news, product recommendations, etc. Impact: 30%, 40%, 60%, 75%. What's inside? A database, ML models, and an embedding table. "Similar videos are suddenly being recommended to me based on what I watch."

  4. DRS model deployment. 1. Core model; 2. Embedding (in RAM): tens of TB and growing! Similar objects appear closer in the vector space (e.g., "kitten", "cat", and "dog" map to nearby embedding vectors (EVs), "apple" to a distant one). Keeping embeddings in RAM gives low latency, but poor scalability and very high cost; embedding lookup is ~40% of inference latency. DRS is up against a scaling wall!

  5. Existing solution: off-load the embedding table from RAM to specialized storage hardware. Downsides: requires HW modification, complex to use. Ours: EVStore, a three-layer caching system (L1 + L2 + L3) over RAM and a commodity SSD. Results: 94% lighter memory, 23% faster inference, 4x model collocation.

  6. EVStore Overview. Existing algorithms (e.g., LRU, Clock, LIRS, LeCAR) are not effective! EVCache (L1): optimized for multi-table lookups. EVMix (L2): mixed-precision caching; let's cache the smaller precision! EVProx (L3): approximate embedding; let's enable approximation!

  7. The Story of EVStore. Goal: scalable and performant DRS deployment on a commodity backend. Design principles: simple policies; modular for easy integration. Limitation: the stale-cache problem (handled by flushing the cache). EVStore approach/techniques: EVCache (L1): groupability-aware caching; EVMix (L1 + L2): mixed-precision caching; EVProx (L1 + L2 + L3): embedding approximation. Results: 94% lighter memory, 23% faster inference, 4x model collocation.

  8. Outline: Background & Motivation; EVStore Overview; EVStore Design (EVCache: groupability-aware caching; EVMix: mixed-precision caching; EVProx: embedding approximation); EVStore Evaluation; Overhead; Summary.

  9. EVStore1: Groupability-aware caching (EVCache, L1). Extract groupability relationships between items to improve perfect hits. A DRS inference issues one multi-table lookup, e.g. lookup(A1, B4, C6, . . Z9); all keys hit = PERFECT hit. Other algorithms optimize recency or frequency; EVCache (ours) optimizes groupability to improve the perfect-hit rate.
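
The perfect-hit notion above can be sketched in a few lines of Python. This is an illustrative sketch, not EVStore's actual API; `lookup_group`, the dict-based cache, and the toy tables are all hypothetical names.

```python
# Illustrative sketch of a multi-table embedding lookup with perfect-hit
# accounting. The cache here is a plain dict keyed by (table_id, key).

def lookup_group(cache, ssd_tables, keys):
    """One inference request fetches one key per embedding table.
    Returns (vectors, perfect_hit): perfect only if *every* key was cached."""
    vectors, hits = [], 0
    for table_id, key in enumerate(keys):
        vec = cache.get((table_id, key))
        if vec is not None:
            hits += 1
        else:  # miss: fall back to the raw embedding table on SSD, then fill
            vec = ssd_tables[table_id][key]
            cache[(table_id, key)] = vec
        vectors.append(vec)
    return vectors, hits == len(keys)

# Toy usage: two tables; the cache is pre-warmed with only one of the two keys.
tables = [{"A1": [0.1, 0.2]}, {"B4": [0.3, 0.4]}]
cache = {(0, "A1"): [0.1, 0.2]}
_, perfect1 = lookup_group(cache, tables, ("A1", "B4"))  # B4 misses: no perfect hit
_, perfect2 = lookup_group(cache, tables, ("A1", "B4"))  # every key hits: perfect
print(perfect1, perfect2)
```

A single miss is enough to lose the perfect hit, which is why per-key hit rates alone understate end-to-end latency.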

  10. Groupability Scoring (1/3). Cache state (prior to lookup): EMPTY. The 1st lookup: A1, B1, C1, D1, E1; #hit = 0. Cache state (after lookup): A1, B1, C1, D1, E1 newly inserted, each with groupability score 0.

  11. Groupability Scoring (2/3). Cache state (prior to lookup): A1, B1, C1, D1, E1, all at score 0. The 2nd lookup: A1, B1, C1, D2, E2; #hit = 3. Cache state (after lookup): A1, B1, C1 updated to score 3; D2, E2 newly inserted at score 3; D1, E1 remain at 0.

  12. Groupability Scoring (3/3). Cache state (prior to lookup): A1, B1, C1 at 3; D1, E1 at 0; D2, E2 at 3. The 3rd lookup: A1, B1, C1, D2, E3; #hit = 4. Cache state (after lookup): A1, B1, C1, D2 updated to score 4; E3 newly inserted at score 4; E2 stays at 3; D1, E1 remain at 0.

  13. EVCache Eviction. Eviction algorithm: 1. check groupability; 2. check recency. Prior to eviction, D1 and E1 (score 0) are the eviction candidates; D1, the least recently used, is evicted.
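
The walkthrough on slides 10-13 can be replayed with a small toy cache. This is a reconstruction consistent with the scores shown on those slides, not EVStore's actual implementation; the update rule (raising each touched key's score to at least the lookup's hit count) is inferred from the examples.

```python
class GroupabilityCache:
    """Toy groupability-aware (EV-LFU-style) cache. On a group lookup, every
    key touched is raised to at least that lookup's hit count; eviction drops
    the lowest groupability score first, breaking ties by least recent use."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = {}      # key -> (value, groupability score)
        self.last_used = {}  # key -> logical timestamp of last lookup
        self.clock = 0

    def lookup(self, kv_pairs):
        hits = sum(1 for key, _ in kv_pairs if key in self.store)
        self.clock += 1
        for key, value in kv_pairs:
            if key not in self.store and len(self.store) >= self.capacity:
                self._evict()
            old_score = self.store.get(key, (None, 0))[1]
            self.store[key] = (value, max(old_score, hits))
            self.last_used[key] = self.clock
        return hits

    def _evict(self):
        # 1. check groupability, 2. check recency
        victim = min(self.store, key=lambda k: (self.store[k][1], self.last_used[k]))
        del self.store[victim], self.last_used[victim]

# Replaying the three lookups from the slides (capacity 7 forces one eviction):
c = GroupabilityCache(capacity=7)
c.lookup([(k, k) for k in ("A1", "B1", "C1", "D1", "E1")])  # hits = 0
c.lookup([(k, k) for k in ("A1", "B1", "C1", "D2", "E2")])  # hits = 3
c.lookup([(k, k) for k in ("A1", "B1", "C1", "D2", "E3")])  # hits = 4; D1 evicted
print(sorted((k, s) for k, (_, s) in c.store.items()))
```

The final state matches slide 13: D1 (score 0, least recent) is gone, E2 keeps its score of 3, and the keys of the latest group all sit at 4.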

  14. EVStore2: Mixed-precision caching (EVMix). EVCache (L1): extract groupability relationships between items to improve perfect hits. EVMix (L1 + L2): split the cache into two layers, with each layer storing a different precision. Lower precision = faster inference latency.

  15. EVStore2: Mixed-precision caching (EVMix), continued. Lower precision also means more key-value pairs fit in the cache, and the more data being cached, the higher the perfect-hit rate.

  16. EVStore2: Mixed-precision caching (EVMix), continued. Higher precision gives higher accuracy; lower precision gives faster latency. This enables the accuracy-vs-latency tradeoff.
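
A minimal sketch of the precision split, using Python's stdlib half-precision codec. This is illustrative only: EVStore's real caching layer is in C++ and also supports 8-bit and 4-bit codings, and `demote_to_l2`/`promote_to_l1` are hypothetical names.

```python
import struct

# L1 keeps float32 vectors; demotion to L2 re-encodes each value as float16,
# halving the bytes per vector, so L2 fits roughly twice as many entries.

def demote_to_l2(vec_f32):
    """Pack a full-precision vector into float16 bytes (lossy)."""
    return struct.pack(f"<{len(vec_f32)}e", *vec_f32)

def promote_to_l1(blob):
    """Decode float16 bytes back to full-precision floats before inference."""
    return list(struct.unpack(f"<{len(blob) // 2}e", blob))  # 2 bytes per value

vec = [0.3, -0.3, 0.2, 0.1]
blob = demote_to_l2(vec)
print(len(blob))              # 8 bytes instead of 16 for float32
print(promote_to_l1(blob))    # approximately the original values
```

The round trip loses a few low-order bits per value, which is exactly the accuracy-vs-latency knob the slide describes.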

  17. EVStore3: Embedding value approximation (EVProx, L1 + L2 + L3). Enable key-to-key caching to find an approximately similar embedding value. Preprocessing: 1. run similarity analysis (Euclidean and cosine distance); 2. analyze key popularity. Surrogate key: the most popular key out of the N most similar keys.
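
The preprocessing step can be sketched as follows. This is a hypothetical offline pass: `build_surrogates` and the cosine-only similarity are simplifications of the Euclidean-plus-cosine analysis on the slide.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def build_surrogates(embeddings, popularity, n=2):
    """For each key, pick the most popular of its n most similar keys.
    embeddings: key -> vector; popularity: key -> access count."""
    surrogate = {}
    for key, vec in embeddings.items():
        ranked = sorted((k for k in embeddings if k != key),
                        key=lambda k: cosine(vec, embeddings[k]), reverse=True)
        surrogate[key] = max(ranked[:n], key=lambda k: popularity[k])
    return surrogate

# Toy run: B's two nearest neighbors are D and A; A is far more popular.
emb = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0], "D": [0.95, 0.05]}
pop = {"A": 100, "B": 5, "C": 1, "D": 50}
print(build_surrogates(emb, pop)["B"])   # -> A
```

At serving time a miss on B can then be answered with A's cached vector: popular surrogates are far more likely to already be resident, which is why such a tiny key-to-key table pays off.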

  18. The Effectiveness of EVProx. The latency tail is shifted from p80 to p98 (end-to-end inference latency, ms), while EVProx only uses 4% of the cache size. Small but powerful.

  19. EVStore: Data Life Cycle. Goals: no duplicate items; minimize data thrashing. On demotion from L1 to L2, reduce the data precision; from L2 to L3, only store the key.
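
The demotion path can be sketched like this (assumed behavior pieced together from the slide; `demote`, the dict layers, and the toy quantizer are all illustrative):

```python
# A key lives in at most one layer (no duplicate items). Demotion from L1
# reduces precision (EVMix); demotion from L2 keeps only the surrogate-key
# mapping (EVProx); eviction from L3 means the next lookup goes to the SSD.

def demote(l1, l2, l3, key, surrogate_of, quantize):
    if key in l1:            # L1 -> L2: reduce data precision
        l2[key] = quantize(l1.pop(key))
    elif key in l2:          # L2 -> L3: only store the key
        del l2[key]
        l3[key] = surrogate_of[key]
    elif key in l3:          # L3 -> out: the key falls back to raw SSD tables
        del l3[key]

l1, l2, l3 = {"A1": [0.25, -0.5]}, {}, {}
surrogates = {"A1": "B7"}                      # hypothetical surrogate mapping
shrink = lambda v: [round(x, 1) for x in v]    # stand-in for real quantization
demote(l1, l2, l3, "A1", surrogates, shrink)   # A1 now lives only in L2
demote(l1, l2, l3, "A1", surrogates, shrink)   # A1 now lives only in L3
print(l1, l2, l3)
```

Moving (rather than copying) the entry on every demotion is what enforces the no-duplicates goal across the three layers.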

  20. Outline: Background & Motivation; EVStore Overview; EVStore Design (EVCache: groupability-aware caching; EVMix: mixed-precision caching; EVProx: embedding approximation); EVStore Evaluation; Overhead; Summary.

  21. EVStore Stack and Evaluation Setup. Workloads: online inference over Avazu, Criteo Kaggle, and Criteo Terabyte; batch size = 1. DL engine: PyTorch. Metrics: 1. inference latency; 2. accuracy; 3. perfect-hit rate. EVStore (EVCache + EVMix + EVProx) vs. state-of-the-art: LRU, LFU, Clock, ClockPro, CAR, ARC, LIRS, DLIRS, ClockLIRS, LeCAR, CACHEUS. Cached embeddings live in memory (CPU DRAM); raw embedding tables live on SSDs.

  22. EVCache vs. State-of-the-art. EVCache variants get up to a 10% higher perfect-hit rate!

  23. Perfect-hit vs. Individual-hit.

  24. EVCache on Various Workloads: (ours) vs. state-of-the-art, spanning fundamental, ML-based, and adaptive algorithms (LRU, LFU, Clock, CAR, ARC, LIRS, DLIRS, ClockPro, ClockLIRS, LeCAR, CACHEUS). EV-LFU shows a steeper benefit as the number of EV tables grows! EV-LFU outperforms the others on various datasets!

  25. Perfect-hit rate on various cache sizes. A 35% gap!

  26. Where to put the caching layer? An external caching layer is NOT effective! The optimum place to deploy EVCache is inside the DLRM!

  27. How to implement EVMix? Goal: find the most performant implementation of the EVMix layers. Setup: baseline: a Python-based caching layer within the DLRM; precision: 32-bit. Result: 20% faster.

  28. EVStore Orthogonal Evaluation. SLA: inference latency < 2 ms; accuracy drop < 0.2%. Memory usage! Configurations: Meta's DLRM baseline; EVCache_Py + LRU; EVCache_Py + EVLFU; EVCache_Cpp + EVLFU (32-bit); EVCache_Cpp + EVLFU (16-bit); EVCache_Cpp + EVMix (8-bit + 4-bit); EVStore (EVCache + EVMix + EVProx). The fully optimized EVStore reduces memory usage by up to 94%.

  29. Collocating Multiple DRS: 1 DRS model vs. 4 DRS models.

  30. More in the Paper! The importance of perfect hits; the implementation of the EVCache variants; EVMix's bit-coding optimization; more evaluation results; the trade-off between latency and accuracy; and more.

  31. EVStore Overhead. 1. Memory: no duplicate values (the embedding value is stored as-is); storing the groupability score adds negligible overhead (<1%). 2. Computation: offline overhead: mapping surrogate keys; online overhead (negligible): converting low precision to full precision, managing cache eviction, and decoding the data stream from the C++ side to the Python side. 3. Storage (SSD): storing multiple precisions of the same value incurs 2x overhead.

  32. EVStore Summary. Storage and caching capabilities for scaling embedding tables in deep recommendation systems. EVCache (L1): groupability-aware caching; EVMix (L2): mixed-precision caching; EVProx (L3): embedding value approximation. 23% faster inference, 94% lighter memory, 4x model collocation. EVStore code: https://github.com/ucare-uchicago/ev-store-dlrm

  33. Thank you! I'm on the job market. <daniar@uchicago.edu>
