Fast Multicore Key-Value Storage Study

"Explore Cache Craftiness for Fast Multicore Key-Value Storage in a comprehensive study on building a high-performance KV store system. Learn about the feature wishlist, challenges with hard workloads, initial attempts with binary trees, and advancements with Masstree. Discover the contributions and experiment environment of this cutting-edge research."



Presentation Transcript


  1. Cache Craftiness for Fast Multicore Key-Value Storage. Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT).

  2. Let's build a fast key-value store. KV store systems are important: Google Bigtable, Amazon Dynamo, Yahoo! PNUTS. Single-server KV performance matters: it reduces cost and eases management. Goal: a fast KV store for a single multi-core server, assuming all data fits in memory (as in Redis and VoltDB).

  3. Feature wish list: clients send queries over the network; data persists across crashes; range queries; good performance on various workloads, including hard ones!

  4. Hard workloads: skewed key popularity (load imbalance), small key-value pairs, many puts, and arbitrary keys, either strings (e.g. www.wikipedia.org/...) or integers. All of these are hard!

  5. First try: a fast binary tree. 140M short KV pairs, put-only, @16 cores. Network and disk are not bottlenecks (high-bandwidth NIC, multiple disks). Result: 3.7 million queries/second! Can we do better? What bottleneck remains? DRAM!

  6. Cache craftiness goes 1.5X farther. 140M short KV pairs, put-only, @16 cores (chart: throughput of Binary vs. Masstree, in millions of req/sec). Cache craftiness means careful use of cache and memory.

  7. Contributions. Masstree achieves millions of queries per second across various hard workloads: skewed key popularity, various read/write ratios, variable and relatively long keys, and data much larger than the on-chip cache. New ideas: a trie of B+trees, the permuter, etc. A full system: the new ideas plus best practices (network, disk, etc.).

  8. Experiment environment: a 16-core server with three active DRAM nodes, a single 10Gb Network Interface Card (NIC), four SSDs, and 64 GB DRAM, driven by a cluster of load generators.

  9. Potential bottlenecks in Masstree: the network, DRAM, and the disks holding the log, all within a single multi-core server.

  10. The NIC bottleneck can be avoided. A single 10Gb NIC has multiple queues and scales to many cores. Target: 100B KV pairs => 10M requests/sec. Use the network stack efficiently: pipeline requests and avoid copying costs.
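As a rough sanity check of that target (my own back-of-the-envelope arithmetic; the slide only states the numbers), a 10 Gb/s link carries about 1.25 GB/s, which at roughly 100 B per key-value request is on the order of the 10M requests/sec target:

$$ \frac{10\ \text{Gb/s}}{8\ \text{bits/byte}} = 1.25\ \text{GB/s}, \qquad \frac{1.25\ \text{GB/s}}{\sim 100\ \text{B/request}} \approx 12.5\ \text{M requests/s}. $$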

  11. The disk bottleneck can be avoided. 10M puts/sec => 1GB of logs/sec! Single-disk write throughput and cost: a mainstream disk delivers 100-300 MB/sec at about $1/GB; a high-performance SSD delivers up to 4.4 GB/sec at over $40/GB. Solution: split the log across multiple disks (see the paper for details).
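The 1 GB/s log figure follows from the same ~100 B-per-put assumption (again my arithmetic, not from the slide); with the four SSDs of the experiment setup, splitting the log leaves each device with a sustainable write rate:

$$ 10\ \text{M puts/s} \times \sim 100\ \text{B/put} \approx 1\ \text{GB/s of log}, \qquad \frac{1\ \text{GB/s}}{4\ \text{SSDs}} \approx 250\ \text{MB/s per SSD}. $$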

  12. The DRAM bottleneck is hard to avoid. 140M short KV pairs, put-only, @16 cores. Cache craftiness goes 1.5X farther, including the cost of network and disk (chart: throughput of Binary vs. Masstree).

  13. DRAM bottleneck without network/disk. 140M short KV pairs, put-only, @16 cores. Cache craftiness goes 1.7X farther! (chart: throughput of Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree).

  14. DRAM latency: the binary tree. 140M short KV pairs, put-only, @16 cores. A lookup incurs O(log2 N) serial DRAM latencies: about 2.7 us per lookup, or roughly 380K lookups/core/sec (chart: Binary vs. VoltDB).
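A back-of-the-envelope version of those numbers, assuming roughly 100 ns per uncached DRAM access (the per-access latency is my assumption; the slide gives only the end results):

$$ \log_2(1.4\times 10^8) \approx 27\ \text{levels}, \qquad 27 \times \sim 100\ \text{ns} \approx 2.7\ \mu\text{s/lookup}, \qquad \frac{1}{2.7\ \mu\text{s}} \approx 370\text{K lookups/core/s}, $$

which is in line with the ~380K lookups/core/sec quoted on the slide.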

  15. DRAM latency: a lock-free 4-way tree. Concurrency is handled the same way as in the binary tree. One cache line per node => 3 KV pairs / 4 children, so roughly half the levels of the binary tree and roughly half the serial DRAM latencies.
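A minimal sketch of what a one-cache-line 4-way node could look like (the field names and exact layout are my own; the slide only specifies the 3-keys/4-children-per-cache-line shape):

```cpp
#include <cstdint>

// Hypothetical 4-way tree node sized to one 64-byte cache line:
// 3 fixed-size keys, 4 child pointers, and a small amount of metadata.
struct alignas(64) FourTreeNode {
    uint64_t keys[3];            // 3 eight-byte keys, kept sorted within the node
    FourTreeNode* children[4];   // 4 children: one per gap around the keys
    uint8_t nkeys;               // how many entries of keys[] are in use (0..3)
    // the remaining bytes of the cache line are padding
};

static_assert(sizeof(FourTreeNode) == 64, "node should occupy one cache line");
```

Because the whole node fits in one cache line, a lookup touches one line of DRAM per level instead of one per key comparison.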

  16. The 4-tree beats the binary tree by 40%. 140M short KV pairs, put-only, @16 cores (chart: throughput of Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree).

  17. The 4-tree may perform terribly! It is unbalanced: sequential inserts, for example, produce O(N) levels and hence O(N) serial DRAM latencies. We want a balanced tree with wide fanout.

  18. B+tree: wide and balanced. A concurrent main-memory B+tree [OLFIT] with optimistic concurrency control based on a version technique: lookup/scan is lock-free, while puts hold 3 per-node locks.
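A minimal sketch of the version-based optimistic read that the slide alludes to (the real OLFIT/Masstree version layout and memory-ordering discipline differ in detail; the names and the simplified node here are mine):

```cpp
#include <atomic>
#include <cstdint>
#include <optional>

// Simplified node with a version word used for optimistic, lock-free reads.
// A writer sets the low "dirty" bit while modifying the node and bumps the
// counter when done; a reader retries if the version was dirty or changed.
struct Node {
    std::atomic<uint64_t> version{0};  // bit 0 = dirty, upper bits = counter
    uint64_t keys[15];
    uint64_t values[15];
    int nkeys = 0;
};

std::optional<uint64_t> optimistic_lookup(const Node& n, uint64_t key) {
    for (;;) {
        uint64_t v1 = n.version.load(std::memory_order_acquire);
        if (v1 & 1) continue;                  // writer in progress: retry
        std::optional<uint64_t> result;
        for (int i = 0; i < n.nkeys; ++i)      // read the node without locking
            if (n.keys[i] == key) { result = n.values[i]; break; }
        uint64_t v2 = n.version.load(std::memory_order_acquire);
        if (v1 == v2) return result;           // version unchanged: read is valid
        // a concurrent writer interfered; loop and re-read
    }
}
```

A put, by contrast, would lock the node, set the dirty bit, modify it, then clear the dirty bit and increment the counter.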

  19. A wide-fanout B+tree is 11% slower! 140M short KV pairs, put-only. With fanout 15 it has fewer levels than the 4-tree, but it fetches at least as many cache lines from DRAM: every 4-tree internal node is full, while B+tree nodes are only ~75% full, so serial DRAM latencies are >= the 4-tree's (chart: throughput of Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree).

  20. B+tree with software prefetch, as in [pB+-trees]: prefetching 4 lines costs about the same as fetching 1 line. Masstree's B+tree with fanout 15 spans 4 cache lines, and the whole node is always prefetched when accessed. Result: one DRAM latency per node instead of 2, 3, or 4.
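A minimal sketch of prefetching all four cache lines of a node before touching it, using the GCC/Clang `__builtin_prefetch` intrinsic (the 256-byte node here is a placeholder, not Masstree's actual layout):

```cpp
// Hypothetical 4-cache-line (256-byte) B+tree node.
struct alignas(64) WideNode {
    char bytes[256];   // stands in for keys, child pointers, version, permuter
};

// Issue prefetches for every cache line of the node before reading it, so the
// four memory accesses overlap instead of being serviced one after another.
inline void prefetch_node(const WideNode* n) {
    const char* p = reinterpret_cast<const char*>(n);
    for (int line = 0; line < 4; ++line)
        __builtin_prefetch(p + 64 * line, /*rw=*/0, /*locality=*/3);
}
```

The lookup then pays roughly one DRAM latency per node rather than one per cache line.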

  21. B+tree with prefetch. 140M short KV pairs, put-only, @16 cores. It beats the 4-tree by 9%: balanced beats unbalanced! (chart: throughput of Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree).

  22. The concurrent B+tree problem: lookups must retry when there is a concurrent insert. For example, insert(B) into a node holding A C D shifts keys and exposes an intermediate state before the node reaches A B C D. The lock-free 4-tree does not have this problem, since keys never move around, but it is unbalanced.

  23. B+tree optimization: the permuter. Keys are stored unsorted; each tree node defines their order with a permuter, a 64-bit integer. For example, insert(B) into a node holding A C D (permuter 0 1 2) writes B into a free slot, giving A C D B with permuter 0 3 1 2. A concurrent lookup does not need to retry: lookups use the permuter to search the keys, so an insert appears atomic to lookups.
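A minimal sketch of one plausible permuter encoding: a packed array of 4-bit slot indices plus a count, updated by publishing a single new 64-bit value (the exact bit layout in Masstree may differ; this split of fields is my assumption):

```cpp
#include <cstdint>

// Hypothetical permuter: the low 4 bits hold the key count, and the i-th
// following 4-bit nibble holds the physical slot of the i-th key in sorted order.
struct Permuter {
    uint64_t bits = 0;

    int size() const { return bits & 0xF; }

    // Physical slot of the i-th key in sorted order.
    int slot(int i) const { return (bits >> (4 * (i + 1))) & 0xF; }

    // Return a new permuter with physical slot `new_slot` spliced in at sorted
    // position `pos`. The caller writes the key into `new_slot` first, then
    // publishes the new permuter with one atomic 64-bit store, so a concurrent
    // lookup sees either the old ordering or the new one, never a partial shift.
    Permuter insert(int pos, int new_slot) const {
        Permuter p;
        int n = size();
        p.bits = static_cast<uint64_t>(n + 1);
        for (int i = 0, j = 0; i < n + 1; ++i) {
            int s = (i == pos) ? new_slot : slot(j++);
            p.bits |= static_cast<uint64_t>(s) << (4 * (i + 1));
        }
        return p;
    }
};
```

With the slide's example, a node holding A C D in slots 0-2 has permuter order 0 1 2; inserting B into free slot 3 yields order 0 3 1 2 without moving any existing key.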

  24. B+tree with permuter. 140M short KV pairs, put-only, @16 cores. Throughput improves by 4% (chart: Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree).

  25. Performance drops dramatically as key length increases. Short values, 50% updates, @16 cores, no logging; keys differ only in their last 8 bytes (chart: throughput vs. key length from 8 to 48 bytes). Why? The tree stores key suffixes indirectly, so each key comparison compares the full key and costs an extra DRAM fetch.

  26. Masstree: a trie of B+trees. A trie is a tree in which each level is indexed by a fixed-length key fragment. Masstree is a trie with fanout 2^64 in which each trie node is a B+tree: the first layer is indexed by k[0:7], the next by k[8:15], the next by k[16:23], and so on. This compresses key prefixes!
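A minimal sketch of slicing a variable-length key into the 8-byte fragments that index successive trie layers, packed so that integer comparison matches byte-wise key order (the helper name and big-endian packing choice are mine):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>

// Return the `layer`-th 8-byte fragment of `key` as a uint64_t that compares in
// the same order as memcmp on those bytes (big-endian packing, zero-padded when
// the key ends inside the fragment).
uint64_t key_fragment(const std::string& key, size_t layer) {
    unsigned char buf[8] = {0};
    size_t off = layer * 8;
    if (off < key.size())
        std::memcpy(buf, key.data() + off, std::min<size_t>(8, key.size() - off));
    uint64_t frag = 0;
    for (int i = 0; i < 8; ++i)
        frag = (frag << 8) | buf[i];
    return frag;
}

// Conceptual descent: search the root B+tree with key_fragment(key, 0); if keys
// in that node share the fragment, follow a pointer to the next-layer B+tree and
// search it with key_fragment(key, 1), and so on.
```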

  27. Case study: keys share a P-byte prefix. Is this better than a single B+tree? Masstree descends through P/8 trie levels, each containing only one node, before reaching a single B+tree over 8-byte key fragments. Complexity and DRAM accesses: Masstree O(log N) and O(log N); a single B+tree over the full keys O(P log N) and O(P log N).

  28. Masstree performs better for long keys with shared prefixes. Short values, 50% updates, @16 cores, no logging (chart: Masstree vs. B+tree throughput for key lengths of 8 to 48 bytes). The reason: 8-byte key-fragment comparisons vs. full-key comparisons.

  29. Does the trie of B+trees hurt short-key performance? 140M short KV pairs, put-only, @16 cores. No: it is 8% faster, because the code is more efficient when internal nodes handle 8-byte keys only (chart: Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree).

  30. Evaluation. How does Masstree compare to other systems? How does it compare to partitioned trees, i.e. how much do we pay for handling skewed workloads? How does it compare with a hash table, i.e. how much do we pay for supporting range queries? Does Masstree scale on many cores?

  31. Masstree performs well even with persistence and range queries. 20M short KV pairs, uniform distribution, read-only, @16 cores, with network (chart: throughput in millions of req/sec for MongoDB at 0.04, VoltDB at 0.22, then Redis, Memcached, and Masstree). Memcached is not persistent and has no range queries; Redis has no range queries; and the comparison is unfair to MongoDB and VoltDB, since both have a richer data and query model.

  32. Multi-core: partition among cores? One option is to run multiple instances, each with its own unique set of keys, as Memcached, Redis, and VoltDB do. Masstree instead uses a single shared tree: every core can access all keys, which reduces imbalance.

  33. A single Masstree performs better for skewed workloads. 140M short KV pairs, read-only, @16 cores, with network (chart: throughput of one Masstree vs. 16 partitioned Masstrees as one partition receives increasingly many times more queries than the others, x-axis 0 to 9). Partitioning avoids remote DRAM access and concurrency control, but under skew it wastes cores: the slide reports 80% idle time for the partitioned setup (1 partition: 40%, 15 partitions: 4%).

  34. Cost of supporting range queries. Without range queries, one could use a hash table instead: no resize cost (pre-allocate a large table), lock-free updates with cmpxchg, support for 8-byte keys only (efficient code), and at 30% full each lookup averages 1.1 hash probes. Measured in the Masstree framework, such a table gets 2.5X the throughput of Masstree; that is, range-query support costs about 2.5X in performance.
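A minimal sketch of the kind of pre-allocated, cmpxchg-based table the slide describes for 8-byte keys (the capacity handling, hash function, and linear probing here are my assumptions, not the measured implementation):

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Fixed-size open-addressing hash table for 8-byte keys: never resized,
// and slots are claimed lock-free with compare_exchange.
class FixedHashTable {
public:
    explicit FixedHashTable(size_t capacity)        // capacity: power of two,
        : keys_(capacity), values_(capacity) {}     // sized well above the data

    bool put(uint64_t key, uint64_t value) {
        size_t mask = keys_.size() - 1;
        for (size_t i = hash(key) & mask, probes = 0;
             probes < keys_.size(); i = (i + 1) & mask, ++probes) {
            uint64_t cur = keys_[i].load(std::memory_order_acquire);
            if (cur == kEmpty) {
                uint64_t expected = kEmpty;          // try to claim the slot
                if (keys_[i].compare_exchange_strong(expected, key))
                    cur = key;
                else
                    cur = expected;                  // another thread claimed it
            }
            if (cur == key) {
                values_[i].store(value, std::memory_order_release);
                return true;
            }
        }
        return false;                                // table full
    }

private:
    static constexpr uint64_t kEmpty = 0;            // reserve 0 to mean "no key"
    static uint64_t hash(uint64_t k) {               // simple mixer, illustrative
        k ^= k >> 33; k *= 0xff51afd7ed558ccdULL; k ^= k >> 33;
        return k;
    }
    std::vector<std::atomic<uint64_t>> keys_;
    std::vector<std::atomic<uint64_t>> values_;
};
```

Because the table is kept only ~30% full and supports nothing but 8-byte point lookups, it trades the ordered, range-scannable structure of Masstree for raw throughput.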

  35. Masstree scales to 12X on 16 cores. Short KV pairs, without logging (chart: per-core throughput vs. number of cores, 1 to 16, against perfect scalability). Gets scale to 12X and puts scale similarly; the limit is the shared memory system.

  36. Related work. [OLFIT]: optimistic concurrency control. [pB+-trees]: B+tree with software prefetch. [pkB-tree]: stores a fixed number of differing bits inline. [PALM]: lock-free B+tree, 2.3X the throughput of [OLFIT]. Masstree is the first system to combine these techniques, with new optimizations: the trie of B+trees and the permuter.

  37. Summary. Masstree is a general-purpose, high-performance, persistent KV store: 5.8 million puts/sec and 8 million gets/sec. More comparisons with other systems appear in the paper. Cache craftiness improves performance by 1.5X.

  38. Thank you!
