Enhancing Key-Value Storage with MemC3 and Cuckoo Hashing
MemC3 is a specialized key-value store that combines CLOCK-based eviction with concurrent cuckoo hashing to improve performance and memory efficiency. The presentation first reviews Memcached, an established DRAM-based key-value store, and its LRU eviction strategy, highlighting the space overhead of its internal chaining hashtable and linked-list LRU. For read-intensive workloads with small objects, MemC3 achieves a 3x throughput increase while storing 30% more objects. Cuckoo hashing is then detailed as a method that uses two hash functions to optimize key placement.
MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing Bin Fan (CMU) Dave Andersen (CMU), Michael Kaminsky (Intel Labs) Presenter: Shouqian Shi Credit to: Son Nguyen
Memcached A DRAM-based key-value store to alleviate database load. Interface: Set(key, value); Get(key) -> value
Memcached LRU (Least Recently Used) eviction. Often used for small objects (Facebook [Atikoglu12]): 90% of keys < 31 bytes; tens of millions of queries per second (Facebook [Atikoglu12])
Memcached internals Chaining hashtable; LRU caching using a doubly linked list; a global lock; huge space overhead
Goals Target: read-intensive workloads with small objects. Reduce space overhead (bytes/key); improve throughput (queries/sec). Result: 3x throughput, 30% more objects stored
Memcached: Chaining Hashtable Uses linked lists: costly space overhead for pointers; pointer dereferences are slow (no benefit from the CPU cache); reads are not constant time (a chain may be long)
Cuckoo Hashing Use two hash functions, HASH1 and HASH2, to pick two candidate buckets for each key. Each bucket has exactly 4 slots (fits in the CPU cache). Each (key, value) object can therefore reside in one of 8 possible slots.
Cuckoo Hashing [Figures: insertion walkthrough for keys a, b, and c. Each key is placed in one of its two candidate buckets, HASH1(k) or HASH2(k); when both buckets are full, an existing key is kicked out to its alternate bucket to make room. Done!]
Cuckoo Hashing Read: 4 lookups on average. Write(ka, va): find an empty slot among the 8 possible slots for ka; if all are full, randomly kick some (kb, vb) out; now find an empty slot for (kb, vb); repeat 500 times or until an empty slot is found; if still not found, do a table expansion
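The read and write procedures above can be sketched as follows. This is a minimal single-threaded illustration, not MemC3's implementation: the bucket count, the salted second hash, and the use of Python tuples for entries are all assumptions made for brevity.

```python
import random

BUCKETS = 16   # number of buckets (illustrative; a real table is far larger)
SLOTS = 4      # 4 slots per bucket, as on the slide
MAX_KICKS = 500

table = [[None] * SLOTS for _ in range(BUCKETS)]

def hash1(key):
    return hash(key) % BUCKETS

def hash2(key):
    # A salted hash stands in for a second, independent hash function
    return hash((key, "salt")) % BUCKETS

def lookup(key):
    # Read: probe at most 2 buckets (8 slots total)
    for b in (hash1(key), hash2(key)):
        for entry in table[b]:
            if entry is not None and entry[0] == key:
                return entry[1]
    return None

def insert(key, value):
    # Write: try both candidate buckets; if full, kick a random victim
    # out and re-insert it, repeating up to MAX_KICKS times
    for _ in range(MAX_KICKS):
        for b in (hash1(key), hash2(key)):
            for i in range(SLOTS):
                if table[b][i] is None:
                    table[b][i] = (key, value)
                    return True
        b = random.choice((hash1(key), hash2(key)))
        i = random.randrange(SLOTS)
        # Swap the new item in and continue the loop with the victim
        (key, value), table[b][i] = table[b][i], (key, value)
    return False  # caller would trigger a table expansion
```

Note that `insert` reuses the loop variables for the displaced victim, so each iteration of the outer loop tries to re-home whichever item is currently "outside" the table.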
Cuckoo's advantages Concurrency: multiple readers / single writer. Read-optimized (entries fit in the CPU cache). Still O(1) amortized time for writes. 30% less space overhead. 95% table occupancy
Floating Problem During an insertion, one item is always temporarily outside the table, so a concurrent reader can see a false cache miss. Solution: compute the kick-out path (cuckoo path) first, then move items backward along it
Computed Cuckoo path / Cuckoo path backward insert [Figures: inserting a with the optimization. The cuckoo path starting at HASH1(ka) is computed first, without moving anything; items are then shifted backward along the path, starting from the empty slot at its end, so each key is copied to its new slot before its old slot is overwritten and no key is ever missing from the table.]
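The two-phase insertion shown in the figures can be sketched as below. This is a single-threaded sketch under the same assumed table layout (two hash functions, 4 slots per bucket); the depth-limited DFS path search and the salted second hash are illustrative choices, not MemC3's exact mechanism.

```python
# Minimal standalone table (same layout assumptions as before)
BUCKETS, SLOTS = 16, 4
table = [[None] * SLOTS for _ in range(BUCKETS)]

def hash1(key): return hash(key) % BUCKETS
def hash2(key): return hash((key, "salt")) % BUCKETS

def find_path(bucket, depth):
    """Depth-limited DFS (moves nothing) for a chain of (bucket, slot)
    positions ending at an empty slot."""
    for i in range(SLOTS):
        if table[bucket][i] is None:
            return [(bucket, i)]          # path ends at an empty slot
    if depth == 0:
        return None
    for i in range(SLOTS):
        victim = table[bucket][i][0]
        # the victim's alternate bucket
        alt = hash2(victim) if hash1(victim) == bucket else hash1(victim)
        rest = find_path(alt, depth - 1)
        if rest is not None:
            return [(bucket, i)] + rest
    return None

def insert(key, value):
    path = None
    for b in (hash1(key), hash2(key)):
        path = find_path(b, depth=5)
        if path:
            break
    if not path:
        return False                      # would trigger a table expansion
    # Move backward: fill the empty slot at the end of the path first,
    # so every key stays readable in some slot throughout the insertion.
    for j in range(len(path) - 1, 0, -1):
        db, ds = path[j]                  # destination (vacated or empty)
        sb, ss = path[j - 1]              # source = next victim on the path
        table[db][ds] = table[sb][ss]
    b0, s0 = path[0]
    table[b0][s0] = (key, value)
    return True
```

Because each move copies a key into its alternate bucket before clearing its old slot, a concurrent reader probing either candidate bucket never observes the false miss described on the previous slide.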
Memcached: Doubly linked list At least two pointers per item: expensive for small key-value pairs. Both reads and writes update an item's position and change the list's structure, so threads must lock it (no concurrency)
Solution: CLOCK-based LRU Only supports multiple readers / single writer. Approximate LRU: a circular queue instead of a linked list means less space overhead, 1 bit per entry vs. 16 bytes of pointers
CLOCK example

               entry:    (ka,va) (kb,vb) (kc,vc) (kd,vd) (ke,ve)
Originally:    recency:     0       1       0       0       1

Read(kd):      recency:     0       1       0       1       1

Write(kf,vf):  entry:    (ka,va) (kb,vb) (kf,vf) (kd,vd) (ke,ve)
               recency:     0       0       1       1       1

Write(kg,vg):  entry:    (kg,vg) (kb,vb) (kf,vf) (kd,vd) (ke,ve)
               recency:     1       0       1       0       0
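The CLOCK policy in this example can be sketched as a small class. This is an illustrative standalone sketch: in MemC3 the slot would be located through the cuckoo hashtable rather than the linear scan used here, and the starting hand position (at kb) is an assumption chosen to reproduce the example above.

```python
class ClockCache:
    """Approximate LRU: one recency bit per entry plus a circular hand,
    instead of a doubly linked list with two pointers per item."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.keys = [None] * capacity
        self.vals = [None] * capacity
        self.recency = [0] * capacity
        self.hand = 0

    def get(self, key):
        # (A real store finds the slot via the hashtable; a linear scan
        # keeps this sketch short.)
        for i in range(self.capacity):
            if self.keys[i] == key:
                self.recency[i] = 1      # mark as recently used: 1 bit
                return self.vals[i]
        return None

    def put(self, key, value):
        # Sweep the hand, clearing recency bits, until an entry with
        # bit 0 is found; evict it and install the new entry with bit 1.
        while True:
            i = self.hand
            self.hand = (self.hand + 1) % self.capacity
            if self.recency[i] == 0:
                self.keys[i], self.vals[i], self.recency[i] = key, value, 1
                return
            self.recency[i] = 0
```

Readers only ever set a recency bit, so multiple readers can proceed concurrently while the single writer sweeps the hand, matching the multiple-readers/single-writer model above.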
Evaluation 68% throughput improvement in the all-hit case; 235% in the all-miss case
Evaluation 3x throughput on a real workload
Discussion Single-machine, multi-core optimization: no cooperation between machines, so a load balancer is needed and hotspots cannot be addressed at the cluster level. Atomic insertion: a lock is held along the cuckoo path. The impact of hotspots: false evictions?