Enhancing MemC3: Compact and Concurrent MemCache for Improved Performance

Slide Note
Embed
Share

MemC3 introduces a novel approach to compact and concurrent caching through dumber caching and smarter hashing techniques, addressing key issues faced by traditional memory caching systems. By implementing CLOCK-based LRU, approximate LRU, and utilizing Cuckoo Hashing, MemC3 achieves significant improvements in space overhead reduction and throughput for read-intensive workloads with small objects.


Uploaded on Sep 27, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing Bin Fan, David G. Andersen, Michael Kaminsky Presenter: Son Nguyen

  2. Memcached internal Memcached: Core Data Structures LRU caching using chaining Hashtable and doubly linked list Chaining hash table Hash table w/ chaining LRU Eviction: Doubly-linked lists Key-Value Index: Doubly-linked-list (for each slab) L R U h e a d e r KV KV KV KV KV http://www.pdl.cmu.edu/ 7 Bin Fan @ NSDI 2013 !

  3. Goals Reduce space overhead (bytes/key) Improve throughput (queries/sec) Target read-intensive workload with small objects Result: 3X throughput, 30% more objects

  4. Doubly-linked-lists problems At least two pointers per item -> expensive Both read and write change the list s structure -> need locking between threads (no concurrency)

  5. Solution: CLOCK-based LRU Approximate LRU Multiple readers/single writer Circular queue instead of linked list -> less space overhead

  6. CLOCK example entry (ka, va) (kb, (kc, vc) (kd, vd) (ke, ve) Originally: vb) recency 1 0 1 1 0 entry (ka, va) (kb, (kc, vc) (kd, vd) (ke, ve) vb) Read(kd): recency 1 0 1 0 0 entry (ka, va) (kb, (kf, vf) (kd, vd) (ke, ve) Write(kf, vf): vb) recency 1 1 0 0 0 entry (kg, vg) (kb, (kf, vf) (kd, vd) (ke, ve) Write(kg, vg): vb) recency 0 1 0 1 1

  7. Chaining Hashtables problems Use linked list -> costly space overhead for pointers Pointer dereference is slow (no advantage from CPU cache) Read is not constant time (due to possibly long list)

  8. Solution: Cuckoo Hashing Use 2 hashtables Each bucket has exactly 4 slots (fits in CPU cache) Each (key, value) object therefore can reside at one of the 8 possible slots

  9. Cuckoo Hashing HASH1(ka) (ka,va) HASH2(ka)

  10. Cuckoo Hashing Read: always 8 lookups (constant, fast) Write: write(ka, va) Find an empty slot in 8 possible slots of ka If all are full then randomly kick some (kb, vb) out Now find an empty slot for (kb, vb) Repeat 500 times or until an empty slot is found If still not found then do table expansion

  11. Cuckoo Hashing X b a X X X Insert a: HASH1(ka) X X X X X (ka,va) HASH2(ka) X X X X c X X X X X X X X X X

  12. Cuckoo Hashing X X X a X Insert b: HASH1(kb) X X X X X (kb,vb) X X X c b X X X HASH2(kb) X X X X X X X X

  13. Cuckoo Hashing X X X a X Insert c: X HASH1(kc) X X c X X (kc,vc) X X X X b X X HASH2(kc) X X X X X X X Done !!! X

  14. Cuckoo Hashing Problem: after (kb, vb) is kicked out, a reader might attempt to read (kb, vb) and get a false cache miss Solution: Compute the kick out path (Cuckoo path) first, then move items backward Before: (b,c,Null)->(a,c,Null)->(a,b,Null)->(a,b,c) Fixed: (b,c,Null)->(b,c,c)->(b,b,c)->(a,b,c)

  15. Cuckoo path X X X b X Insert a: HASH1(ka) X X X X X (ka,va) HASH2(ka) X X X X c X X X X X X X X X X

  16. Cuckoo path backward insert X b a X X X Insert a: HASH1(ka) X X X X X (ka,va) HASH2(ka) X X X X X X c X X X X X X X X

  17. Cuckoos advantages Concurrency: multiple readers/single writer Read optimized (entries fit in CPU cache) Still O(1) amortized time for write 30% less space overhead 95% table occupancy

  18. Evaluation 68% throughput improvement in all hit case. 235% for all miss

  19. Evaluation End-to-end Performance 3x throughput on real workload 16-Byte key, 32-Byte Value, 95% GET, 5% SET, zipfdistributed 50 remoteclients generate workloads max tput 4.3 MOPS 5" 4" MemC3 max tput 1.5 MOPS MQPS% 3" 2" Memcached 1" Sharding 0" 1" 2" 4" 6" 8" 10" 12" 14" 16" Number% of% Server% Threads% % max tput 0.6 MOPS http://www.pdl.cmu.edu/ 30 Bin Fan @ NSDI 2013 !

  20. Discussion Write is slower than chaining Hashtable Chaining Hashtable: 14.38 million keys/sec Cuckoo: 7 million keys/sec Idea: finding cuckoo path in parallel Benchmark doesn t show much improvement Can we make it write-concurrent?

Related


More Related Content