Lightweight Synchronization Mechanism for Concurrent Programming
This paper introduces a lightweight synchronization mechanism called Read-Log-Update (RLU), an extension to Read-Copy-Update (RCU). RLU simplifies programming with RCU by providing support for unsynchronized traversals and multi-location atomic updates within a single framework. It aims to preserve RCU's efficiency while offering enhanced functionality for concurrent programming, especially in scenarios requiring complex data structures.
Read-Log-Update: A Lightweight Synchronization Mechanism for Concurrent Programming
Alexander Matveev (MIT), Nir Shavit (MIT and TAU), Pascal Felber (UNINE), Patrick Marlier (UNINE)
Multicore Revolution
- We need concurrent data structures
- And new programming frameworks for concurrency
The Key to Performance in Concurrent Data-Structures
- Unsynchronized traversals: sequences of reads without locks, memory fences, or writes (roughly 90% of the time is spent traversing data)
- Multi-location atomic updates, which hide race conditions from programmers
RCU
Read-Copy-Update (RCU), introduced by McKenney, is a programming framework that provides built-in support for unsynchronized traversals.
RCU
Pros:
- Very efficient (no overhead for readers)
- Popular: the Linux kernel has 6,500+ RCU calls
Cons:
- Hard to program (in non-trivial cases)
- Allows only single-pointer updates: it supports unsynchronized traversals but not multi-location atomic updates
This Paper: RLU
Read-Log-Update (RLU), an extension to RCU that provides both unsynchronized traversals and multi-location atomic updates within a single framework.
- Key benefit: simplifies RCU programming
- Key challenge: preserving RCU's efficiency
RCU Overview: Key Idea
1. To modify objects: duplicate them and modify the copies. This preserves unsynchronized traversals.
2. To commit: use a single pointer update to make the new copies reachable and the old copies unreachable. This must happen all at once!
RCU Key Idea
(1) Duplicate C into a private copy C'.
(2) Single pointer update: make C' reachable and C unreachable.
[Figure: writer P, holding the writer lock, runs Update(C) on list A -> B -> C -> D while readers Q concurrently run Lookup(C).]
But how do we deallocate the old C?
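To make the duplicate-then-publish step concrete, here is a minimal C sketch, assuming liburcu's rcu_assign_pointer macro and a single writer at a time; node_t and update_node are illustrative names, not part of RCU itself.

#include <stdlib.h>
#include <urcu.h>   /* assumes the liburcu userspace RCU library */

typedef struct node { int val; struct node *next; } node_t;  /* illustrative */

/* Writer: replace prev->next (node C) with an updated private copy C'. */
void update_node(node_t *prev, int new_val)
{
    node_t *old_c = prev->next;
    node_t *new_c = malloc(sizeof(*new_c));
    *new_c = *old_c;                        /* (1) duplicate C into C'          */
    new_c->val = new_val;                   /*     modify the private copy      */
    rcu_assign_pointer(prev->next, new_c);  /* (2) single pointer update:       */
                                            /*     C' reachable, C unreachable  */
    /* old_c cannot be freed yet: concurrent readers may still hold it */
}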
How to Free Objects?
- RCU epoch: a time interval after which it is safe to deallocate objects. The writer waits for all current read operations to finish.
- RCU duplication + the RCU epoch together provide unsynchronized traversals AND memory reclamation. This is what makes RCU efficient and practical.
- But RCU allows only single pointer updates.
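For instance, here is how the epoch is typically used, in a minimal sketch assuming liburcu's synchronize_rcu plus the illustrative update_node pattern from above; the reader side shows the unsynchronized traversal that the epoch waits out.

/* Writer: publish the copy, wait out the epoch, then reclaim. */
void update_node_and_free(node_t *prev, int new_val)
{
    node_t *old_c = prev->next;
    node_t *new_c = malloc(sizeof(*new_c));
    *new_c = *old_c;
    new_c->val = new_val;
    rcu_assign_pointer(prev->next, new_c);
    synchronize_rcu();   /* wait for all current read operations to finish */
    free(old_c);         /* safe: no reader can still reach the old copy   */
}

/* Reader: an unsynchronized traversal (no locks, fences, or writes),
 * delimited only by the read-side critical section markers. */
int lookup(node_t *head, int val)
{
    int found = 0;
    rcu_read_lock();
    for (node_t *n = rcu_dereference(head); n != NULL;
         n = rcu_dereference(n->next)) {
        if (n->val == val) { found = 1; break; }
    }
    rcu_read_unlock();
    return found;
}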
The Problem: RCU Single Pointer Updates
[Figure: writer P runs Update(even nodes) on list A -> B -> C -> D -> E while reader Q runs Lookup(even nodes).]
Q sees the new B but not the new D: an inconsistent mix of old and new values.
RCU is Complex
Applying RCU beyond a linked list is worth a paper in a top conference:
- RCU resizable hash tables (Triplett, McKenney, Walpole; USENIX ATC '11)
- RCU balanced trees (Clements, Kaashoek, Zeldovich; ASPLOS '12)
- RCU Citrus trees (Arbel, Attiya; PODC '14 and Arbel, Morrison; PPoPP '15)
Our Work
Read-Log-Update (RLU), an extension to RCU that adds support for multi-pointer atomic updates.
Key idea: use a global clock plus per-thread logs.
RLU Clocks and Logs
[Figure: writer P keeps a per-thread log/buffer storing copies of B and D; every object carries an RLU header. A global clock (22) and P's write clock (unset, used on commit) coordinate with readers: Q reads the global clock on start into its local clock (22).]
RLU Commit: Phase 1
1. P updates the clocks: the write clock goes from (unset) to 23, and the global clock advances from 22 to 23.
2. P executes an RCU epoch: it waits for Q to finish.
A reader steals a copy when: local clock >= write clock. So Q (local clock 22) will read only old objects, while a new reader Z (local clock 23) will read only new objects.
RLU Commit: Phase 2
3. P writes back its log: the copies of B and D overwrite the originals in place.
4. P resets the write clock.
5. P swaps its logs (the current log is safe for re-use after the next commit).
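Putting the two phases together: the sketch below is an illustrative C rendering of this commit protocol and of the reader-side stealing rule. All names here (rlu_thread_t, global_clock, rlu_synchronize, rlu_write_back_log, rlu_swap_logs) are invented for the sketch and are not librlu's actual internals.

#include <stdint.h>
#include <stddef.h>

typedef struct rlu_thread rlu_thread_t;
typedef struct obj_header {
    rlu_thread_t *locking_writer;   /* writer holding this object, or NULL */
    void *copy;                     /* the writer's log copy               */
} obj_header_t;
typedef struct obj { obj_header_t header; /* object data follows */ } obj_t;
struct rlu_thread { uint64_t local_clock; uint64_t write_clock; /* + logs */ };

extern uint64_t global_clock;
void rlu_synchronize(rlu_thread_t *self);     /* RCU-style wait-for-readers */
void rlu_write_back_log(rlu_thread_t *self);  /* copies overwrite originals */
void rlu_swap_logs(rlu_thread_t *self);

/* Writer-side commit, following phases 1 and 2 above. */
void rlu_commit(rlu_thread_t *self)
{
    /* Phase 1: publish logically. */
    self->write_clock = global_clock + 1;  /* readers starting from now     */
    global_clock = global_clock + 1;       /* will steal the new copies     */
    rlu_synchronize(self);                 /* wait for readers whose local
                                              clock is older                */
    /* Phase 2: install physically. */
    rlu_write_back_log(self);              /* write the log back            */
    self->write_clock = UINT64_MAX;        /* reset: stop the stealing      */
    rlu_swap_logs(self);                   /* old log reusable after the
                                              next commit                   */
}

/* Reader-side rule: steal the copy iff local clock >= write clock. */
void *rlu_deref_sketch(rlu_thread_t *self, obj_t *p_obj)
{
    rlu_thread_t *writer = p_obj->header.locking_writer;
    if (writer != NULL && self->local_clock >= writer->write_clock)
        return p_obj->header.copy;   /* new object, from the writer's log */
    return p_obj;                    /* old (original) object             */
}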
RLU Programming
The RLU API extends the RCU API:
- rcu_dereference(..) / rlu_dereference(..)
- rcu_assign_pointer(..) / rlu_assign_pointer(..)
RLU adds a new call, rlu_try_lock(..): to modify an object, lock it. This provides multi-location atomic updates and hides object duplication and manipulation from the programmer.
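Before the full list example, here is a minimal sketch of a two-location atomic update using these calls; node_t and writer_lock are illustrative, and the rlu_try_lock signature follows the simplified form used on these slides rather than the exact librlu one.

/* Atomically update two nodes: concurrent RLU readers observe either
 * both new values or both old values, never a mix. */
void swap_vals(node_t *a, node_t *b)
{
    spin_lock(&writer_lock);   /* serialize writers, as in the next slide */
    rlu_reader_lock();
    rlu_try_lock(&a);          /* duplicate a into the log; a now points  */
    rlu_try_lock(&b);          /* at the private copy, likewise for b     */
    int tmp = a->val;
    a->val = b->val;           /* writes go to the copies                 */
    b->val = tmp;
    rlu_reader_unlock();       /* commit: clock update, epoch, write-back */
    spin_unlock(&writer_lock);
}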
Programming Example: List Delete with a Mutex

void RLU_list_delete(list_t *list, int val) {
    spin_lock(&writer_lock);                 /* acquire mutex and start */
    rlu_reader_lock();
    prev = rlu_dereference(list->head);      /* find node */
    curr = rlu_dereference(prev->next);
    while (curr->val < val) {
        prev = curr;
        curr = rlu_dereference(prev->next);
    }
    next = rlu_dereference(curr->next);      /* delete node */
    rlu_try_lock(&prev);
    rlu_assign_ptr(&(prev->next), next);
    rlu_free(curr);
    rlu_reader_unlock();                     /* finish and release mutex */
    spin_unlock(&writer_lock);
}

How can we eliminate the mutex?
RCU + Fine-Grained Locks
Locking prev and curr is not enough: a thread Q may delete or insert new nodes concurrently.
[Figure: P runs Delete(C) on list A -> B -> C -> E while Q runs Insert(D).]
Programmers need to add custom post-lock validations. In this case, we need to check that:
(1) C.next == E
(2) C is reachable from the head
List Delete without a Mutex

RCU version (custom post-lock validations required):

void RCU_list_delete(list_t *list, int val) {
restart:
    rcu_reader_lock();
    /* find prev and curr */
    /* lock prev and curr */
    if (!try_lock(prev) || !try_lock(curr)) {
        rcu_reader_unlock();
        goto restart;
    }
    /* validate prev and curr */
    if ((curr->is_invalid == 1) || (prev->is_invalid == 1) ||
        (rcu_dereference(prev->next) != curr)) {
        rcu_reader_unlock();
        goto restart;
    }
    /* delete curr and finish */
    next = rcu_dereference(curr->next);
    rcu_assign_ptr(&(prev->next), next);
    curr->is_invalid = 1;
    memory_fence();
    unlock(prev);
    unlock(curr);
    rcu_reader_unlock();
    rcu_free(curr);
}

RLU version (no post-lock validations necessary!):

void RLU_list_delete(list_t *list, int val) {
restart:
    rlu_reader_lock();
    /* find prev and curr */
    /* lock prev and curr */
    if (!rlu_try_lock(prev) || !rlu_try_lock(curr)) {
        rlu_reader_unlock();
        goto restart;
    }
    /* delete curr and finish */
    next = rlu_dereference(curr->next);
    rlu_assign_ptr(&(prev->next), next);
    rlu_free(curr);
    rlu_reader_unlock();
}
Performance
RLU is optimized for read-dominated workloads (like RCU):
- RLU object-lock checks are fast: the locks are co-located with the objects, and stealing is usually rare.
- RLU writers are more expensive than RCU writers, but this is not significant for read-dominated workloads.
Tested in userspace and in the kernel.
Userspace Hash Table and Linked List (kernel results are similar)
[Charts: throughput (total ops in millions) vs. number of threads (1-16) for RCU and RLU. Hash table: 1000 buckets, 100 nodes each, at 2% and 20% updates. Linked list: 1000 nodes, at 2% and 20% updates.]
Applying RLU to Kyoto CacheDB
Kyoto CacheDB uses a reader-writer lock and a per-slot lock (the DB is broken into slots). The reader-writer lock is a serial bottleneck, so we use RLU to eliminate it. It was easy to apply:
- Use the slot locks to serialize writers to the same slot
- Simply lock each object before modification
RLU and Original Kyoto CacheDB
[Charts: throughput (total operations in 10 secs) vs. number of threads (1-16) for Orig, Orig-Fixed, and RLU, on a 1.0 GB cache DB at 2% and 10% mutation ratios.]
Conclusion
RLU adds multi-pointer atomic updates to RCU while maintaining efficiency, both in userspace and in the kernel.
Much more in the paper:
- Optimizations (deferral)
- Benchmarks (kernel, Citrus tree, resizable hash table)
RLU is available as open source (MIT license): https://github.com/rlu-sync
Appendix
1. RLU-Defer
2. Kernel tests
3. RCU vs. RLU resizable hash table
RLU-Defer
RLU writers are slower than RCU writers because they must execute wait-for-readers. RLU-Defer reduces this cost (by roughly 10x). Note that wait-for-readers is what writes back and unlocks objects, but unlocking is only needed on a write-write conflict, so RLU-Defer executes wait-for-readers only when a write-write conflict actually occurs.
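A rough sketch of that conflict-triggered path, reusing the illustrative types from the commit sketch earlier; rlu_force_sync and rlu_log_copy are invented names for "make the owner run wait-for-readers" and "duplicate into my log", not real librlu calls.

void rlu_force_sync(rlu_thread_t *owner);              /* invented helper */
void rlu_log_copy(rlu_thread_t *self, obj_t *p_obj);   /* invented helper */

/* With RLU-Defer, a writer finishes without running wait-for-readers:
 * its objects stay locked and the copies stay in its log. Only when
 * another writer hits one of those locked objects (a write-write
 * conflict) is the owner forced to commit, and the blocked writer
 * retries. A real implementation would take the lock with a CAS. */
int rlu_try_lock_deferred(rlu_thread_t *self, obj_t *p_obj)
{
    rlu_thread_t *owner = p_obj->header.locking_writer;
    if (owner != NULL && owner != self) {
        rlu_force_sync(owner);   /* owner runs wait-for-readers:        */
        return 0;                /* write-back + unlock; caller retries */
    }
    rlu_log_copy(self, p_obj);             /* duplicate into our log    */
    p_obj->header.locking_writer = self;   /* take the object lock      */
    return 1;
}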
RLU-Defer
RLU-Defer is significant for many threads.
[Chart: Citrus tree with 100,000 nodes; throughput (total operations) vs. number of threads (1-80) for RCU-Citrus-10/20/40 and RLU-Defer-Citrus-10/20/40.]
Resizable Hash Table Code Comparison

RCU version:

int rcu_hash_list_remove(volatile hash_list_t **p_p_hash_list, val_t val)
{
    int res;
    int bin;
    int is_zip_locked;
    node_t *p_node;
    node_t *p_prev;
    list_t *p_list;
    list_t *p_list_2;
    volatile node_t *p_unzip_node;
    volatile hash_list_t *p_hash_list;

retry_lock:
    RCU_READER_LOCK();
    p_hash_list = *p_p_hash_list;
    bin = HASH_VALUE(p_hash_list, val);
    p_list = p_hash_list->buckets[bin].p_list;
    if (!try_lock_list_no_spin(p_list)) {
        RCU_READER_UNLOCK();
        PAUSE();
        goto retry_lock;
    }
    is_zip_locked = 0;
    if (p_list->is_zipped) {
        p_list_2 = p_hash_list->buckets[GET_PAIR_BUCKET(p_hash_list, bin)].p_list;
        if (!try_lock_list_no_spin(p_list_2)) {
            unlock_list(p_list);
            RCU_READER_UNLOCK();
            PAUSE();
            goto retry_lock;
        }
        is_zip_locked = 1;
    }
    p_node = rcu_list_find(p_list, val, &p_prev);
    res = (p_node != NULL) && (p_node->val == val);
    if (!res) {
        goto finish;
    }
    if (p_prev == NULL) {
        RCU_ASSIGN_PTR(&(p_list->p_head), p_node->p_next);
        if (p_list->is_zipped) {
            if (p_list->p_unzip_node == NULL) {
                p_list_2->p_head = p_node->p_next;
            } else if (p_list->p_unzip_node == p_node) {
                p_list->p_unzip_node = NULL;
                p_list_2->p_unzip_node = NULL;
                p_list->p_head = p_list_2->p_head;
            } else if (p_list->p_unzip_node->p_next == p_node) {
                p_list->p_unzip_node->p_next = p_node->p_next;
                p_list->p_unzip_node = NULL;
                p_list_2->p_unzip_node = NULL;
                p_list->p_head = p_list_2->p_head;
            }
        }
    } else {
        RCU_ASSIGN_PTR(&(p_prev->p_next), p_node->p_next);
        if (p_list->is_zipped) {
            if (p_list->p_unzip_node != NULL) {
                if (p_list->p_unzip_node == p_node) {
                    p_list->p_unzip_node = p_prev;
                    p_list_2->p_unzip_node = p_prev;
                } else if (p_list->p_unzip_node->p_next == p_node) {
                    p_list->p_unzip_node->p_next = p_node->p_next;
                    if (p_node->p_next != NULL) {
                        if (HASH_VALUE(p_hash_list, p_list->p_unzip_node->val) !=
                            HASH_VALUE(p_hash_list, p_list->p_unzip_node->p_next->val)) {
                            p_list->p_unzip_node = p_prev;
                            p_list_2->p_unzip_node = p_prev;
                        } else {
                            p_list->p_unzip_node = NULL;
                            p_list_2->p_unzip_node = NULL;
                            p_list->is_zipped = 0;
                            p_list_2->is_zipped = 0;
                        }
                    }
                }
            }
        }
    }
finish:
    unlock_list(p_list);
    if (is_zip_locked) {
        unlock_list(p_list_2);
    }
    RCU_READER_UNLOCK();
    return res;
}

RLU version:

int rlu_hash_list_remove(rlu_thread_data_t *self,
                         volatile hash_list_t **p_p_hash_list, val_t val)
{
    int res;
    int bin;
    list_t *p_list;
    node_t *p_prev;
    node_t *p_node;
    volatile hash_list_t *p_hash_list;

    RLU_START(self);
    p_hash_list = *p_p_hash_list;
    bin = HASH_VALUE(p_hash_list, val);
    RLU_WRITER_LOCK(self, bin);
    p_list = RLU_DEREF_LIST(self, p_hash_list->buckets[bin].p_list);
    p_node = rlu_list_find(self, p_list, val, &p_prev);
    res = (p_node != NULL) && (p_node->val == val);
    if (res) {
        if (p_prev == NULL) {
            p_list = (list_t *)RLU_LOCK(self, p_list);
            RLU_ASSIGN_PTR(self, &(p_list->p_head), p_node->p_next);
        } else {
            p_prev = RLU_LOCK(self, p_prev);
            RLU_ASSIGN_PTR(self, &(p_prev->p_next), p_node->p_next);
        }
        RLU_FREE(self, p_node);
    }
    RLU_WRITER_UNLOCK(self);
    RLU_FINISH(self);
    return res;
}
Resizable Hash Table Performance
[Chart: resizable hash table with 2^16 total items, resizing between 2^13 and 2^14 buckets; throughput (total ops) vs. number of threads for RCU 8K, RCU 16K, RCU 8K-16K, RLU 8K, RLU 16K, and RLU 8K-16K.]