Lightweight Synchronization Mechanism for Concurrent Programming
This paper introduces a lightweight synchronization mechanism called Read-Log-Update (RLU), an extension to Read-Copy-Update (RCU). RLU simplifies programming with RCU by providing support for unsynchronized traversals and multi-location atomic updates within a single framework. It aims to preserve RCU's efficiency while offering enhanced functionality for concurrent programming, especially in scenarios requiring complex data structures.
Read-Log-Update: A Lightweight Synchronization Mechanism for Concurrent Programming
Alexander Matveev (MIT), Nir Shavit (MIT and TAU), Pascal Felber (UNINE), Patrick Marlier (UNINE)
Multicore Revolution
- We need concurrent data structures
- And new programming frameworks for concurrency
The Key to Performance in Concurrent Data-Structures
- Unsynchronized traversals: sequences of reads without locks, memory fences, or writes (roughly 90% of the time is spent traversing data)
- Multi-location atomic updates, which hide race conditions from programmers
RCU
Read-Copy-Update (RCU), introduced by McKenney, is a programming framework that provides built-in support for unsynchronized traversals.
RCU
Pros:
- Very efficient (no overhead for readers)
- Popular: the Linux kernel has 6,500+ RCU calls
Cons:
- Hard to program (in non-trivial cases)
- Allows only single-pointer updates: it supports unsynchronized traversals but not multi-location atomic updates
This Paper: RLU
Read-Log-Update (RLU), an extension to RCU that provides both unsynchronized traversals and multi-location atomic updates within a single framework.
- Key benefit: simplifies RCU programming
- Key challenge: preserving RCU's efficiency
RCU Overview: Key Idea
1. To modify objects: duplicate them and modify the copies. This preserves unsynchronized traversals.
2. To commit: use a single pointer update to make the new copies reachable and the old copies unreachable. This must happen all at once!
RCU Key Idea
(1) Duplicate C into a private copy C'.
(2) Single pointer update: make C' reachable and C unreachable.
[Figure: writer P, holding the writer lock, runs Update(C) on list A -> B -> C -> D while readers Q concurrently run Lookup(C).]
But how do we deallocate the old C?
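To make the duplicate-then-publish step concrete, here is a minimal C sketch, assuming liburcu's rcu_assign_pointer macro and a single writer at a time; node_t and update_node are illustrative names, not part of RCU itself.

#include <stdlib.h>
#include <urcu.h>   /* assumes the liburcu userspace RCU library */

typedef struct node { int val; struct node *next; } node_t;  /* illustrative */

/* Writer: replace prev->next (node C) with an updated private copy C'. */
void update_node(node_t *prev, int new_val)
{
    node_t *old_c = prev->next;
    node_t *new_c = malloc(sizeof(*new_c));
    *new_c = *old_c;                        /* (1) duplicate C into C'          */
    new_c->val = new_val;                   /*     modify the private copy      */
    rcu_assign_pointer(prev->next, new_c);  /* (2) single pointer update:       */
                                            /*     C' reachable, C unreachable  */
    /* old_c cannot be freed yet: concurrent readers may still hold it */
}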
How to Free Objects?
- RCU epoch: a time interval after which it is safe to deallocate objects. The writer waits for all current read operations to finish.
- RCU duplication + the RCU epoch together provide unsynchronized traversals AND memory reclamation. This is what makes RCU efficient and practical.
- But RCU allows only single pointer updates.
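For instance, here is how the epoch is typically used, in a minimal sketch assuming liburcu's synchronize_rcu plus the illustrative update_node pattern from above; the reader side shows the unsynchronized traversal that the epoch waits out.

/* Writer: publish the copy, wait out the epoch, then reclaim. */
void update_node_and_free(node_t *prev, int new_val)
{
    node_t *old_c = prev->next;
    node_t *new_c = malloc(sizeof(*new_c));
    *new_c = *old_c;
    new_c->val = new_val;
    rcu_assign_pointer(prev->next, new_c);
    synchronize_rcu();   /* wait for all current read operations to finish */
    free(old_c);         /* safe: no reader can still reach the old copy   */
}

/* Reader: an unsynchronized traversal (no locks, fences, or writes),
 * delimited only by the read-side critical section markers. */
int lookup(node_t *head, int val)
{
    int found = 0;
    rcu_read_lock();
    for (node_t *n = rcu_dereference(head); n != NULL;
         n = rcu_dereference(n->next)) {
        if (n->val == val) { found = 1; break; }
    }
    rcu_read_unlock();
    return found;
}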
The Problem: RCU Single Pointer Updates
[Figure: writer P runs Update(even nodes) on list A -> B -> C -> D -> E while reader Q runs Lookup(even nodes).]
Q sees the new B but not the new D: an inconsistent mix of old and new values.
RCU is Complex
Applying RCU beyond a linked list is worth a paper in a top conference:
- RCU resizable hash tables (Triplett, McKenney, Walpole; USENIX ATC '11)
- RCU balanced trees (Clements, Kaashoek, Zeldovich; ASPLOS '12)
- RCU Citrus trees (Arbel, Attiya; PODC '14 and Arbel, Morrison; PPoPP '15)
Our Work
Read-Log-Update (RLU), an extension to RCU that adds support for multi-pointer atomic updates.
Key idea: use a global clock plus per-thread logs.
RLU Clocks and Logs
[Figure: writer P keeps a per-thread log/buffer storing copies of B and D; every object carries an RLU header. A global clock (22) and P's write clock (unset, used on commit) coordinate with readers: Q reads the global clock on start into its local clock (22).]
RLU Commit: Phase 1
1. P updates the clocks: the write clock goes from (unset) to 23, and the global clock advances from 22 to 23.
2. P executes an RCU epoch: it waits for Q to finish.
A reader steals a copy when: local clock >= write clock. So Q (local clock 22) will read only old objects, while a new reader Z (local clock 23) will read only new objects.
RLU Commit: Phase 2
3. P writes back its log: the copies of B and D overwrite the originals in place.
4. P resets the write clock.
5. P swaps its logs (the current log is safe for re-use after the next commit).
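Putting the two phases together: the sketch below is an illustrative C rendering of this commit protocol and of the reader-side stealing rule. All names here (rlu_thread_t, global_clock, rlu_synchronize, rlu_write_back_log, rlu_swap_logs) are invented for the sketch and are not librlu's actual internals.

#include <stdint.h>
#include <stddef.h>

typedef struct rlu_thread rlu_thread_t;
typedef struct obj_header {
    rlu_thread_t *locking_writer;   /* writer holding this object, or NULL */
    void *copy;                     /* the writer's log copy               */
} obj_header_t;
typedef struct obj { obj_header_t header; /* object data follows */ } obj_t;
struct rlu_thread { uint64_t local_clock; uint64_t write_clock; /* + logs */ };

extern uint64_t global_clock;
void rlu_synchronize(rlu_thread_t *self);     /* RCU-style wait-for-readers */
void rlu_write_back_log(rlu_thread_t *self);  /* copies overwrite originals */
void rlu_swap_logs(rlu_thread_t *self);

/* Writer-side commit, following phases 1 and 2 above. */
void rlu_commit(rlu_thread_t *self)
{
    /* Phase 1: publish logically. */
    self->write_clock = global_clock + 1;  /* readers starting from now     */
    global_clock = global_clock + 1;       /* will steal the new copies     */
    rlu_synchronize(self);                 /* wait for readers whose local
                                              clock is older                */
    /* Phase 2: install physically. */
    rlu_write_back_log(self);              /* write the log back            */
    self->write_clock = UINT64_MAX;        /* reset: stop the stealing      */
    rlu_swap_logs(self);                   /* old log reusable after the
                                              next commit                   */
}

/* Reader-side rule: steal the copy iff local clock >= write clock. */
void *rlu_deref_sketch(rlu_thread_t *self, obj_t *p_obj)
{
    rlu_thread_t *writer = p_obj->header.locking_writer;
    if (writer != NULL && self->local_clock >= writer->write_clock)
        return p_obj->header.copy;   /* new object, from the writer's log */
    return p_obj;                    /* old (original) object             */
}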
RLU Programming
The RLU API extends the RCU API:
- rcu_dereference(..) / rlu_dereference(..)
- rcu_assign_pointer(..) / rlu_assign_pointer(..)
RLU adds a new call, rlu_try_lock(..): to modify an object, lock it. This provides multi-location atomic updates and hides object duplication and manipulation from the programmer.
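Before the full list example, here is a minimal sketch of a two-location atomic update using these calls; node_t and writer_lock are illustrative, and the rlu_try_lock signature follows the simplified form used on these slides rather than the exact librlu one.

/* Atomically update two nodes: concurrent RLU readers observe either
 * both new values or both old values, never a mix. */
void swap_vals(node_t *a, node_t *b)
{
    spin_lock(&writer_lock);   /* serialize writers, as in the next slide */
    rlu_reader_lock();
    rlu_try_lock(&a);          /* duplicate a into the log; a now points  */
    rlu_try_lock(&b);          /* at the private copy, likewise for b     */
    int tmp = a->val;
    a->val = b->val;           /* writes go to the copies                 */
    b->val = tmp;
    rlu_reader_unlock();       /* commit: clock update, epoch, write-back */
    spin_unlock(&writer_lock);
}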
Programming Example: List Delete with a Mutex

void RLU_list_delete(list_t *list, int val) {
    spin_lock(&writer_lock);                 /* acquire mutex and start */
    rlu_reader_lock();
    prev = rlu_dereference(list->head);      /* find node */
    curr = rlu_dereference(prev->next);
    while (curr->val < val) {
        prev = curr;
        curr = rlu_dereference(prev->next);
    }
    next = rlu_dereference(curr->next);      /* delete node */
    rlu_try_lock(&prev);
    rlu_assign_ptr(&(prev->next), next);
    rlu_free(curr);
    rlu_reader_unlock();                     /* finish and release mutex */
    spin_unlock(&writer_lock);
}

How can we eliminate the mutex?
RCU + Fine-Grained Locks
Locking prev and curr is not enough: a thread Q may delete or insert new nodes concurrently.
[Figure: P runs Delete(C) on list A -> B -> C -> E while Q runs Insert(D).]
Programmers need to add custom post-lock validations. In this case, we need to check that:
(1) C.next == E
(2) C is reachable from the head
List Delete without a Mutex

RCU version (custom post-lock validations required):

void RCU_list_delete(list_t *list, int val) {
restart:
    rcu_reader_lock();
    /* find prev and curr */
    /* lock prev and curr */
    if (!try_lock(prev) || !try_lock(curr)) {
        rcu_reader_unlock();
        goto restart;
    }
    /* validate prev and curr */
    if ((curr->is_invalid == 1) || (prev->is_invalid == 1) ||
        (rcu_dereference(prev->next) != curr)) {
        rcu_reader_unlock();
        goto restart;
    }
    /* delete curr and finish */
    next = rcu_dereference(curr->next);
    rcu_assign_ptr(&(prev->next), next);
    curr->is_invalid = 1;
    memory_fence();
    unlock(prev);
    unlock(curr);
    rcu_reader_unlock();
    rcu_free(curr);
}

RLU version (no post-lock validations necessary!):

void RLU_list_delete(list_t *list, int val) {
restart:
    rlu_reader_lock();
    /* find prev and curr */
    /* lock prev and curr */
    if (!rlu_try_lock(prev) || !rlu_try_lock(curr)) {
        rlu_reader_unlock();
        goto restart;
    }
    /* delete curr and finish */
    next = rlu_dereference(curr->next);
    rlu_assign_ptr(&(prev->next), next);
    rlu_free(curr);
    rlu_reader_unlock();
}
Performance
RLU is optimized for read-dominated workloads (like RCU):
- RLU object-lock checks are fast: the locks are co-located with the objects, and stealing is usually rare.
- RLU writers are more expensive than RCU writers, but this is not significant for read-dominated workloads.
Tested in userspace and in the kernel.
Userspace Hash Table and Linked List (kernel results are similar)
[Charts: throughput (total ops in millions) vs. number of threads (1-16) for RCU and RLU. Hash table: 1000 buckets, 100 nodes each, at 2% and 20% updates. Linked list: 1000 nodes, at 2% and 20% updates.]
Applying RLU to Kyoto CacheDB
Kyoto CacheDB uses a reader-writer lock and a per-slot lock (the DB is broken into slots). The reader-writer lock is a serial bottleneck, so we use RLU to eliminate it. It was easy to apply:
- Use the slot locks to serialize writers to the same slot
- Simply lock each object before modification
RLU and Original Kyoto CacheDB
[Charts: throughput (total operations in 10 secs) vs. number of threads (1-16) for Orig, Orig-Fixed, and RLU, on a 1.0 GB cache DB at 2% and 10% mutation ratios.]
Conclusion
RLU adds multi-pointer atomic updates to RCU while maintaining efficiency, both in userspace and in the kernel.
Much more in the paper:
- Optimizations (deferral)
- Benchmarks (kernel, Citrus tree, resizable hash table)
RLU is available as open source (MIT license): https://github.com/rlu-sync
Appendix
1. RLU-Defer
2. Kernel tests
3. RCU vs. RLU resizable hash table
RLU-Defer
RLU writers are slower than RCU writers because they must execute wait-for-readers. RLU-Defer reduces this cost (by roughly 10x). Note that wait-for-readers is what writes back and unlocks objects, but unlocking is only needed on a write-write conflict, so RLU-Defer executes wait-for-readers only when a write-write conflict actually occurs.
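A rough sketch of that conflict-triggered path, reusing the illustrative types from the commit sketch earlier; rlu_force_sync and rlu_log_copy are invented names for "make the owner run wait-for-readers" and "duplicate into my log", not real librlu calls.

void rlu_force_sync(rlu_thread_t *owner);              /* invented helper */
void rlu_log_copy(rlu_thread_t *self, obj_t *p_obj);   /* invented helper */

/* With RLU-Defer, a writer finishes without running wait-for-readers:
 * its objects stay locked and the copies stay in its log. Only when
 * another writer hits one of those locked objects (a write-write
 * conflict) is the owner forced to commit, and the blocked writer
 * retries. A real implementation would take the lock with a CAS. */
int rlu_try_lock_deferred(rlu_thread_t *self, obj_t *p_obj)
{
    rlu_thread_t *owner = p_obj->header.locking_writer;
    if (owner != NULL && owner != self) {
        rlu_force_sync(owner);   /* owner runs wait-for-readers:        */
        return 0;                /* write-back + unlock; caller retries */
    }
    rlu_log_copy(self, p_obj);             /* duplicate into our log    */
    p_obj->header.locking_writer = self;   /* take the object lock      */
    return 1;
}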
RLU-Defer
RLU-Defer is significant for many threads.
[Chart: Citrus tree with 100,000 nodes; throughput (total operations) vs. number of threads (1-80) for RCU-Citrus-10/20/40 and RLU-Defer-Citrus-10/20/40.]
Resizable Hash Table Code Comparison

RCU version:

int rcu_hash_list_remove(volatile hash_list_t **p_p_hash_list, val_t val)
{
    int res;
    int bin;
    int is_zip_locked;
    node_t *p_node;
    node_t *p_prev;
    list_t *p_list;
    list_t *p_list_2;
    volatile node_t *p_unzip_node;
    volatile hash_list_t *p_hash_list;

retry_lock:
    RCU_READER_LOCK();
    p_hash_list = *p_p_hash_list;
    bin = HASH_VALUE(p_hash_list, val);
    p_list = p_hash_list->buckets[bin].p_list;
    if (!try_lock_list_no_spin(p_list)) {
        RCU_READER_UNLOCK();
        PAUSE();
        goto retry_lock;
    }
    is_zip_locked = 0;
    if (p_list->is_zipped) {
        p_list_2 = p_hash_list->buckets[GET_PAIR_BUCKET(p_hash_list, bin)].p_list;
        if (!try_lock_list_no_spin(p_list_2)) {
            unlock_list(p_list);
            RCU_READER_UNLOCK();
            PAUSE();
            goto retry_lock;
        }
        is_zip_locked = 1;
    }
    p_node = rcu_list_find(p_list, val, &p_prev);
    res = (p_node != NULL) && (p_node->val == val);
    if (!res) {
        goto finish;
    }
    if (p_prev == NULL) {
        RCU_ASSIGN_PTR(&(p_list->p_head), p_node->p_next);
        if (p_list->is_zipped) {
            if (p_list->p_unzip_node == NULL) {
                p_list_2->p_head = p_node->p_next;
            } else if (p_list->p_unzip_node == p_node) {
                p_list->p_unzip_node = NULL;
                p_list_2->p_unzip_node = NULL;
                p_list->p_head = p_list_2->p_head;
            } else if (p_list->p_unzip_node->p_next == p_node) {
                p_list->p_unzip_node->p_next = p_node->p_next;
                p_list->p_unzip_node = NULL;
                p_list_2->p_unzip_node = NULL;
                p_list->p_head = p_list_2->p_head;
            }
        }
    } else {
        RCU_ASSIGN_PTR(&(p_prev->p_next), p_node->p_next);
        if (p_list->is_zipped) {
            if (p_list->p_unzip_node != NULL) {
                if (p_list->p_unzip_node == p_node) {
                    p_list->p_unzip_node = p_prev;
                    p_list_2->p_unzip_node = p_prev;
                } else if (p_list->p_unzip_node->p_next == p_node) {
                    p_list->p_unzip_node->p_next = p_node->p_next;
                    if (p_node->p_next != NULL) {
                        if (HASH_VALUE(p_hash_list, p_list->p_unzip_node->val) !=
                            HASH_VALUE(p_hash_list, p_list->p_unzip_node->p_next->val)) {
                            p_list->p_unzip_node = p_prev;
                            p_list_2->p_unzip_node = p_prev;
                        } else {
                            p_list->p_unzip_node = NULL;
                            p_list_2->p_unzip_node = NULL;
                            p_list->is_zipped = 0;
                            p_list_2->is_zipped = 0;
                        }
                    }
                }
            }
        }
    }
finish:
    unlock_list(p_list);
    if (is_zip_locked) {
        unlock_list(p_list_2);
    }
    RCU_READER_UNLOCK();
    return res;
}

RLU version:

int rlu_hash_list_remove(rlu_thread_data_t *self,
                         volatile hash_list_t **p_p_hash_list, val_t val)
{
    int res;
    int bin;
    list_t *p_list;
    node_t *p_prev;
    node_t *p_node;
    volatile hash_list_t *p_hash_list;

    RLU_START(self);
    p_hash_list = *p_p_hash_list;
    bin = HASH_VALUE(p_hash_list, val);
    RLU_WRITER_LOCK(self, bin);
    p_list = RLU_DEREF_LIST(self, p_hash_list->buckets[bin].p_list);
    p_node = rlu_list_find(self, p_list, val, &p_prev);
    res = (p_node != NULL) && (p_node->val == val);
    if (res) {
        if (p_prev == NULL) {
            p_list = (list_t *)RLU_LOCK(self, p_list);
            RLU_ASSIGN_PTR(self, &(p_list->p_head), p_node->p_next);
        } else {
            p_prev = RLU_LOCK(self, p_prev);
            RLU_ASSIGN_PTR(self, &(p_prev->p_next), p_node->p_next);
        }
        RLU_FREE(self, p_node);
    }
    RLU_WRITER_UNLOCK(self);
    RLU_FINISH(self);
    return res;
}
Resizable Hash Table Performance
[Chart: resizable hash table with 2^16 total items, resizing between 2^13 and 2^14 buckets; throughput (total ops) vs. number of threads for RCU 8K, RCU 16K, RCU 8K-16K, RLU 8K, RLU 16K, and RLU 8K-16K.]