Synchronization and Scaling Challenges
Study the impact of synchronization on scaling, exploring hardware artifacts and higher-level primitives. Learn how synchronization ensures coordination, how it can hinder scalability, and the overheads that arise as systems scale. Understand basic hardware artifacts such as CAS, TAS, and FAI, and synchronization primitives such as mutexes, semaphores, and spin locks.
Presentation Transcript
Everything you still don't know about synchronization and how it scales
Anshu Raina, Suhas Pai, Mohit Verma, Vikas Goel, Yuvraj Patel
Motivation
Why do we need synchronization?
- Many-core architectures are now a common phenomenon, and it is challenging to scale systems on them.
- Synchronization is crucial for coordination and correctness, but it is also a hindrance to scalability.
- Synchronization is ensured using primitives, which rely on hardware artifacts; sometimes the gory details of the hardware are not known.
- It is hard to predict whether an application will scale with respect to a specific synchronization scheme.
What are we trying to study?
We study synchronization under scaling:
- How do the various hardware artifacts scale?
- How do the higher-level synchronization primitives scale?
- Does the hardware architecture impact performance?
- What overheads pop up while scaling?
How do you synchronize?
- Basic hardware artifacts: CAS, TAS, FAI, and other atomic instructions.
- Higher-level primitives: mutexes, semaphores, spin locks, condition variables, barriers. Each serves a different purpose, but all maintain a structure shared by the threads that use them, and they use the hardware artifacts above to update that shared structure atomically.
Synchronization Primitives 101
Basic hardware artifacts (on x86):
- CAS: uses lock cmpxchg
- TAS: uses xchg
- FAI: uses lock xadd
Higher-level synchronization primitives:
- Mutex: ensures mutual exclusion; ownership is crucial; lock/unlock interface.
- Semaphore: a signaling mechanism; ownership is not important; wait/post (signal) interface.
- Spinlock: a locking mechanism, generally used for small critical sections.
- Futex: used for performance; it avoids syscalls when acquiring locks, and a syscall is made only under contention.
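As a rough illustration, the three hardware artifacts above can be expressed with GCC/Clang atomic builtins, which on x86-64 typically compile to the listed instructions. This is a minimal sketch for reference, not the harness used in the experiments; the wrapper names are ours.

```c
/* Sketch: CAS, TAS, and FAI expressed with GCC/Clang __atomic builtins.
 * On x86-64 these typically lower to lock cmpxchg, xchg, and lock xadd. */
#include <stdint.h>
#include <stdbool.h>

/* CAS: atomically set *p to newv only if it still equals expected. */
static bool cas(uint64_t *p, uint64_t expected, uint64_t newv) {
    return __atomic_compare_exchange_n(p, &expected, newv, false,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

/* TAS: atomically write 1 and return the previous value. */
static uint64_t tas(uint64_t *p) {
    return __atomic_exchange_n(p, 1, __ATOMIC_SEQ_CST);
}

/* FAI: atomically add 1 and return the previous value. */
static uint64_t fai(uint64_t *p) {
    return __atomic_fetch_add(p, 1, __ATOMIC_SEQ_CST);
}
```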
Experiments
Parameters:
- Different configurations: intra-socket, inter-socket, hyperthreading.
- Thread scaling: 1, 2, 4, 8, 14, 28, 56 threads. 28 and 56 threads are not run for the intra-socket configuration; 56 threads are not run for the intra-socket and inter-socket configurations.
- Vary the critical section (CS) size. Pseudo-code for the CS: FOR (0 .. LOOP_COUNT) { count := count + 1 }, where count is volatile. Experiments are done for LOOP_COUNT = 100, 1000, 10000.
Layered study (see the sketch after this slide):
- Basic hardware artifacts: CAS, FAI, TAS.
- Higher-level synchronization primitives (musl library): mutex, semaphore, spinlocks.
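Read literally, the pseudo-code above corresponds to a loop like the one below. This is a minimal sketch of our reading of the benchmark, assuming a pthread mutex guards the counter; names such as LOOP_COUNT and worker are illustrative, not taken from the actual harness.

```c
/* Sketch of the benchmarked critical section: each thread takes a lock,
 * bumps a shared volatile counter LOOP_COUNT times, and releases the lock.
 * Timing the lock/unlock pair gives the per-acquisition latency. */
#include <pthread.h>

#define LOOP_COUNT 1000          /* varied: 100, 1000, 10000 */

static volatile long count;      /* volatile so increments are not optimized away */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    for (int i = 0; i < LOOP_COUNT; i++)
        count = count + 1;       /* the critical section being timed */
    pthread_mutex_unlock(&lock);
    return NULL;
}
```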
Platform/Architecture
Intel Xeon E5 v3 (Haswell-EP):
- 2 sockets, 14 physical active cores per socket (possibly using an 18-core die); hyperthreaded, 2 threads per core.
- 8 cores, 8 L3 slices, and 1 memory controller are connected to one bi-directional ring; the remaining cores are connected to another bi-directional ring. The ring topology is hidden from the OS in the default configuration.
- Cluster-on-Die (COD) splits the processor into 2 clusters, so the topology should have 4 NUMA nodes (but we are seeing only 2 NUMA nodes; enabling COD also does not show 4 NUMA nodes).
Cache coherence mechanisms:
- MESIF implementation, realized by Caching Agents (CAs) within each L3 slice and Home Agents (HAs) with the memory controller.
- Modes: Source mode (enabled by default), Home mode, Cluster-on-Die mode.
How Do Atomic Instructions Scale?
- How do atomic instructions scale with varying contention?
- Does the placement of threads affect scaling? Single socket, with or without hyperthreading; two sockets, with or without hyperthreading.
- How do different atomic instructions vary in latency?
- Locks are implemented using these atomics: spin locks, mutexes, and semaphores use CAS (see the sketch below).
- Does the coherence state of the cache line affect the latencies of operations?
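To make "locks are implemented using these atomics" concrete, here is a minimal spinlock built directly on CAS. This is an illustrative sketch, not the musl code that is measured later.

```c
/* Minimal spinlock built on CAS, showing how the locks studied here
 * bottom out in a single atomic instruction. */
#include <stdint.h>
#include <stdbool.h>

typedef struct { uint32_t word; } spinlock_t;   /* 0 = free, 1 = held */

static void spin_lock(spinlock_t *l) {
    uint32_t expected = 0;
    /* Every failed CAS pulls the cache line over in exclusive state,
     * which is why contended latency grows with thread count. */
    while (!__atomic_compare_exchange_n(&l->word, &expected, 1, false,
                                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
        expected = 0;            /* the builtin overwrote it with the observed value */
    }
}

static void spin_unlock(spinlock_t *l) {
    __atomic_store_n(&l->word, 0, __ATOMIC_RELEASE);
}
```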
Atomics: latency trends with increasing threads (chart)
Effect of thread placement on latencies (chart)
Effect of thread placement on a single CAS (chart)
Insights
- Latencies of all instructions increase linearly with increasing contention.
- Threads placed on hyperthreaded cores provide improved performance.
- The effects of hyperthreading are more pronounced when threads are on different sockets.
- CAS latency can be very large if threads are placed across sockets (2x more!). This is significant because CAS is widely used to implement locks (spin locks, mutexes). More on this in the slides that follow!
Spinlocks, Mutexes and Binary Semaphores
- What should I use if my critical section is small?
- Does the number of threads in my application matter?
- Does thread placement matter?
- What are the worst and best performance I can get?
Binary Semaphore/Mutex Behavior as Critical Section Size Changes
- Spinlocks are usually used when the critical section is small. What about binary semaphores and mutexes? See for yourself!
(Chart slides: NHT2s_100, NHT2s_1000, and NHT2s_10000 configurations.)
General behavior as critical section size changes
We looked at what happens when 14 threads contend at once.
- When the CS is large, every thread calls the syscall once. How are the threads woken up? FCFS!
- When the CS is small, there is no contention: the CS is over before the other threads are even scheduled!
- When the CS size is intermediate, some threads call the syscall more than once. Since the CS is not big enough, some threads that weren't even scheduled yet start contending with the thread that was just woken up. Being woken up FCFS does not imply entering the CS FCFS.
How does spinlock scaling vary as the number of threads varies?
- Spinlocks are mostly used with small CS sizes. How does their performance vary with the number of threads?
- They do not scale well, even if the CS is small; they are actually worse than mutexes and binary semaphores.
- Why? The back-off in mutex and semaphore (sketched below). More later!
(Charts: max latency of spinlocks vs. number of threads, for critical sections of 100 and 1000 loop count, in the NHT1s, NHT2s, and HT configurations.)
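The back-off referred to above is, in generic terms, something like the following test-and-test-and-set loop with bounded exponential back-off. This is a sketch of the general technique, assuming an x86 target for _mm_pause(); it is not the implementation that was benchmarked.

```c
/* Sketch of why back-off helps: test-and-test-and-set with bounded
 * exponential back-off instead of hammering the lock word with atomics. */
#include <stdint.h>
#include <immintrin.h>           /* _mm_pause() on x86 */

static void spin_lock_backoff(uint32_t *word) {
    unsigned delay = 1;
    for (;;) {
        /* Spin on a plain load first so waiters share the cache line. */
        while (__atomic_load_n(word, __ATOMIC_RELAXED) != 0)
            _mm_pause();
        if (__atomic_exchange_n(word, 1, __ATOMIC_ACQUIRE) == 0)
            return;              /* TAS succeeded: lock acquired */
        /* Lost the race: back off before trying again. */
        for (unsigned i = 0; i < delay; i++)
            _mm_pause();
        if (delay < 1024)
            delay *= 2;
    }
}
```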
How does mutex/semaphore scale with the number of threads? (charts)
Why don't they scale well? What's going on inside?
Mutex lock path:
1. Try CAS to get the lock.
2. Fail? Try CAS again to get the lock.
3. Fail? Spin for some time if there are other waiters too!
4. Try CAS on the lock again.
5. Fail? Register yourself as a waiter and make a syscall to the futex.
(See the sketch below.)
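A hedged sketch of that lock path, in the style of a futex-based mutex. This is our simplification, not musl's actual source; states are 0 = unlocked, 1 = locked, 2 = locked with waiters.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

static void futex_wait(uint32_t *addr, uint32_t val) {
    /* Sleep only if *addr still equals val; woken by FUTEX_WAKE on unlock. */
    syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, val, NULL, NULL, 0);
}

static void mutex_lock(uint32_t *m) {
    uint32_t zero;
    /* Steps 1 and 2: two quick CAS attempts on the uncontended path. */
    for (int attempt = 0; attempt < 2; attempt++) {
        zero = 0;
        if (__atomic_compare_exchange_n(m, &zero, 1, false,
                                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
            return;
    }
    /* Steps 3 and 4: spin for a bounded time, retrying trylock. */
    for (int spins = 0; spins < 100; spins++) {
        zero = 0;
        if (__atomic_compare_exchange_n(m, &zero, 1, false,
                                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
            return;
    }
    /* Step 5: mark the lock contended and sleep in the kernel;
     * unlock stores 0 and issues FUTEX_WAKE when the old value was 2. */
    while (__atomic_exchange_n(m, 2, __ATOMIC_ACQUIRE) != 0)
        futex_wait(m, 2);
}
```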
Why don't they scale well? What's going on inside?
Semaphore wait path:
1. Check the semaphore to see if you can enter the critical section.
2. Fail? Try CAS in a loop until you successfully update the semaphore.
3. Fail? Spin for some time if there are other waiters too!
4. Try CAS to update the semaphore again.
5. Fail? Register yourself as a waiter, CAS to update the semaphore, and make a syscall to the futex.
(See the sketch below.)
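A similarly hedged sketch of the wait path, assuming the semaphore value counts available slots and a separate waiter count; the names and the exact structure are ours, not musl's.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

static void sem_wait_sketch(int *val, int *waiters) {
    for (;;) {
        int v = __atomic_load_n(val, __ATOMIC_RELAXED);
        /* Try to decrement the count with CAS while it is positive. */
        while (v > 0) {
            if (__atomic_compare_exchange_n(val, &v, v - 1, false,
                                            __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
                return;          /* got a slot: enter the critical section */
            /* CAS failed: v now holds the observed value; retry. */
        }
        /* Count exhausted: register as a waiter and sleep until a post. */
        __atomic_fetch_add(waiters, 1, __ATOMIC_RELAXED);
        syscall(SYS_futex, val, FUTEX_WAIT_PRIVATE, 0, NULL, NULL, 0);
        __atomic_fetch_sub(waiters, 1, __ATOMIC_RELAXED);
    }
}
```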
Why don't they scale well? What's going on inside?
(Charts: mutex lock overhead breakdown per thread ID for 14 threads, intra-socket with HT and intra-socket without HT; latency in ns, split into 1st CAS, 2nd CAS, trylock (3rd CAS), spin time, and while-loop trylock.)
Why don't they scale well? What's going on inside?
(Chart: mutex lock overhead breakdown per thread ID for 14 threads, inter-socket without HT; same breakdown stages as above.)
How does the behavior change as thread placement varies? (For 14 threads)
Mutex: % of threads completing at each stage (the first five columns are percentages of the 14 threads)

Config        1st CAS   2nd CAS   3rd CAS   While-loop trylock   Syscall   Spin didn't complete   No. of syscalls
NHT1s_1000    21.43     0         0         0                    78.57     1                      7-1/4-2
NHT2s_1000    14.29     0         0         7.14                 78.57     0                      4-1/6-2/1-3
HT_1000       21.43     0         0         14.29                64.29     3                      6-1/3-2
NHT1s_100     85.71     0         0         14.29                0         2                      0
NHT2s_100     42.86     0         14.29     42.86                0         2                      0
HT_100        78.57     7.14      0         14.29                0         0                      0

- The semaphore shows the same behavior.
- It might make sense not to spin in the mutex/semaphore when the critical section is large: most threads block in the syscall even after spinning.
How does the behavior change as thread placement varies?
Spinlock: spin count variation with thread placement

Config   CS size   Max spin count
NHT1s    1000      3719
NHT2s    1000      4275
HT1s     1000      3242
NHT1s    100       50
NHT2s    100       151
HT       100       28
Variation of max and min overheads as the number of threads varies
(Charts: semaphore wait latency for critical section size 1000.)
Variation of max and min overheads as the number of threads varies
- For 14 threads, semaphore wait latency across sockets is worse.
- For a smaller number of threads, the inter-socket wait latency can be smaller: the timeline shows that threads across sockets are scheduled late, resulting in less contention.
- If threads are on hyperthreaded cores, the wait latency can be smaller compared to the wait latency of threads on non-hyperthreaded cores.
- The behavior of the mutex is similar to that of the semaphore.
How do mutexes, spinlocks, and binary semaphores compare with each other? (Critical section: 100 LOOP_COUNT)
- Mutexes and binary semaphores have similar latency.
- Do not use spin locks for a small CS if there are lots of threads.
What about semaphore post and mutex unlock?
- Post/unlock latency increases linearly with scale.
Other Observations
- Locks are granted to threads in a clustered fashion: in the inter-socket experiments, the lock is acquired by threads belonging to the same socket (3-4 threads in one go).
Conclusion
- Synchronization is hard.
- The basic hardware artifacts are closely tied to the software synchronization primitives.
- Inter-socket performance is usually worse than same-socket and hyperthreaded performance.
- To get the best performance from software, you should know everything about the architecture. But if you have a Haswell-EP machine, use our slides!