Performance Analysis of Synchronization Methods in Concurrent Data Structures
This study explores how synchronization methods affect the performance and behavior of concurrent data structures in multithreaded applications. It implements concurrent queues and hash tables using coarse-grain locking, fine-grain locking, and lock-free mechanisms, and assesses their throughput, fairness, and scalability across several programming languages and hardware architectures. Key areas of investigation include the contention points of the queue and hash table data structures. The aim is to provide insight into choosing synchronization methods for efficient multithreaded systems.
Presentation Transcript
Behavior of Synchronization Methods in Commonly Used Languages and Systems. Yiannis Nikolakopoulos (ioaniko@chalmers.se), Distributed Computing and Systems, Chalmers University of Technology, Gothenburg, Sweden. Joint work with: D. Cederman, B. Chatterjee, N. Nguyen, M. Papatriantafilou, P. Tsigas.
Developing a multithreaded application: the boss wants .NET, Java is nice, the client wants speed (C++?), and multicores are everywhere.
Developing a multithreaded application: the worker threads need to access shared data, so we need synchronization, that is, concurrent data structures.
Implementing concurrent data structures: the implementation itself can become a performance bottleneck. The options include coarse-grain locking, fine-grain locking, test-and-set locks, array locks, lock-free designs, and more, and the runtime system and hardware platform also play a role. Which is the fastest/most scalable?
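As a concrete illustration of the simplest spin locks in the list above, here is a minimal sketch of a test-and-set (TAS) lock and its test-and-test-and-set (TTAS) variant in Java, using AtomicBoolean. The class and method names are illustrative, not taken from the study's code.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Test-and-set (TAS) spin lock: every acquisition attempt performs an
// atomic exchange, which keeps the lock's cache line bouncing between cores.
class TasLock {
    private final AtomicBoolean held = new AtomicBoolean(false);

    void lock() {
        while (held.getAndSet(true)) {
            // spin: the exchange returned true, so another thread holds the lock
        }
    }

    void unlock() {
        held.set(false);
    }
}

// Test-and-test-and-set (TTAS) variant: spin on a plain read first and
// attempt the atomic exchange only when the lock appears to be free.
class TtasLock {
    private final AtomicBoolean held = new AtomicBoolean(false);

    void lock() {
        while (true) {
            while (held.get()) {
                // spin locally until the lock looks free
            }
            if (!held.getAndSet(true)) {
                return;
            }
        }
    }

    void unlock() {
        held.set(false);
    }
}
```

Under high contention the TTAS read loop generates less coherence traffic than repeated atomic exchanges, which is one reason the two variants behave differently in the measurements that follow.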
Problem statement: how do the above parameters and the different synchronization methods interplay to affect the performance and behavior of concurrent data structures?
Outline: Introduction, Experiment Setup, Highlights of Study and Results, Conclusion.
Which data structures to study? They represent different levels of contention: a queue has 1 or 2 contention points, a hash table has multiple contention points.
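To make the contention points concrete, below is a minimal sketch of a two-lock linked-list queue in the spirit of Michael and Scott's algorithm: enqueuers contend only on a tail lock and dequeuers only on a head lock, so the structure has exactly two contention points. The code is illustrative (the names and the choice of ReentrantLock are mine), not the implementation measured in the study.

```java
import java.util.concurrent.locks.ReentrantLock;

// Two-lock queue sketch: the head lock and the tail lock are the structure's
// two contention points; a hash table, by contrast, spreads contention over
// one point per bucket.
class TwoLockQueue<T> {
    private static final class Node<T> {
        final T value;
        Node<T> next;
        Node(T value) { this.value = value; }
    }

    private Node<T> head = new Node<>(null); // dummy node
    private Node<T> tail = head;
    private final ReentrantLock headLock = new ReentrantLock();
    private final ReentrantLock tailLock = new ReentrantLock();

    void enqueue(T value) {
        Node<T> node = new Node<>(value);
        tailLock.lock();            // contention point 1: the tail
        try {
            tail.next = node;
            tail = node;
        } finally {
            tailLock.unlock();
        }
    }

    T dequeue() {
        headLock.lock();            // contention point 2: the head
        try {
            Node<T> first = head.next;
            if (first == null) return null;   // queue is empty
            head = first;                     // old dummy node is discarded
            return first.value;
        } finally {
            headLock.unlock();
        }
    }
}
```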
How do we choose an implementation? Possible criteria: framework dependencies, programmability, good performance.
Interpreting "good". Throughput: the more operations completed per time unit, the better. Is this enough?
Non-fairness.
What to measure? Throughput: data structure operations completed per time unit. Fairness: with n_i the operations completed by thread i and n̄ the average operations per thread, fairness = min( min_i(n_i) / n̄ , n̄ / max_i(n_i) ).
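A small illustrative helper (names are mine) showing how the two metrics can be computed from per-thread operation counters collected over one measurement interval, assuming the fairness definition reconstructed above.

```java
// Illustrative metric computation from per-thread operation counts
// gathered over one measurement interval.
final class Metrics {
    // Throughput: operations completed per time unit.
    static double throughput(long[] opsPerThread, double intervalMs) {
        long total = 0;
        for (long n : opsPerThread) total += n;
        return total / intervalMs;          // operations per millisecond
    }

    // Fairness as reconstructed above: 1.0 means every thread completed the
    // same number of operations, values near 0 mean some threads were starved.
    static double fairness(long[] opsPerThread) {
        long min = Long.MAX_VALUE, max = 0, total = 0;
        for (long n : opsPerThread) {
            min = Math.min(min, n);
            max = Math.max(max, n);
            total += n;
        }
        double avg = (double) total / opsPerThread.length;
        if (max == 0) return 1.0;           // no operations completed at all
        return Math.min(min / avg, avg / max);
    }
}
```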
Implementation parameters. Programming environments: C++, Java, C# (.NET, Mono). Synchronization methods: TAS, TTAS, array lock, and lock-free (with lock-free memory management), plus the platform locks PMutex (C++), the lock construct and Mutex (C#), and ReentrantLock and synchronized (Java). NUMA architectures: Intel Nehalem, 2 x 6 cores (24 HW threads); AMD Bulldozer, 4 x 12 cores (48 HW threads). Do they influence fairness?
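For the Java entries in the list above, "synchronized" is the language's intrinsic monitor and "Reentrant" refers to java.util.concurrent.locks.ReentrantLock. A minimal sketch of the same kind of critical section under each (the counter class is a toy example, not the benchmarked data structure):

```java
import java.util.concurrent.locks.ReentrantLock;

// The same critical section guarded by the two lock-based Java
// synchronization methods listed above.
class Counters {
    private long a = 0;
    private long b = 0;
    private final ReentrantLock lock = new ReentrantLock();  // default: non-fair
    // new ReentrantLock(true) would request the "fair" variant instead.

    // Intrinsic monitor: each call acquires the object's monitor.
    synchronized void incrementA() {
        a++;
    }

    // Explicit ReentrantLock: more flexible (tryLock, fairness, timeouts).
    void incrementB() {
        lock.lock();
        try {
            b++;
        } finally {
            lock.unlock();
        }
    }
}
```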
Experiment parameters: different levels of contention, number of threads, measured time intervals.
Outline: Introduction, Experiment Setup, Highlights of Study and Results (Queue: fairness, Intel vs AMD, throughput vs fairness; Hash table: Intel vs AMD, scalability), Conclusion.
Observations: Queue, C# (.NET). Fairness can change along different measurement intervals. [Figure: fairness vs. measurement interval (400 to 10000 ms), 24 threads, high contention; series: Intel and AMD, lock-free and TAS.]
Observations: Queue, Java. Significantly different fairness behavior on the two architectures; the lock-free queue is less affected in this case. [Figure: fairness vs. measurement interval (400 to 10000 ms), 24 threads, high contention; series: Intel and AMD with TAS, TTAS, synchronized, and lock-free.]
Queue: throughput vs fairness (C++, Intel, 0.6 s interval). [Figure: operations per ms (thousands) and fairness vs. number of threads (2 to 48); series: TTAS, lock-free, PMutex.]
Observations: Hash table. Operations are distributed over different buckets; things get interesting when #threads > #buckets. There is a tradeoff between throughput and fairness, with different winners and losers. Contention is lowered in the linked-list components.
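To illustrate why contention is spread over the buckets and lowered inside the per-bucket linked lists, here is a minimal fine-grain (per-bucket lock) hash-set sketch. It is deliberately simplified (fixed bucket count, no resizing) and its names are illustrative, not the study's implementation.

```java
import java.util.LinkedList;
import java.util.concurrent.locks.ReentrantLock;

// Fine-grain locking: one lock per bucket, so threads hashing to different
// buckets never contend with each other. When #threads > #buckets, several
// threads necessarily start sharing the same bucket lock.
class StripedHashSet<T> {
    private final LinkedList<T>[] buckets;
    private final ReentrantLock[] locks;

    @SuppressWarnings("unchecked")
    StripedHashSet(int numBuckets) {
        buckets = new LinkedList[numBuckets];
        locks = new ReentrantLock[numBuckets];
        for (int i = 0; i < numBuckets; i++) {
            buckets[i] = new LinkedList<>();
            locks[i] = new ReentrantLock();
        }
    }

    private int indexFor(T item) {
        return Math.floorMod(item.hashCode(), buckets.length);
    }

    boolean add(T item) {
        int i = indexFor(item);
        locks[i].lock();                 // contention limited to one bucket
        try {
            if (buckets[i].contains(item)) return false;
            buckets[i].add(item);        // traversal stays inside this bucket's list
            return true;
        } finally {
            locks[i].unlock();
        }
    }

    boolean contains(T item) {
        int i = indexFor(item);
        locks[i].lock();
        try {
            return buckets[i].contains(item);
        } finally {
            locks[i].unlock();
        }
    }
}
```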
Observations: Hash table, C# (Mono). Fairness differences across architectures; the lock-free hash table is again not affected. [Figure: fairness vs. measurement interval (400 to 10000 ms), 24 threads, high contention; series: Intel and AMD with TAS, TTAS, and lock-free.]
Observations: Hash table. In C++, custom memory management and lock-free implementations excel in scalability and performance. [Figure: successful operations per ms (thousands) vs. number of threads (2 to 48); C++ series: TAS, TTAS, array lock, PMutex, lock-free, lock-free with custom memory management; Java series: TAS, TTAS, array lock, synchronized, ReentrantLock, fair ReentrantLock, lock-free.]
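As a sketch of the lock-free direction that scales best here, the snippet below inserts at the head of a single bucket's list with a compare-and-set retry loop instead of a lock. In Java the garbage collector stands in for the custom memory management that the C++ lock-free version needs; removal, which is where lock-free memory management really matters, is omitted. Names are illustrative.

```java
import java.util.concurrent.atomic.AtomicReference;

// Lock-free insertion at the head of one bucket's linked list: threads never
// block each other, they simply retry the CAS when a conflict is detected.
class LockFreeBucket<T> {
    private static final class Node<T> {
        final T value;
        final Node<T> next;
        Node(T value, Node<T> next) { this.value = value; this.next = next; }
    }

    private final AtomicReference<Node<T>> head = new AtomicReference<>(null);

    void insert(T value) {
        while (true) {
            Node<T> current = head.get();
            Node<T> node = new Node<>(value, current);
            if (head.compareAndSet(current, node)) {
                return;              // success: no lock was ever taken
            }
            // another thread changed the head concurrently: retry
        }
    }

    boolean contains(T value) {
        for (Node<T> n = head.get(); n != null; n = n.next) {
            if (n.value.equals(value)) return true;
        }
        return false;
    }
}
```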
Conclusion. Which is the fastest/most scalable? Complex synchronization mechanisms (PMutex, ReentrantLock) pay off in heavily contended hot spots. Scalability comes from more complex, inherently parallel designs and implementations, such as the lock-free hash table. There is a tradeoff between throughput and fairness (ReentrantLock vs. array lock vs. the lock-free queue). Fairness can be heavily influenced by the hardware, with interesting exceptions. Is fairness influenced by NUMA?