Performance Analysis of Synchronization Methods in Concurrent Data Structures

 

Explore the impact of synchronization methods on the performance and behavior of concurrent data structures in multithreaded applications. The study implements concurrent queues and hash tables with coarse-grain locking, fine-grain locking, and lock-free mechanisms, and assesses their throughput, fairness, and scalability across several programming languages and NUMA architectures. Key areas of investigation include the contention points of queue and hash table data structures. The research aims to provide insight into choosing synchronization methods for efficient multicore systems.

  • Synchronization Methods
  • Concurrent Data Structures
  • Multithreading
  • Performance Analysis
  • Scalability

Presentation Transcript


  1. Behavior of Synchronization Methods in Commonly Used Languages and Systems. Yiannis Nikolakopoulos, ioaniko@chalmers.se. Joint work with: D. Cederman, B. Chatterjee, N. Nguyen, M. Papatriantafilou, P. Tsigas. Distributed Computing and Systems, Chalmers University of Technology, Gothenburg, Sweden.

  2. Developing a multithreaded application: multicores are everywhere, the boss wants .NET, Java is nice, and the client wants speed (C++?).

  3. Developing a multithreaded application: the worker threads need to access data, so we need concurrent data structures, and then we need synchronization.

  4. Implementing Concurrent Data Structures: the implementation can become the performance bottleneck. Options include coarse-grain locking, fine-grain locking, test-and-set, array locks, and more.
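
The coarse-grain case is worth seeing in code: a single mutex guards every operation, so all threads contend on one lock no matter how large the structure is. A minimal sketch in C++ (the class name and the use of std::mutex/std::queue are illustrative choices, not the study's implementation):

// Coarse-grain locking: one mutex protects the whole queue. Illustrative sketch only.
#include <mutex>
#include <optional>
#include <queue>

template <typename T>
class CoarseQueue {
    std::queue<T> items_;
    std::mutex lock_;   // the single (coarse) contention point
public:
    void enqueue(T v) {
        std::lock_guard<std::mutex> guard(lock_);
        items_.push(std::move(v));
    }
    std::optional<T> dequeue() {
        std::lock_guard<std::mutex> guard(lock_);
        if (items_.empty()) return std::nullopt;
        T v = std::move(items_.front());
        items_.pop();
        return v;
    }
};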

  5. Implementing Concurrent Data Structures: coarse-grain locking, fine-grain locking, test-and-set, array locks, lock-free designs, and more, all running on top of a runtime system and a hardware platform. Which is the fastest/most scalable?
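
Two of the lock flavors compared throughout the slides differ only in how they spin. A test-and-set (TAS) lock spins directly on the atomic exchange, so every attempt causes coherence traffic; a test-and-test-and-set (TTAS) lock first spins on a plain read and only attempts the exchange once the lock looks free. A minimal C++ sketch (the study's actual implementations may differ in details such as backoff):

// TAS vs TTAS spin locks. Illustrative sketch only.
#include <atomic>

class TASLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        // Spin on the exchange itself: every failed attempt still writes the cache line.
        while (locked_.exchange(true, std::memory_order_acquire)) { /* spin */ }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};

class TTASLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        for (;;) {
            // First spin on a plain read (stays in the local cache), then try the exchange.
            while (locked_.load(std::memory_order_relaxed)) { /* spin */ }
            if (!locked_.exchange(true, std::memory_order_acquire)) return;
        }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};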

  6. Implementing concurrent data structures.

  7. Problem Statement: how the interplay of the above parameters and the different synchronization methods affects the performance and the behavior of concurrent data structures.

  8. Outline: Introduction; Experiment Setup; Highlights of Study and Results; Conclusion.

  9. Which data structures to study? Ones that represent different levels of contention: a queue (1 or 2 contention points) and a hash table (multiple contention points).
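
The contention-point argument is easiest to see in code. A queue funnels every thread through its head and tail, whereas a hash table with one lock per bucket spreads threads over many independent contention points, each guarding its own chain (the "linked list component" mentioned later). A per-bucket-locking sketch, where the names and the fixed bucket count are illustrative assumptions:

// Hash table with one lock per bucket: contention is spread over the buckets.
// Illustrative sketch, not the study's implementation.
#include <cstddef>
#include <forward_list>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

template <typename K, typename V>
class BucketLockedMap {
    struct Bucket {
        std::mutex lock;                          // per-bucket contention point
        std::forward_list<std::pair<K, V>> chain; // the bucket's linked-list component
    };
    std::vector<Bucket> buckets_;
    Bucket& bucket_for(const K& key) {
        return buckets_[std::hash<K>{}(key) % buckets_.size()];
    }
public:
    explicit BucketLockedMap(std::size_t nbuckets = 64) : buckets_(nbuckets) {}
    void insert(const K& key, V value) {
        Bucket& b = bucket_for(key);
        std::lock_guard<std::mutex> guard(b.lock);
        for (auto& kv : b.chain)
            if (kv.first == key) { kv.second = std::move(value); return; }
        b.chain.emplace_front(key, std::move(value));
    }
    bool lookup(const K& key, V& out) {
        Bucket& b = bucket_for(key);
        std::lock_guard<std::mutex> guard(b.lock);
        for (auto& kv : b.chain)
            if (kv.first == key) { out = kv.second; return true; }
        return false;
    }
};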

  10. How do we choose an implementation? Possible criteria: framework dependencies, programmability, “good” performance.

  11. Interpreting “good”. Throughput: the more operations completed per time unit, the better. Is this enough?

  12. Non-fairness.

  13. What to measure? Throughput: data structure operations completed per time unit. Fairness: with n_i the number of operations completed by thread i and n̄ the average number of operations per thread, fairness = min( min_i(n_i) / n̄ , n̄ / max_i(n_i) ).
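
Assuming the formula above is reconstructed correctly from the slide's labels (operations per thread and the per-thread average), fairness is straightforward to compute from the per-thread operation counts gathered in a measurement interval. The helper below is a hypothetical illustration, not code from the study:

// Fairness from per-thread operation counts, assuming
// fairness = min( min_i(n_i)/avg , avg/max_i(n_i) ); 1.0 means perfectly fair.
#include <algorithm>
#include <cstdint>
#include <vector>

double fairness(const std::vector<std::uint64_t>& ops_per_thread) {
    if (ops_per_thread.empty()) return 1.0;
    auto [mn, mx] = std::minmax_element(ops_per_thread.begin(), ops_per_thread.end());
    double sum = 0.0;
    for (auto n : ops_per_thread) sum += static_cast<double>(n);
    double avg = sum / ops_per_thread.size();
    if (*mx == 0) return 1.0;   // no thread completed any operation
    return std::min(static_cast<double>(*mn) / avg, avg / static_cast<double>(*mx));
}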

  14. Implementation Parameters. Programming environments and synchronization methods: C++ (TAS, TTAS, lock-free, array lock, PMutex, lock-free memory management); Java (TAS, TTAS, lock-free, array lock, Reentrant lock, synchronized); C# on .NET and Mono (TAS, TTAS, lock-free, array lock, the lock construct, Mutex). NUMA architectures: Intel Nehalem, 2 x 6 cores (24 HW threads); AMD Bulldozer, 4 x 12 cores (48 HW threads). Do they influence fairness?
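
Of the listed methods, the array lock is probably the least familiar: each waiting thread spins on its own array slot (padded to a cache line), and the releasing thread hands the lock to the next slot, which gives near-FIFO ordering and far less coherence traffic than TAS or TTAS. A minimal Anderson-style sketch, assuming the number of competing threads never exceeds the slot count (slot count and names are illustrative):

// Anderson-style array lock. Correct only while at most kSlots threads compete.
// Illustrative sketch only.
#include <atomic>
#include <cstddef>

class ArrayLock {
    static constexpr std::size_t kSlots = 64;
    struct alignas(64) Slot { std::atomic<bool> go{false}; }; // one cache line per slot
    Slot slots_[kSlots];
    std::atomic<std::size_t> next_{0};
public:
    ArrayLock() { slots_[0].go.store(true); }
    std::size_t lock() {
        std::size_t my = next_.fetch_add(1, std::memory_order_relaxed) % kSlots;
        while (!slots_[my].go.load(std::memory_order_acquire)) { /* spin on own slot */ }
        return my;   // the caller passes this slot index to unlock()
    }
    void unlock(std::size_t my) {
        slots_[my].go.store(false, std::memory_order_relaxed);                // re-arm my slot
        slots_[(my + 1) % kSlots].go.store(true, std::memory_order_release);  // hand over
    }
};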

  15. Experiment Parameters: different levels of contention, number of threads, measured time intervals.
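
A rough sketch of the kind of measurement loop these parameters describe: spawn a given number of threads, let each one hammer a shared object for a fixed time interval, and record per-thread operation counts, from which both throughput and the fairness value defined earlier can be derived. Everything here (names, the atomic counter standing in for a concurrent data structure, the 0.6 s interval) is an illustrative assumption:

// Measurement-loop sketch: N threads operate on a shared object for a fixed
// interval; per-thread counts give throughput and fairness. Illustrative only.
#include <algorithm>
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const unsigned nthreads = std::max(2u, std::thread::hardware_concurrency());
    const auto interval = std::chrono::milliseconds(600);  // e.g. the 0.6 s interval
    std::atomic<bool> stop{false};
    std::atomic<std::uint64_t> shared{0};   // stand-in for a concurrent data structure
    std::vector<std::uint64_t> ops(nthreads, 0);
    std::vector<std::thread> workers;

    for (unsigned i = 0; i < nthreads; ++i)
        workers.emplace_back([&, i] {
            while (!stop.load(std::memory_order_relaxed)) {
                shared.fetch_add(1, std::memory_order_relaxed); // the contended "operation"
                ++ops[i];                                       // private per-thread count
            }
        });

    std::this_thread::sleep_for(interval);
    stop.store(true);
    for (auto& t : workers) t.join();

    std::uint64_t total = 0;
    for (auto n : ops) total += n;
    std::cout << "throughput: " << total / static_cast<double>(interval.count())
              << " ops/ms with " << nthreads << " threads\n";
    // ops[] can now be fed to a fairness computation like the one sketched earlier.
}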

  16. Outline: Introduction; Experiment Setup; Highlights of Study and Results, covering the queue (fairness, Intel vs AMD, throughput vs fairness) and the hash table (Intel vs AMD, scalability); Conclusion.

  17. Observations: Queue. Fairness can change along different time intervals. [Chart: C# (.NET), 24 threads, high contention; fairness (0 to 1) vs measurement interval (400 to 10000 ms) for Intel and AMD with TAS and lock-free queues.]

  18. Observations: Queue. Significantly different fairness behavior in different architectures. [Chart: Java, 24 threads, high contention; fairness vs measurement interval (ms) for Intel with TAS, TTAS, synchronized, and lock-free queues.]

  19. Observations: Queue. Significantly different fairness behavior in different architectures; the lock-free queue is less affected in this case. [Chart: Java, 24 threads, high contention; fairness vs measurement interval (ms) for Intel and AMD with TAS, TTAS, synchronized, and lock-free queues.]

  20. Queue: Throughput vs Fairness. [Charts: C++ on Intel; throughput (operations per ms, thousands) and fairness at the 0.6 s interval vs number of threads (2 to 48) for TTAS, lock-free, and PMutex queues.]
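
For context, the lock-free queue in comparisons like this is typically a CAS-based design in the Michael-Scott style (an assumption here, not a detail stated on the slide). The sketch below shows only the enqueue/dequeue idea and deliberately leaks dequeued nodes, because safe reclamation is exactly the "lock-free memory management" issue that the C++ results bring up later:

// CAS-based (Michael-Scott style) lock-free queue sketch. Dequeued nodes are
// leaked on purpose: safe reclamation ("lock-free memory management") is omitted.
#include <atomic>
#include <utility>

template <typename T>
class LockFreeQueue {
    struct Node {
        T value{};
        std::atomic<Node*> next{nullptr};
    };
    std::atomic<Node*> head_;
    std::atomic<Node*> tail_;
public:
    LockFreeQueue() {
        Node* dummy = new Node();   // sentinel node
        head_.store(dummy);
        tail_.store(dummy);
    }
    void enqueue(T v) {
        Node* n = new Node();
        n->value = std::move(v);
        for (;;) {
            Node* last = tail_.load();
            Node* next = last->next.load();
            if (last != tail_.load()) continue;            // tail moved, retry
            if (next == nullptr) {
                if (last->next.compare_exchange_weak(next, n)) {
                    tail_.compare_exchange_weak(last, n);   // swing tail (best effort)
                    return;
                }
            } else {
                tail_.compare_exchange_weak(last, next);    // help a lagging tail
            }
        }
    }
    bool dequeue(T& out) {
        for (;;) {
            Node* first = head_.load();
            Node* last = tail_.load();
            Node* next = first->next.load();
            if (first != head_.load()) continue;
            if (first == last) {
                if (next == nullptr) return false;          // queue is empty
                tail_.compare_exchange_weak(last, next);    // help a lagging tail
            } else {
                out = next->value;
                if (head_.compare_exchange_weak(first, next))
                    return true;                            // 'first' is never reclaimed here
            }
        }
    }
};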

  21. Observations: Hash table. Operations are distributed over different buckets, and things get interesting when #threads > #buckets. There is a tradeoff between throughput and fairness, with different winners and losers, and contention is lowered in the linked-list components.

  22. Observations: Hash table. Fairness differences across architectures. [Chart: C# (Mono), 24 threads, high contention; fairness vs measurement interval (ms) for Intel with TAS, TTAS, and lock-free hash tables.]

  23. Observations: Hash table. Fairness differences across architectures; lock-free is again not affected. [Chart: C# (Mono), 24 threads, high contention; fairness vs measurement interval (ms) for Intel and AMD with TAS, TTAS, and lock-free hash tables.]

  24. Observations: Hash table. In C++, custom memory management and lock-free implementations excel in scalability and performance. [Charts: successful operations per ms (thousands) vs number of threads (2 to 48); C++ with TAS, TTAS, array lock, PMutex, lock-free, and lock-free with memory management; Java with TAS, TTAS, array lock, synchronized, Reentrant, fair Reentrant, and lock-free.]

  25. Conclusion. Which is the fastest/most scalable? Complex synchronization mechanisms (PMutex, Reentrant lock) pay off in heavily contended hot spots. Scalability comes from more complex, inherently parallel designs and implementations, such as the lock-free hash table. There is a tradeoff between throughput and fairness (Reentrant lock vs array lock vs lock-free queue). Is fairness influenced by NUMA? Fairness can be heavily influenced by the hardware, with interesting exceptions.
