Performance Analysis of Synchronization Methods in Concurrent Data Structures

 

Explore the impact of synchronization methods on the performance and behavior of concurrent data structures in multithreaded applications. The study implements concurrent queues and hash tables with coarse-grain locking, fine-grain locking, and lock-free mechanisms, and assesses their throughput, fairness, and scalability across several programming languages and NUMA architectures. Key areas of investigation include the contention points of queue and hash table data structures. The research aims to provide insight into choosing synchronization methods for efficient multicore systems.

  • Synchronization Methods
  • Concurrent Data Structures
  • Multithreading
  • Performance Analysis
  • Scalability

Presentation Transcript


  1. Behavior of Synchronization Methods in Commonly Used Languages and Systems. Yiannis Nikolakopoulos, ioaniko@chalmers.se. Joint work with: D. Cederman, B. Chatterjee, N. Nguyen, M. Papatriantafilou, P. Tsigas. Distributed Computing and Systems, Chalmers University of Technology, Gothenburg, Sweden.

  2. Developing a multithreaded application: multicores are everywhere, the boss wants .NET, Java is nice, and the client wants speed (C++?).

  3. Developing a multithreaded application: the worker threads need to access data, so we need concurrent data structures, and then we need synchronization.

  4. Implementing Concurrent Data Structures: the implementation can become the performance bottleneck. Options include coarse-grain locking, fine-grain locking, test-and-set, array locks, and more.
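
The coarse-grain case is worth seeing in code: a single mutex guards every operation, so all threads contend on one lock no matter how large the structure is. A minimal sketch in C++ (the class name and the use of std::mutex/std::queue are illustrative choices, not the study's implementation):

// Coarse-grain locking: one mutex protects the whole queue. Illustrative sketch only.
#include <mutex>
#include <optional>
#include <queue>

template <typename T>
class CoarseQueue {
    std::queue<T> items_;
    std::mutex lock_;   // the single (coarse) contention point
public:
    void enqueue(T v) {
        std::lock_guard<std::mutex> guard(lock_);
        items_.push(std::move(v));
    }
    std::optional<T> dequeue() {
        std::lock_guard<std::mutex> guard(lock_);
        if (items_.empty()) return std::nullopt;
        T v = std::move(items_.front());
        items_.pop();
        return v;
    }
};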

  5. Implementing Concurrent Data Structures: coarse-grain locking, fine-grain locking, test-and-set, array locks, lock-free designs, and more, all running on top of a runtime system and a hardware platform. Which is the fastest/most scalable?
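
Two of the lock flavors compared throughout the slides differ only in how they spin. A test-and-set (TAS) lock spins directly on the atomic exchange, so every attempt causes coherence traffic; a test-and-test-and-set (TTAS) lock first spins on a plain read and only attempts the exchange once the lock looks free. A minimal C++ sketch (the study's actual implementations may differ in details such as backoff):

// TAS vs TTAS spin locks. Illustrative sketch only.
#include <atomic>

class TASLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        // Spin on the exchange itself: every failed attempt still writes the cache line.
        while (locked_.exchange(true, std::memory_order_acquire)) { /* spin */ }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};

class TTASLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        for (;;) {
            // First spin on a plain read (stays in the local cache), then try the exchange.
            while (locked_.load(std::memory_order_relaxed)) { /* spin */ }
            if (!locked_.exchange(true, std::memory_order_acquire)) return;
        }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};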

  6. Implementing concurrent data structures.

  7. Problem Statement: how the interplay of the above parameters and the different synchronization methods affects the performance and the behavior of concurrent data structures.

  8. Outline: Introduction; Experiment Setup; Highlights of Study and Results; Conclusion.

  9. Which data structures to study? Ones that represent different levels of contention: a queue (1 or 2 contention points) and a hash table (multiple contention points).
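
The contention-point argument is easiest to see in code. A queue funnels every thread through its head and tail, whereas a hash table with one lock per bucket spreads threads over many independent contention points, each guarding its own chain (the "linked list component" mentioned later). A per-bucket-locking sketch, where the names and the fixed bucket count are illustrative assumptions:

// Hash table with one lock per bucket: contention is spread over the buckets.
// Illustrative sketch, not the study's implementation.
#include <cstddef>
#include <forward_list>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

template <typename K, typename V>
class BucketLockedMap {
    struct Bucket {
        std::mutex lock;                          // per-bucket contention point
        std::forward_list<std::pair<K, V>> chain; // the bucket's linked-list component
    };
    std::vector<Bucket> buckets_;
    Bucket& bucket_for(const K& key) {
        return buckets_[std::hash<K>{}(key) % buckets_.size()];
    }
public:
    explicit BucketLockedMap(std::size_t nbuckets = 64) : buckets_(nbuckets) {}
    void insert(const K& key, V value) {
        Bucket& b = bucket_for(key);
        std::lock_guard<std::mutex> guard(b.lock);
        for (auto& kv : b.chain)
            if (kv.first == key) { kv.second = std::move(value); return; }
        b.chain.emplace_front(key, std::move(value));
    }
    bool lookup(const K& key, V& out) {
        Bucket& b = bucket_for(key);
        std::lock_guard<std::mutex> guard(b.lock);
        for (auto& kv : b.chain)
            if (kv.first == key) { out = kv.second; return true; }
        return false;
    }
};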

  10. How do we choose an implementation? Possible criteria: framework dependencies, programmability, “good” performance.

  11. Interpreting “good”. Throughput: the more operations completed per time unit, the better. Is this enough?

  12. Non-fairness.

  13. What to measure? Throughput: data structure operations completed per time unit. Fairness: with n_i the number of operations completed by thread i and n̄ the average number of operations per thread, fairness = min( min_i(n_i) / n̄ , n̄ / max_i(n_i) ).
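
Assuming the formula above is reconstructed correctly from the slide's labels (operations per thread and the per-thread average), fairness is straightforward to compute from the per-thread operation counts gathered in a measurement interval. The helper below is a hypothetical illustration, not code from the study:

// Fairness from per-thread operation counts, assuming
// fairness = min( min_i(n_i)/avg , avg/max_i(n_i) ); 1.0 means perfectly fair.
#include <algorithm>
#include <cstdint>
#include <vector>

double fairness(const std::vector<std::uint64_t>& ops_per_thread) {
    if (ops_per_thread.empty()) return 1.0;
    auto [mn, mx] = std::minmax_element(ops_per_thread.begin(), ops_per_thread.end());
    double sum = 0.0;
    for (auto n : ops_per_thread) sum += static_cast<double>(n);
    double avg = sum / ops_per_thread.size();
    if (*mx == 0) return 1.0;   // no thread completed any operation
    return std::min(static_cast<double>(*mn) / avg, avg / static_cast<double>(*mx));
}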

  14. Implementation Parameters. Programming environments and synchronization methods: C++ (TAS, TTAS, lock-free, array lock, PMutex, lock-free memory management); Java (TAS, TTAS, lock-free, array lock, Reentrant lock, synchronized); C# on .NET and Mono (TAS, TTAS, lock-free, array lock, the lock construct, Mutex). NUMA architectures: Intel Nehalem, 2 x 6 cores (24 HW threads); AMD Bulldozer, 4 x 12 cores (48 HW threads). Do they influence fairness?
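
Of the listed methods, the array lock is probably the least familiar: each waiting thread spins on its own array slot (padded to a cache line), and the releasing thread hands the lock to the next slot, which gives near-FIFO ordering and far less coherence traffic than TAS or TTAS. A minimal Anderson-style sketch, assuming the number of competing threads never exceeds the slot count (slot count and names are illustrative):

// Anderson-style array lock. Correct only while at most kSlots threads compete.
// Illustrative sketch only.
#include <atomic>
#include <cstddef>

class ArrayLock {
    static constexpr std::size_t kSlots = 64;
    struct alignas(64) Slot { std::atomic<bool> go{false}; }; // one cache line per slot
    Slot slots_[kSlots];
    std::atomic<std::size_t> next_{0};
public:
    ArrayLock() { slots_[0].go.store(true); }
    std::size_t lock() {
        std::size_t my = next_.fetch_add(1, std::memory_order_relaxed) % kSlots;
        while (!slots_[my].go.load(std::memory_order_acquire)) { /* spin on own slot */ }
        return my;   // the caller passes this slot index to unlock()
    }
    void unlock(std::size_t my) {
        slots_[my].go.store(false, std::memory_order_relaxed);                // re-arm my slot
        slots_[(my + 1) % kSlots].go.store(true, std::memory_order_release);  // hand over
    }
};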

  15. Experiment Parameters: different levels of contention, number of threads, measured time intervals.
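
A rough sketch of the kind of measurement loop these parameters describe: spawn a given number of threads, let each one hammer a shared object for a fixed time interval, and record per-thread operation counts, from which both throughput and the fairness value defined earlier can be derived. Everything here (names, the atomic counter standing in for a concurrent data structure, the 0.6 s interval) is an illustrative assumption:

// Measurement-loop sketch: N threads operate on a shared object for a fixed
// interval; per-thread counts give throughput and fairness. Illustrative only.
#include <algorithm>
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const unsigned nthreads = std::max(2u, std::thread::hardware_concurrency());
    const auto interval = std::chrono::milliseconds(600);  // e.g. the 0.6 s interval
    std::atomic<bool> stop{false};
    std::atomic<std::uint64_t> shared{0};   // stand-in for a concurrent data structure
    std::vector<std::uint64_t> ops(nthreads, 0);
    std::vector<std::thread> workers;

    for (unsigned i = 0; i < nthreads; ++i)
        workers.emplace_back([&, i] {
            while (!stop.load(std::memory_order_relaxed)) {
                shared.fetch_add(1, std::memory_order_relaxed); // the contended "operation"
                ++ops[i];                                       // private per-thread count
            }
        });

    std::this_thread::sleep_for(interval);
    stop.store(true);
    for (auto& t : workers) t.join();

    std::uint64_t total = 0;
    for (auto n : ops) total += n;
    std::cout << "throughput: " << total / static_cast<double>(interval.count())
              << " ops/ms with " << nthreads << " threads\n";
    // ops[] can now be fed to a fairness computation like the one sketched earlier.
}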

  16. Outline: Introduction; Experiment Setup; Highlights of Study and Results, covering the queue (fairness, Intel vs AMD, throughput vs fairness) and the hash table (Intel vs AMD, scalability); Conclusion.

  17. Observations: Queue. Fairness can change along different time intervals. [Chart: C# (.NET), 24 threads, high contention; fairness (0 to 1) vs measurement interval (400 to 10000 ms) for Intel and AMD with TAS and lock-free queues.]

  18. Observations: Queue. Significantly different fairness behavior in different architectures. [Chart: Java, 24 threads, high contention; fairness vs measurement interval (ms) for Intel with TAS, TTAS, synchronized, and lock-free queues.]

  19. Observations: Queue. Significantly different fairness behavior in different architectures; the lock-free queue is less affected in this case. [Chart: Java, 24 threads, high contention; fairness vs measurement interval (ms) for Intel and AMD with TAS, TTAS, synchronized, and lock-free queues.]

  20. Queue: Throughput vs Fairness. [Charts: C++ on Intel; throughput (operations per ms, thousands) and fairness at the 0.6 s interval vs number of threads (2 to 48) for TTAS, lock-free, and PMutex queues.]
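
For context, the lock-free queue in comparisons like this is typically a CAS-based design in the Michael-Scott style (an assumption here, not a detail stated on the slide). The sketch below shows only the enqueue/dequeue idea and deliberately leaks dequeued nodes, because safe reclamation is exactly the "lock-free memory management" issue that the C++ results bring up later:

// CAS-based (Michael-Scott style) lock-free queue sketch. Dequeued nodes are
// leaked on purpose: safe reclamation ("lock-free memory management") is omitted.
#include <atomic>
#include <utility>

template <typename T>
class LockFreeQueue {
    struct Node {
        T value{};
        std::atomic<Node*> next{nullptr};
    };
    std::atomic<Node*> head_;
    std::atomic<Node*> tail_;
public:
    LockFreeQueue() {
        Node* dummy = new Node();   // sentinel node
        head_.store(dummy);
        tail_.store(dummy);
    }
    void enqueue(T v) {
        Node* n = new Node();
        n->value = std::move(v);
        for (;;) {
            Node* last = tail_.load();
            Node* next = last->next.load();
            if (last != tail_.load()) continue;            // tail moved, retry
            if (next == nullptr) {
                if (last->next.compare_exchange_weak(next, n)) {
                    tail_.compare_exchange_weak(last, n);   // swing tail (best effort)
                    return;
                }
            } else {
                tail_.compare_exchange_weak(last, next);    // help a lagging tail
            }
        }
    }
    bool dequeue(T& out) {
        for (;;) {
            Node* first = head_.load();
            Node* last = tail_.load();
            Node* next = first->next.load();
            if (first != head_.load()) continue;
            if (first == last) {
                if (next == nullptr) return false;          // queue is empty
                tail_.compare_exchange_weak(last, next);    // help a lagging tail
            } else {
                out = next->value;
                if (head_.compare_exchange_weak(first, next))
                    return true;                            // 'first' is never reclaimed here
            }
        }
    }
};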

  21. Observations: Hash table. Operations are distributed over different buckets, and things get interesting when #threads > #buckets. There is a tradeoff between throughput and fairness, with different winners and losers, and contention is lowered in the linked-list components.

  22. Observations: Hash table. Fairness differences across architectures. [Chart: C# (Mono), 24 threads, high contention; fairness vs measurement interval (ms) for Intel with TAS, TTAS, and lock-free hash tables.]

  23. Observations: Hash table. Fairness differences across architectures; lock-free is again not affected. [Chart: C# (Mono), 24 threads, high contention; fairness vs measurement interval (ms) for Intel and AMD with TAS, TTAS, and lock-free hash tables.]

  24. Observations: Hash table. In C++, custom memory management and lock-free implementations excel in scalability and performance. [Charts: successful operations per ms (thousands) vs number of threads (2 to 48); C++ with TAS, TTAS, array lock, PMutex, lock-free, and lock-free with memory management; Java with TAS, TTAS, array lock, synchronized, Reentrant, fair Reentrant, and lock-free.]

  25. Conclusion. Which is the fastest/most scalable? Complex synchronization mechanisms (PMutex, Reentrant lock) pay off in heavily contended hot spots. Scalability comes from more complex, inherently parallel designs and implementations, such as the lock-free hash table. There is a tradeoff between throughput and fairness (Reentrant lock vs array lock vs lock-free queue). Is fairness influenced by NUMA? Fairness can be heavily influenced by the hardware, with interesting exceptions.
