Dynamic Load Balancing on Graphics Processors: A Detailed Study
In this presentation, Daniel Cederman and Philippas Tsigas of Chalmers University of Technology study dynamic load balancing on graphics processors. The talk covers the motivation, the problem setting, static versus dynamic load balancing, task sharing, the CUDA system model, and synchronization techniques, and compares four load-balancing methods in an experimental evaluation on two applications before drawing conclusions.
Presentation Transcript
On Dynamic Load Balancing on Graphics Processors. Daniel Cederman and Philippas Tsigas, Chalmers University of Technology
Overview: motivation, methods, experimental evaluation, conclusion.
The problem setting [diagram: the work consists of tasks; some tasks are known offline, others are created online during execution]
Static Load Balancing [animation: four processors are each statically assigned a task; the tasks spawn subtasks, which stay on the processor that created them]
Dynamic Load Balancing [diagram: the same tasks and subtasks are redistributed across the four processors at run time]
Task sharing [flow chart]: check the termination condition; if all work is done, stop. Otherwise try to get a task from the task set; if no task is available, retry. If a task was acquired, perform it; if performing it created new tasks, add them to the task set and continue.
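This loop maps directly onto code executed by each thread block. The sketch below is a minimal CUDA rendition of it, not the authors' code: the Task struct, the TaskSet interface (workDone, tryGet, add) and performTask are hypothetical names standing in for whichever task set and application are plugged in.

```cuda
// Minimal sketch of the task-sharing loop shown in the flow chart.
// TaskSet, workDone, tryGet, add and performTask are placeholder names.
struct Task { int data; };

// Application-specific work; this stub spawns no new tasks.
__device__ int performTask(const Task &t, Task *created)
{
    (void)t; (void)created;
    return 0;                                // number of new tasks created
}

template <typename TaskSet>
__device__ void workLoop(TaskSet &tasks)
{
    Task t;
    while (!tasks.workDone()) {              // check condition: all work done?
        if (!tasks.tryGet(&t))               // try to get a task
            continue;                        // got no task: retry
        Task created;
        if (performTask(t, &created) > 0)    // perform task; may create new tasks
            tasks.add(created);              // add new task to the task set
    }
}
```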
System Model (CUDA): a number of multiprocessors, each executing thread blocks, with a maximum number of concurrent thread blocks; global memory supports gather and scatter as well as the atomic primitives Compare-And-Swap and Fetch-And-Inc.
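The two atomic primitives named in the system model are exposed in CUDA as atomicCAS (Compare-And-Swap) and atomicAdd (which serves as Fetch-And-Inc when called with 1). The kernel below is only a small usage illustration; the parameter names are invented for the example.

```cuda
// Illustration of the system model's two atomic primitives on global memory.
__global__ void primitivesDemo(unsigned int *counter, unsigned int *word)
{
    // Fetch-And-Inc: atomically add 1 and return the previous value.
    unsigned int old = atomicAdd(counter, 1u);

    // Compare-And-Swap: store 1 only if *word still holds 0; the returned
    // value tells the caller whether it won the race.
    unsigned int seen = atomicCAS(word, 0u, 1u);
    bool won = (seen == 0u);

    (void)old; (void)won;   // a real task set would act on these results
}
```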
Synchronization
Blocking: uses mutual exclusion to allow only one process at a time to access the object.
Lock-free: multiple processes can access the object concurrently; at least one operation in a set of concurrent operations finishes in a finite number of its own steps.
Wait-free: multiple processes can access the object concurrently; every operation finishes in a finite number of its own steps.
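As a rough illustration of the three progress guarantees, the sketch below updates a shared counter in a blocking, a lock-free and a wait-free way. This is an invented example, not code from the paper; the spin lock is shown per call for clarity, while in the paper's setting queue operations are issued by thread blocks, and GPU memory-ordering details are only hinted at.

```cuda
// Blocking, lock-free and wait-free updates of a shared counter (sketch).
__device__ unsigned int lockWord = 0;                 // 0 = free, 1 = held

// Blocking: mutual exclusion; only the lock holder touches the counter.
__device__ void blockingIncrement(unsigned int *value)
{
    while (atomicCAS(&lockWord, 0u, 1u) != 0u) { }    // spin until the lock is free
    volatile unsigned int *v = value;                 // avoid cached reads in the lock
    *v = *v + 1;                                      // critical section
    __threadfence();                                  // publish before releasing
    atomicExch(&lockWord, 0u);                        // release the lock
}

// Lock-free: everyone may try; at least one concurrent CAS always succeeds.
__device__ void lockFreeIncrement(unsigned int *value)
{
    unsigned int old = *value;
    while (true) {
        unsigned int seen = atomicCAS(value, old, old + 1);
        if (seen == old) break;                       // our increment was installed
        old = seen;                                   // lost the race: retry
    }
}

// Wait-free: every call finishes in a bounded number of its own steps.
__device__ void waitFreeIncrement(unsigned int *value)
{
    atomicAdd(value, 1u);                             // single hardware fetch-and-add
}
```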
Load Balancing Methods: Blocking Task Queue, Non-blocking Task Queue, Task Stealing, Static Task List.
Blocking queue [animation: thread blocks TB 1 to TB n share a single queue with Head and Tail pointers, guarded by a lock (shown as Free); a task T1 is enqueued]
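A minimal sketch of such a blocking task queue is given below, assuming that one designated thread per thread block performs the queue operations. The struct layout, names and fixed capacity are choices of this sketch, not the paper's implementation, and capacity and caching details are simplified.

```cuda
// Sketch of a blocking task queue: one lock serializes all access.
#define QUEUE_SIZE 4096

struct Task { int data; };

struct BlockingQueue {
    Task         buf[QUEUE_SIZE];
    unsigned int head, tail;     // dequeue at head, enqueue at tail
    unsigned int lock;           // 0 = free, 1 = held
};

__device__ void bqEnqueue(BlockingQueue *q, Task t)
{
    while (atomicCAS(&q->lock, 0u, 1u) != 0u) { }   // acquire the lock (blocking)
    q->buf[q->tail % QUEUE_SIZE] = t;               // write the task (no full check here)
    q->tail++;
    __threadfence();                                // publish before releasing
    atomicExch(&q->lock, 0u);                       // release the lock
}

__device__ bool bqDequeue(BlockingQueue *q, Task *out)
{
    bool ok = false;
    while (atomicCAS(&q->lock, 0u, 1u) != 0u) { }   // acquire the lock (blocking)
    if (q->head != q->tail) {                       // queue not empty
        *out = q->buf[q->head % QUEUE_SIZE];
        q->head++;
        ok = true;
    }
    __threadfence();
    atomicExch(&q->lock, 0u);                       // release the lock
    return ok;
}
```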
Non-blocking Queue [animation: thread blocks TB 1 to TB n operate concurrently on a shared queue holding tasks T1 to T4 between Head and Tail; a new task T5 is enqueued at the tail]. Reference: P. Tsigas and Y. Zhang, "A Simple, Fast and Scalable Non-Blocking Concurrent FIFO Queue for Shared Memory Multiprocessor Systems", SPAA 2001.
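The sketch below gives the flavor of a lock-free, CAS-based array queue; it is a heavy simplification in the spirit of the cited Tsigas-Zhang queue, not that algorithm itself. Tasks are encoded as non-zero integers with 0 marking an empty slot, and the original's treatment of lagging head/tail updates and ABA problems is omitted.

```cuda
// Heavily simplified lock-free array queue (CAS on slots, helped head/tail).
#define NBQ_SIZE 4096

struct NonBlockingQueue {
    unsigned int slots[NBQ_SIZE];   // 0 = empty, otherwise a task id
    unsigned int head, tail;        // monotonically increasing indices
};

__device__ bool nbqEnqueue(NonBlockingQueue *q, unsigned int task)   // task != 0
{
    while (true) {
        unsigned int t = q->tail;
        if (t - q->head >= NBQ_SIZE) return false;            // queue full
        // Claim the slot with CAS; at least one competing enqueuer succeeds.
        if (atomicCAS(&q->slots[t % NBQ_SIZE], 0u, task) == 0u) {
            atomicCAS(&q->tail, t, t + 1);                    // help advance tail
            return true;
        }
        atomicCAS(&q->tail, t, t + 1);                        // slot taken: help, retry
    }
}

__device__ unsigned int nbqDequeue(NonBlockingQueue *q)        // returns 0 if empty
{
    while (true) {
        unsigned int h = q->head;
        if (h == q->tail) return 0u;                          // queue empty
        unsigned int task = q->slots[h % NBQ_SIZE];
        if (task != 0u &&
            atomicCAS(&q->slots[h % NBQ_SIZE], task, 0u) == task) {
            atomicCAS(&q->head, h, h + 1);                    // help advance head
            return task;
        }
        atomicCAS(&q->head, h, h + 1);                        // lost the race: help, retry
    }
}
```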
Task stealing [animation: each thread block TB 1 to TB n keeps its own set of tasks; TB 1 adds tasks T4 and T5, works its own set down to empty, and then steals task T3 from TB 2]. Reference: N. S. Arora, R. D. Blumofe and C. G. Plaxton, "Thread Scheduling for Multiprogrammed Multiprocessors", SPAA 1998.
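Below is a sketch of one work-stealing deque per thread block in the spirit of the cited Arora-Blumofe-Plaxton scheduler: the owning block pushes and pops at the bottom while other blocks steal from the top with CAS. It is not the paper's implementation; the original's tag field for ABA avoidance and the finer GPU memory-ordering details are left out.

```cuda
// Simplified ABP-style work-stealing deque (one per thread block).
#define DEQUE_SIZE 4096

struct Task { int data; };

struct Deque {
    Task         buf[DEQUE_SIZE];
    unsigned int top;       // steal end, advanced with CAS
    int          bottom;    // owner end, written only by the owner
};

__device__ void pushBottom(Deque *d, Task t)          // owner only
{
    d->buf[d->bottom % DEQUE_SIZE] = t;
    __threadfence();                                  // publish the task first
    d->bottom = d->bottom + 1;                        // then make it visible
}

__device__ bool steal(Deque *d, Task *out)            // called by other blocks
{
    unsigned int t = d->top;
    if ((int)t >= d->bottom) return false;            // deque looks empty
    *out = d->buf[t % DEQUE_SIZE];
    return atomicCAS(&d->top, t, t + 1) == t;         // one thief (or the owner) wins
}

__device__ bool popBottom(Deque *d, Task *out)        // owner only
{
    int b = d->bottom - 1;
    d->bottom = b;                                    // announce the pop first
    __threadfence();
    unsigned int t = d->top;
    if (b < (int)t) { d->bottom = (int)t; return false; }   // already empty
    *out = d->buf[b % DEQUE_SIZE];
    if (b > (int)t) return true;                      // more than one task left
    // Exactly one task left: race the thieves with CAS on top.
    bool won = (atomicCAS(&d->top, t, t + 1) == t);
    d->bottom = (int)t + 1;
    return won;
}
```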
Static Task List [animation: tasks T1 to T4 in the In array are assigned one per thread block TB 1 to TB 4; newly created tasks T5, T6 and T7 are written to the Out array]
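The sketch below shows the idea in CUDA: the In array is divided statically among the thread blocks, newly created tasks are appended to the Out array with Fetch-And-Inc, and (on the host, not shown) the two arrays are swapped and the kernel relaunched until no new tasks appear. The names, the one-task-per-block convention and the performTask stub are assumptions of this sketch.

```cuda
// Sketch of the static task list: static assignment from In, atomic append to Out.
struct Task { int data; };

// Application-specific work; this stub creates no new task.
__device__ Task performTask(const Task &t, bool *spawned)
{
    *spawned = false;
    return t;
}

__global__ void staticTaskList(const Task *in, int inCount,
                               Task *out, unsigned int *outCount)
{
    // Static assignment: block i handles every gridDim.x-th task in the In array.
    for (int i = blockIdx.x; i < inCount; i += gridDim.x) {
        if (threadIdx.x == 0) {                           // one logical task per block
            bool spawned = false;
            Task child = performTask(in[i], &spawned);
            if (spawned) {
                unsigned int slot = atomicAdd(outCount, 1u);   // Fetch-And-Inc
                out[slot] = child;                        // append new task to Out
            }
        }
    }
}
```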
Octree Partitioning: bandwidth bound.
Four-in-a-row: computation intensive.
Graphics Processors
8800GT: 14 multiprocessors, 57 GB/sec bandwidth
9600GT: 8 multiprocessors, 57 GB/sec bandwidth
Blocking Queue, Octree/9600GT [graph: time (ms) vs. number of threads and number of blocks, 16 to 128 each]
Blocking Queue, Octree/8800GT [graph: time (ms) vs. number of threads and number of blocks]
Blocking Queue, Four-in-a-row [graph: time (ms) vs. number of threads and number of blocks]
Non-blocking Queue, Octree/9600GT [graph: time (ms) vs. number of threads and number of blocks]
Non-blocking Queue, Octree/8800GT [graph: time (ms) vs. number of threads and number of blocks]
Non-blocking Queue, Four-in-a-row [graph: time (ms) vs. number of threads and number of blocks]
Task stealing, Octree/9600GT [graph: time (ms) vs. number of threads and number of blocks]