Dynamic Load Balancing on Graphics Processors: A Detailed Study

Slide Note
Embed
Share

In this comprehensive study by Daniel Cederman and Philippas Tsigas from Chalmers University of Technology, the focus is on dynamic load balancing on graphics processors. The research delves into the motivation, methods, experimental evaluations, and conclusions related to this critical area. It covers aspects such as the problem setting, static load balancing, dynamic load balancing, task sharing, system model, and synchronization techniques. Key concepts such as processor utilization, task allocation, task management, and system efficiency are thoroughly explored in this insightful work.


Uploaded on Sep 15, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. On Dynamic Load Balancing on Graphics Processors Daniel Cederman and Philippas Tsigas Chalmers University of Technology

  2. Overview Motivation Methods Experimental evaluation Conclusion

  3. The problem setting Work Offline Task Task Task Task Task Task Task Online Task Task Task Task

  4. Static Load Balancing Processor Processor Processor Processor

  5. Static Load Balancing Processor Processor Processor Processor Task Task Task Task

  6. Static Load Balancing Processor Processor Processor Processor Task Task Task Task

  7. Static Load Balancing Processor Processor Processor Processor Task Task Task Task Subtask Subtask Subtask Subtask

  8. Static Load Balancing Processor Processor Processor Processor Task Task Task Task Subtask Subtask Subtask Subtask

  9. Dynamic Load Balancing Processor Processor Processor Processor Task Task Task Task Subtask Subtask Subtask Subtask

  10. Task sharing Check condition Work done? Done Acquire Task Task Set Try to get task Task Got task? No, retry Task Task Perform task Task No, continue New tasks? Task Add Task Add task

  11. System Model CUDA Global Memory Global Memory Multi- processor Multi- processor Multi- processor Gather and scatter Thread Block Thread Block Thread Block Compare-And-Swap Fetch-And-Inc Thread Block Thread Block Thread Block Multiprocessors Thread Block Thread Block Thread Block Maximum number of concurrent thread blocks

  12. Synchronization Blocking Uses mutual exclusion to only allow one process at a time to access the object. Lockfree Multiple processes can access the object concurrently. At least one operation in a set of concurrent operations finishes in a finite number of its own steps. Waitfree Multiple processes can access the object concurrently. Every operation finishes in a finite number of its own steps.

  13. Load Balancing Methods Blocking Task Queue Non-blocking Task Queue Task Stealing Static Task List

  14. Blocking queue Free TB 1 Head TB 2 Tail TB n

  15. Blocking queue Free TB 1 Head TB 2 Tail TB n

  16. Blocking queue Free TB 1 Head TB 2 T1 Tail TB n

  17. Blocking queue Free TB 1 Head TB 2 T1 Tail TB n

  18. Blocking queue Free TB 1 Head TB 2 T1 Tail TB n

  19. Non-blocking Queue TB 1 TB 1 Head TB 2 TB 2 T1 T2 T3 T4 Tail TB n Reference P. Tsigas and Y. Zhang, A simple, fast and scalable non-blocking concurrent FIFO queue for shared memory multiprocessor systems [SPAA01]

  20. Non-blocking Queue TB 1 TB 1 Head TB 2 TB 2 T1 T2 T3 T4 Tail TB n

  21. Non-blocking Queue TB 1 TB 1 Head TB 2 TB 2 T1 T2 T3 T4 Tail TB n

  22. Non-blocking Queue TB 1 TB 1 Head TB 2 TB 2 T1 T2 T3 T4 Tail TB n

  23. Non-blocking Queue TB 1 TB 1 Head TB 2 TB 2 T1 T2 T3 T4 T5 Tail TB n

  24. Non-blocking Queue TB 1 TB 1 Head TB 2 TB 2 T1 T2 T3 T4 T5 Tail TB n

  25. Task stealing TB 1 T1 TB 2 T3 T2 TB n Reference Arora N. S., Blumofe R. D., Plaxton C. G. , Thread Scheduling for Multiprogrammed Multiprocessors [SPAA 98]

  26. Task stealing TB 1 T1 T4 TB 2 T3 T2 TB n

  27. Task stealing TB 1 T1 T4 T5 TB 2 T3 T2 TB n

  28. Task stealing TB 1 T1 T4 TB 2 T3 T2 TB n

  29. Task stealing TB 1 T1 TB 2 T3 T2 TB n

  30. Task stealing TB 1 TB 2 T3 T2 TB n

  31. Task stealing TB 1 TB 2 T2 TB n

  32. Static Task List In T1 T2 T3 T4

  33. Static Task List In T1 TB 1 T2 TB 2 T3 TB 3 T4 TB 4

  34. Static Task List In Out T1 TB 1 T2 TB 2 T3 TB 3 T4 TB 4

  35. Static Task List In Out T1 TB 1 T5 T2 TB 2 T3 TB 3 T4 TB 4

  36. Static Task List In Out T1 TB 1 T5 T2 TB 2 T6 T3 TB 3 T4 TB 4

  37. Static Task List In Out T1 TB 1 T5 T2 TB 2 T6 T3 TB 3 T7 T4 TB 4

  38. Octree Partitioning Bandwidth bound

  39. Octree Partitioning Bandwidth bound

  40. Octree Partitioning Bandwidth bound

  41. Octree Partitioning Bandwidth bound

  42. Four-in-a-row Computation intensive

  43. Graphics Processors 8800GT 9600GT 14 Multiprocessors 8 Multiprocessors 57 GB/sec bandwidth 57 GB/sec bandwidth

  44. Blocking Queue Octree/9600GT Time (ms) Time (ms) 500 600 400 400 300 200 200 0 16 16 32 48 64 80 96 112 128 32 48 64 Threads 80 Blocks 96 112 128

  45. Blocking Queue Octree/8800GT Time (ms) Time (ms) 800 600 800 600 400 400 200 200 0 16 16 32 48 64 80 96 112 128 32 48 64 Threads 80 Blocks 96 112 128

  46. Blocking Queue Four-in-a-row Time (ms) Time (ms) 2500 2000 2500 1500 2000 1000 1500 500 1000 500 0 16 16 32 48 64 80 96 112 128 32 48 64 Threads 80 Blocks 96 112 128

  47. Non-blocking Queue Octree/9600GT Time (ms) Time (ms) 200 250 150 200 150 100 100 50 0 16 16 32 48 64 80 96 112 128 32 48 64 Threads 80 Blocks 96 112 128

  48. Non-blocking Queue Octree/8800GT Time (ms) Time (ms) 200 250 200 150 150 100 100 50 0 16 16 32 48 64 80 96 112 128 32 48 64 Threads 80 Blocks 96 112 128

  49. Non-blocking Queue - Four-in-a-row Time (ms) Time (ms) 200 200 150 150 100 100 50 0 16 16 32 48 64 80 96 112 128 32 48 64 Threads 80 Blocks 96 112 128

  50. Task stealing Octree/9600GT Time (ms) Time (ms) 200 250 150 200 100 150 50 100 0 50 0 16 16 32 48 64 80 96 112 128 32 48 64 Threads 80 Blocks 96 112 128

Related