Zorua: A Holistic Approach to Resource Virtualization in GPUs


This paper presents Zorua, a holistic resource virtualization framework for GPUs that aims to reduce the dependence of performance on programmer-specified resource usage, enhance resource efficiency in optimized code, and improve programming ease and performance portability. It addresses both static and dynamic underutilization of on-chip resources in GPUs, offering a mechanism to manage the major on-chip resources efficiently. The presentation discusses the challenges in managing GPU resources, the importance of decoupling programmer-specified resource usage from hardware allocation, and the benefits of Zorua across different GPUs and classes of applications.





Presentation Transcript


  1. Zorua: A Holistic Approach to Resource Virtualization in GPUs. Nandita Vijaykumar, Kevin Hsieh, Gennady Pekhimenko, Samira Khan, Ashish Shrestha, Saugata Ghose, Adwait Jog, Phillip B. Gibbons, Onur Mutlu

  2. Overview
  Problem: Major on-chip resources in GPUs are managed by the programmer/software.
  Key Issues: This leads to several challenges in obtaining high performance:
  - Programming Ease: requires programmer effort to optimize resource usage
  - Performance Portability: optimizations do not port well across different GPUs
  - Resource Inefficiency: resources remain underutilized even in optimized code
  Our Goal: Reduce the dependence of performance on programmer-specified resource usage; enhance resource efficiency for optimized code.
  Our Approach: Decouple the programmer-specified resource usage from the allocation in the hardware.
  Zorua: A Holistic Resource Virtualization Framework for GPUs.
  Key Results: Zorua enhances programming ease, performance portability, and performance for optimized code.

  3. GPUs today are used across many classes of applications: computer vision, machine learning, biomedical imaging, scientific simulation, and more.

  4. On-Chip Resources in GPUs: register file, scratchpad memory, thread slots, compute units.

  5. Every thread in a thread block (work group) needs to be allocated enough (worst-case) resources to execute and complete. The specification per block, <#Threads, #Registers, Scratchpad (KB)>, maps onto the on-chip thread slots, register file, and scratchpad memory.

  6. Abstraction of On-Chip Resources
  The programmer/software specifies <#Threads, #Registers, Scratchpad (KB)> per block. This creates a tight coupling between the resource specification and the hardware allocation of thread slots, register file, and scratchpad memory.

  7. Key Issues 1. Static Underutilization 2. Dynamic Underutilization 7

  8. 1. Static Underutilization
  With 20 thread slots in hardware and <#Threads, #Registers, Scratchpad (KB)> specified per block:
  - 10 threads/block: 2 thread blocks fit, so 20 threads run in parallel
  - 11 threads/block: only 1 thread block fits, so 11 threads run and 9 slots sit idle
  Static underutilization may lead to loss in parallelism.
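The loss of parallelism on this slide is simple arithmetic: thread slots are handed out in whole-block units. A minimal sketch (the 20-slot hardware and the block sizes come from the slide; the function name is invented for illustration):

```python
def blocks_that_fit(hw_thread_slots, threads_per_block):
    """Thread blocks that fit when slots are allocated per whole block."""
    return hw_thread_slots // threads_per_block

# 20 hardware thread slots, as in the slide
print(blocks_that_fit(20, 10))  # 2 blocks: all 20 slots used
print(blocks_that_fit(20, 11))  # 1 block: 11 slots used, 9 wasted
```

One extra thread per block halves the parallelism, even though only one more slot per block is actually needed.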

  9. To make things worse: the same problem exists for the other on-chip resources (registers, scratchpad memory, thread blocks), and the programmer needs to get it right for all of them at the same time.

  10. Implication 1: Programming Ease
  Avoiding sub-optimal specifications requires programmer effort. [Plot: normalized execution time vs. threads/block (0 to 1024) for MST (Minimum Spanning Tree), showing sharp performance cliffs.]

  11. Implication 2: Performance Portability
  Programs need to be retuned to fit different GPUs. [Plot: normalized execution time vs. threads/block (0 to 512) for DCT (Discrete Cosine Transform) on Fermi, Kepler, and Maxwell.]

  12. Key Issues 1. Static Underutilization 2. Dynamic Underutilization 12

  13. 2. Dynamic Underutilization
  Resource requirements of a thread vary throughout execution. Implication: resource inefficiency due to worst-case static allocation.

      __global__ void CUDAkernel2DCT(float *dst, float *src, int I) {
          int OffsThreadInRow = threadIdx.y * B + threadIdx.x;
          ...
          for (unsigned int i = 0; i < B; i++)    // 16 regs
              bl_ptr[i * X] = src[i * I];
          __syncthreads();
          ...
          CUDAsubroutineInplaceDCTvector( );      // 32 regs
          for (unsigned int i = 0; i < B; i++)    // 16 regs
              dst[i * I] = bl_ptr[i * X];
      }
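The cost of worst-case static allocation can be quantified: the kernel above needs 32 registers per thread only in its middle phase, yet every thread holds 32 for its whole lifetime. A sketch using the per-phase counts from the slide (the equal-length-phase assumption is ours, for illustration):

```python
# Per-thread register needs of the three phases shown on the slide
phase_regs = [16, 32, 16]

static_alloc = max(phase_regs)  # worst-case allocation: 32 regs held throughout
# assuming, for illustration, that the phases run for equal time
avg_need = sum(phase_regs) / len(phase_regs)

idle_fraction = 1 - avg_need / static_alloc
print(f"{idle_fraction:.0%} of the allocated registers sit idle on average")
```

Under these assumptions a third of the register file allocated to this kernel does no useful work.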

  14. Our Goal Reduce the dependence of performance on resource specification Programming Ease Performance Portability Improve efficiency of resource utilization Higher performance for optimized code 14

  15. Outline Problem: Tight Coupling Key Implications Our Goal Our Approach: Zorua Virtualization Strategy Design Challenges Design Ideas Evaluation 15

  16. Our Approach
  Decouple the resource specification (the virtual resources seen by the programmer/software) from the allocation of the hardware resources, and virtualize multiple on-chip resources: thread slots, register file, scratchpad memory.
  Zorua: A Holistic Virtualization Approach

  17. How do we design a virtualization strategy to effectively address the key issues? 17

  18. 1. Static Underutilization
  With 20 physical thread slots but a flexible (virtualized) amount of resources, two 11-thread blocks (22 threads) can run instead of one. Flexibility in the available resources helps restore parallelism.

  19. Addressing Key Issues
  1. Static Underutilization: provide an illusion of a flexible amount of resources
  2. Dynamic Underutilization: enable dynamic allocation/deallocation of resources

  20. Outline Problem: Tight Coupling Key Implications Our Goal Our Approach: Zorua Virtualization Strategy Design Challenges Design Ideas Evaluation 20

  21. Zorua: Virtualization Strategy
  Virtual resources (thread slots, register file, scratchpad memory) are mapped onto their physical counterparts. Fine-grained dynamic allocation provides resource efficiency.

  22. Zorua: Virtualization Strategy
  Careful oversubscription, using a swap space in the memory hierarchy, provides flexibility in the amount of resources beyond what is physically available.
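The two strategy slides can be combined into one sketch: an allocator that hands out a virtualized resource backed by a limited physical pool plus a swap space, with a cap on oversubscription. All names and the 1.2x cap are hypothetical; the real Zorua makes these decisions in a hardware runtime.

```python
class VirtualResource:
    """Toy model of one virtualized on-chip resource (e.g., registers)."""

    def __init__(self, physical, max_oversub=1.2):
        self.physical = physical
        self.budget = int(physical * max_oversub)  # oversubscription cap
        self.allocated = 0

    def acquire(self, amount):
        if self.allocated + amount > self.budget:
            return False              # beyond the cap: do not schedule
        self.allocated += amount
        return True

    def release(self, amount):
        self.allocated -= amount

    def spilled(self):
        """Amount currently backed by the swap space, not on-chip."""
        return max(0, self.allocated - self.physical)

regs = VirtualResource(physical=100)
assert regs.acquire(90)        # fits entirely on chip
assert regs.acquire(20)        # allowed: within the 120-unit budget
print(regs.spilled())          # 10 units live in the swap space
assert not regs.acquire(20)    # beyond the cap: the phase stays pending
```

The cap is the knob from Challenge 1 below: too low and flexibility is lost, too high and expensive spills dominate.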

  23. Outline Problem: Tight Coupling Key Implications Our Goal Our Approach: Zorua Virtualization Strategy Design Challenges Design Ideas Evaluation 23

  24. Zorua: Design Challenges
  Challenge 1: Controlling the extent of oversubscription (spills are expensive)
  Challenge 2: Coordinating the virtualization of multiple on-chip resources (resources are independently virtualized; resource requirements vary during execution)

  25. Zorua Design: Key Questions
  How do we determine the variation in resource requirements? How do we use this knowledge to control how much we oversubscribe at run time, and to coordinate the allocation of multiple resources to maximize parallelism within the oversubscription budget?

  26. Outline Problem: Tight Coupling Key Implications Our Goal Our Approach: Zorua Virtualization Strategy Design Challenges Design Ideas Evaluation 26

  27. Component 1: The Compiler
  Leverage software to determine the variation in resource requirements. Variation in resource requirements tends to occur in fine-grained phases. Use the compiler to statically partition the program into phases and add annotations with per-phase resource requirements:

      __global__ void CUDAkernel2DCT(float *dst, float *src, int I) {
          int OffsThreadInRow = threadIdx.y * B + threadIdx.x;
          for (unsigned int i = 0; i < B; i++)    // 16 regs
              bl_ptr[i * X] = src[i * I];
          __syncthreads();
          ...
          CUDAsubroutineInplaceDCTvector( );      // 32 regs
          for (unsigned int i = 0; i < B; i++)    // 16 regs
              dst[i * I] = bl_ptr[i * X];
      }
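What the compiler emits can be pictured as a small per-phase resource table attached to the kernel. A sketch (the table format and field names are invented for illustration; only the register counts come from the slide):

```python
# Hypothetical per-phase annotations the compiler could attach to
# CUDAkernel2DCT; a phase ends at a barrier or at a change in needs.
phases = [
    {"id": 0, "regs": 16, "ends_at": "__syncthreads()"},
    {"id": 1, "regs": 32, "ends_at": "subroutine return"},
    {"id": 2, "regs": 16, "ends_at": "kernel exit"},
]

def regs_needed(phase_id):
    """The runtime only ever needs the current phase's requirement,
    not the kernel-wide worst case."""
    return phases[phase_id]["regs"]

print([regs_needed(p["id"]) for p in phases])  # [16, 32, 16]
```

Between phases 0 and 2, 16 of the 32 worst-case registers can be released to other warps.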

  28. Zorua Design: Key Questions
  How do we determine the variation in resource requirements? How do we use this knowledge to control how much we oversubscribe at run time, and to coordinate the allocation of multiple resources to maximize parallelism within the oversubscription budget?

  29. Component 2: Hardware Runtime System
  [Diagram: phases move from Pending to Schedulable once the runtime allocates their required resources; the warp scheduler issues schedulable phases; resources are deallocated when a phase is Dead.]
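The phase lifecycle on this slide (pending, schedulable, dead) can be sketched as a small state machine around a pool of free resources. Names and dictionary shapes are hypothetical; the real coordinator is a hardware runtime, not software.

```python
def coordinator_step(phase, free):
    """Advance one phase through the pending -> schedulable -> dead lifecycle."""
    if phase["state"] == "pending":
        need = phase["need"]
        if all(free[r] >= n for r, n in need.items()):
            for r, n in need.items():       # allocate required resources
                free[r] -= n
            phase["state"] = "schedulable"  # warp scheduler may now issue it
    elif phase["state"] == "dead":
        for r, n in phase["need"].items():  # deallocate on phase death
            free[r] += n
    return phase["state"]

free = {"regs": 32, "scratchpad_kb": 8}
p = {"state": "pending", "need": {"regs": 32, "scratchpad_kb": 4}}
print(coordinator_step(p, free))  # schedulable
p["state"] = "dead"               # phase finished executing
coordinator_step(p, free)
print(free)                       # all resources returned to the pool
```

A phase that cannot acquire all of its resources simply stays pending, which is how the coordinator bounds oversubscription across multiple resources at once.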

  30. Putting It All Together
  Zorua: a hardware-software cooperative framework.
  - The compiler: annotates the program to partition it into phases and specify the resource needs of each phase.
  - The coordinator: a hardware runtime system that makes oversubscription decisions and allocates/deallocates resources.
  - Hardware virtualization support: mapping tables for each resource (1.85 kB, 0.134% of the die area) and machinery to swap data between the on-chip hardware and the swap space.

  31. Outline Problem: Tight Coupling Key Implications Our Goal Our Approach: Zorua Virtualization Strategy Design Challenges Design Ideas Evaluation 31

  32. Methodology
  Evaluation infrastructure: real GPUs (Fermi/Kepler/Maxwell), GPGPU-Sim, GPUWattch.
  Workloads: Lonestar, CUDA SDK.
  System parameters:
  - 15 SMs, 32 threads/warp
  - Warps/SM: Fermi: 48; Kepler/Maxwell: 64
  - Registers: Fermi: 32768; Kepler/Maxwell: 65536
  - Scratchpad: Fermi/Kepler: 48 KB; Maxwell: 64 KB
  - Core: 1.4 GHz, GTO scheduler, 2 schedulers/SM
  - Memory: 177.4 GB/s bandwidth, 6 GDDR5 memory controllers
  Overheads of Zorua: 2-cycle latency for the mapping table lookup for each resource; memory requests for swap space accesses.

  33. Effect on Performance Variation
  Zorua reduces the dependence of performance on resource specification. [Box plot: performance variation (maximum, upper quartile, average, lower quartile, minimum) across different specification points for Baseline, WLM*, and Zorua.]
  * WLM: Xiang et al., HPCA 2014

  34. Effect on Performance Cliffs
  Zorua alleviates the performance cliffs resulting from unoptimized specifications. [Plots: normalized execution time for Baseline, WLM, and Zorua, vs. scratchpad/block (9000 to 49000) for NQU and vs. threads/block (256 to 1024) for MST.]

  35. Effect on Performance Portability
  [Bar chart: maximum porting performance loss (0% to 150%) for Baseline, WLM, and Zorua across BH, DCT, MST, NQU, RD, SLA, SP, SSSP, and AVG; annotated values: 53%, 24%.]

  36. Other Uses Resource sharing in multi-programmed environments Low latency preemption of kernels Dynamic parallelism 36

  37. Conclusion
  Problem: The tight coupling between programmer-specified resource usage and the allocation of on-chip resources leads to challenges in programming ease, performance portability, and resource efficiency.
  Our Approach: Decouple the specification and management of on-chip resources.
  Our Solution: Zorua, a holistic approach to virtualizing multiple on-chip resources in GPUs.
  Key Results: Zorua reduces the dependence of performance on programmer-specified resource usage, enhances programming ease and performance portability, and improves performance through more efficient resource utilization.
  Future Work: Zorua enables several other use cases.

  38. Zorua: A Holistic Approach to Resource Virtualization in GPUs. Nandita Vijaykumar, Kevin Hsieh, Gennady Pekhimenko, Samira Khan, Ashish Shrestha, Saugata Ghose, Adwait Jog, Phillip B. Gibbons, Onur Mutlu


  40. A Walkthrough
  [Diagram: the coordinator sits between the thread block scheduler and the warp scheduler. Phases acquire resources through the thread, scratchpad, and register queues; the thread, scratchpad, and register mapping tables track the virtualization; resources are released when phases complete.]

  41. Effect on schedulable warps 41

  42. Effect on energy consumption 42

  43. Summary of applications 43
