Energy-Efficient GPU Design with Spatio-Temporal Shared-Thread Speculative Adders


Explore the significance of GPUs in modern systems, with emphasis on their widespread adoption and performance improvements over the years. The focus is on the need for low-power adders in GPUs due to high arithmetic intensity in GPU workloads.


Uploaded on Sep 15, 2024



Presentation Transcript


  1. ST2 GPU: An Energy-Efficient GPU Design with Spatio-Temporal Shared-Thread Speculative Adders. Vijay Kandiah, Ali Murat Gok, Georgios Tziantzioulis, Nikos Hardavellas. Parallel Architecture Group @ Northwestern. San Francisco, California, Dec 5-9, 2021

  2. GPUs are Important Systems
  Widespread adoption of GPUs (AI, machine learning, HPC, ...)
  70% of top 50 HPC applications are GPU-accelerated
  29% of TOP500 supercomputer systems are GPU-accelerated (and growing)
  [Pie chart: % of TOP500 systems with GPUs, broken down by model: NVIDIA Volta GV100, Pascal GP100, Ampere GA100, and others (2.6%)]

  3. Core Count Increases but Power Stays the Same
  Fewer Watts per CUDA core
  [Bar chart: CUDA core counts (480 up to 10,752) vs. Thermal Design Power (roughly 195-300 W) across eleven NVIDIA GPUs: Fermi GF100 (2010), Fermi GF110 (2011), Kepler GK104 (2012), Kepler GK110 (2013), Maxwell GM200 (2015), Pascal GP100 (2016), Pascal GP102 (2017), Volta GV100 (2017), Turing TU102 (2018), Ampere GA100 (2020), Ampere GA102 (2020)]

  4. GPU Workloads Perform a Lot of Additions
  >20% of executed instructions are ALU & FPU instructions in 21 of 23 kernels
  On average, 32% of dynamic instructions are ALU+FPU adds
  [Stacked-bar chart: dynamic instruction mix (ALU Add, Other ALU, FPU Add, Other FPU, Other) for the 23 evaluated kernels plus the average]

  5. Motivation
  Widespread adoption of GPUs (AI, machine learning, HPC, ...)
  Fewer Watts per CUDA core
  High arithmetic intensity in GPU workloads
  We need low-power adders in GPUs!

  6. Voltage-Scaled Sliced Adders
  Conventional adder: a 64-bit KSA (Kogge-Stone adder) that occupies the full clock period

  7. Voltage-Scaled Sliced Adders
  Conventional adder: a 64-bit KSA occupying the full clock period
  Sliced adder: eight 8-bit KSA slices (Slice 0, Slice 1, Slice 2, ...), each passing its carry-out to the slice above, still within the clock period
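The slicing idea above can be modeled in a few lines of C. This is a minimal software sketch, not the paper's RTL; the function name `sliced_add64` and the carry-out parameter are illustrative choices.

```c
#include <stdint.h>

/* Model of a sliced adder: a 64-bit addition decomposed into eight
   8-bit slices, each consuming the carry-out of the slice below it.
   In hardware each slice would be a small Kogge-Stone adder; here a
   slice is just an 8-bit add with carry. */
uint64_t sliced_add64(uint64_t a, uint64_t b, unsigned *carry_out) {
    uint64_t sum = 0;
    unsigned carry = 0;
    for (int s = 0; s < 8; s++) {
        unsigned ai = (a >> (8 * s)) & 0xFF;
        unsigned bi = (b >> (8 * s)) & 0xFF;
        unsigned t = ai + bi + carry;            /* one 8-bit slice */
        sum |= (uint64_t)(t & 0xFF) << (8 * s);
        carry = (t >> 8) & 1;                    /* into next slice */
    }
    if (carry_out) *carry_out = carry;
    return sum;
}
```

Without speculation the slices form a ripple chain: slice 7 cannot start until slice 6's carry arrives, which is why the slides next add carry prediction.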

  8. Voltage-Scaled Sliced Adders
  Conventional adder: a 64-bit KSA occupying the full clock period
  Sliced speculative adder: a carry-prediction unit supplies a speculated carry to each slice, so all 8-bit KSA slices compute in parallel and finish with time slack inside the clock period

  9. Voltage-Scaled Sliced Adders
  Conventional adder: a 64-bit KSA occupying the full clock period
  Voltage-scaled sliced adder: carry prediction feeds speculated carries to 8-bit Vdd-scaled KSA slices; the time slack created by speculation is traded for a lower supply voltage

  10. Voltage-Scaled Sliced Adders
  Voltage-scaled sliced adder: carry prediction feeds speculated carries to 8-bit Vdd-scaled KSA slices
  We need innovation in:
  - Accurate and efficient carry prediction
  - Guaranteed correctness
  - Integration into the GPU pipeline
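The speculate-then-check step can be sketched as follows. This is a behavioral model under assumptions of my own (the `spec_result` struct and `speculative_add64` name are hypothetical): every slice adds with a predicted carry-in in parallel, then the predictions are checked against the carries the slices actually produced; any mismatch forces a re-execution, which is the 2x-latency penalty the next slide quantifies.

```c
#include <stdint.h>

typedef struct {
    uint64_t sum;        /* possibly wrong if mispredicted */
    int mispredicted;    /* 1 => must re-execute with real carries */
} spec_result;

/* One speculative 64-bit add over eight 8-bit slices. cpred[s] is the
   predicted carry-in of slice s (cpred[0] is unused; slice 0 gets 0). */
spec_result speculative_add64(uint64_t a, uint64_t b, const unsigned cpred[8]) {
    uint64_t sum = 0;
    unsigned cout[8];
    for (int s = 0; s < 8; s++) {            /* all slices in parallel in HW */
        unsigned ai = (a >> (8 * s)) & 0xFF;
        unsigned bi = (b >> (8 * s)) & 0xFF;
        unsigned cin = (s == 0) ? 0 : cpred[s];
        unsigned t = ai + bi + cin;
        sum |= (uint64_t)(t & 0xFF) << (8 * s);
        cout[s] = (t >> 8) & 1;
    }
    int err = 0;
    for (int s = 1; s < 8; s++)              /* check consumed predictions */
        if (cpred[s] != cout[s - 1]) err = 1;
    return (spec_result){ sum, err };
}
```

The check makes correctness unconditional: a wrong prediction never escapes, it only costs time, which is why prediction accuracy drives both energy and performance.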

  11. Carry Prediction in GPUs is Challenging
  Must be accurate: correctness is required, but fixing each misprediction takes 2x latency; if even one thread mispredicts, all 32 threads in the warp stall for the extra cycles
  Must be efficient: GV100 is an 815 mm2 chip at 1.45 GHz with 5,120 ALU and 5,120 FPU cores, making >66,000 predictions per cycle per chip; we need fast predictions with practically no area overhead
  Must be integrated into the GPU pipeline: 163,840 concurrent threads mean too much bookkeeping and contention on history updates, and variable-latency ADDs are needed to guarantee correctness

  12. ST2 GPU Contributions
  Observe & quantify spatio-temporal value correlation
  Design-space exploration of carry speculation mechanisms
  ST2 adders: speculative, guarantee correctness; save 75-87% of adder energy and outperform the state-of-the-art by 67%
  ST2 GPU: integrates ST2 adders into the GPU warp pipeline; reduces GPU chip energy consumption by 21% with practically no impact on performance and area

  13. How to Predict Carry Chains?
  Hot loop in a real-world GPU workload, Pathfinder:
  for (int i = 0; i < iteration; i++) {
      ...
      if ((tx >= (i+1)) && (tx <= (BLOCK_SIZE-2-i)) && isValid) {
          ...
          int shortest = MIN(left, up);
          shortest = MIN(shortest, right);
          int index = cols*(startStep+i) + xidx;
          result[tx] = shortest + gpuWall[index];
      }
      ...
  }

  14. How to Predict Carry Chains?
  (Same Pathfinder loop as slide 13, with its arithmetic operations labeled PC1 through PC7)

  15. How to Predict Carry Chains?
  [Scatter plot: arithmetic-op results of PC1-PC7 for iteration 1, plotted against logical time (0-30); values range from about -100 up to 400,000]

  16. How to Predict Carry Chains?
  [Same plot as slide 15]
  Temporal correlation is not enough!

  17. How to Predict Carry Chains?
  [Scatter plot as on slide 15, extended with iterations 2, 3, and 4]

  18. How to Predict Carry Chains?
  [Same plot as slide 17]
  Spatio-temporal value correlation

  19. Quantifying Spatio-Temporal Value Correlation
  Temporal correlation (Prev + Gtid): 50% average carry prediction accuracy
  [Bar chart: per-kernel carry prediction accuracy across the 23 evaluated kernels]

  20. Quantifying Spatio-Temporal Value Correlation
  Temporal correlation (Prev + Gtid): 50% average carry prediction accuracy
  Spatio-temporal correlation (Prev + FullPC + Gtid): 83% average accuracy
  [Bar chart: per-kernel carry prediction accuracy across the 23 evaluated kernels]

  21. Quantifying Spatio-Temporal Value Correlation
  Spatio-temporal correlation (Prev + FullPC + Gtid): 83% average accuracy
  But challenging to implement: per-thread bookkeeping for 2048 threads / SM
  [Bar chart: per-kernel carry prediction accuracy across the 23 evaluated kernels]

  22. Quantifying Spatio-Temporal Value Correlation
  Spatio-temporal shared-thread correlation (Prev + FullPC + Ltid): 89% average accuracy
  [Bar chart: per-kernel carry prediction accuracy across the 23 evaluated kernels]

  23. Exploring Carry Speculation Mechanisms
  Unsurprisingly, static-zero speculation suffers high error rates (static-one is much worse)
  [Bar chart: average thread misprediction rate, y-axis 0-40%]

  24. Exploring Carry Speculation Mechanisms
  VaLHALLA (state-of-the-art): reduces mispredictions via dynamic speculation (25% reduction), but dynamic speculation is not always necessary
  [Bar chart: average thread misprediction rate, y-axis 0-40%]

  25. Exploring Carry Speculation Mechanisms
  Peek at the MSb of the inputs to the previous slice to make a static prediction; use a dynamic prediction only when a static one is not possible
  Still based on VaLHALLA: broadcasts the same single-bit prediction to all slices
  [Bar chart: average thread misprediction rate, y-axis 0-40%]
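The MSb peek works because two of the three cases are certain: if both operand bytes feeding the slice below have MSb 0, their sum (plus any carry-in) fits in 8 bits, so the carry-out must be 0; if both MSbs are 1, the sum is at least 256, so the carry-out must be 1. Only the mixed case needs history. A small sketch (the function name `static_carry_peek` is an illustrative choice, not from the paper):

```c
#include <stdint.h>

/* Static carry peek for slice `slice` (1..7): inspect the MSb of both
   operand bytes of the slice below. Returns the guaranteed carry (0 or 1),
   or -1 when the MSbs differ and a dynamic prediction is needed. */
int static_carry_peek(uint64_t a, uint64_t b, int slice) {
    unsigned msb_a = (a >> (8 * (slice - 1) + 7)) & 1;
    unsigned msb_b = (b >> (8 * (slice - 1) + 7)) & 1;
    if (msb_a == 0 && msb_b == 0) return 0;  /* max 127+127+1 < 256: no carry */
    if (msb_a == 1 && msb_b == 1) return 1;  /* min 128+128 >= 256: carry     */
    return -1;                               /* mixed MSbs: use history       */
}
```

These "disciplined static predictions" are never wrong by construction, so they shrink the fraction of carries that the dynamic predictor can get wrong.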

  26. Exploring Carry Speculation Mechanisms
  Instead, predict the carry per slice using previous history (last arithmetic op.), i.e., temporal correlation: 26% reduction
  But temporal correlation alone is not enough
  [Bar chart: average thread misprediction rate, y-axis 0-40%]

  27. Exploring Carry Speculation Mechanisms
  Spatio-temporal correlation: keep a history of the last 2^k arithmetic operations, using the lowest k PC bits as the history-table index (ModPCk)
  [Bar chart: average thread misprediction rate when remembering 2, 4, 8, or 16 PCs]

  28. Exploring Carry Speculation Mechanisms
  Spatio-temporal correlation (ModPCk): 57% reduction
  Impractical to implement
  [Bar chart: average thread misprediction rate, y-axis 0-40%]
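A ModPCk history table is tiny: per slice, it remembers the last carry produced by an add whose PC had the same low k bits. A behavioral sketch with k = 4 (i.e., remembering 16 PCs); the type and function names here are my own, not the paper's:

```c
#include <stdint.h>
#include <string.h>

#define K      4        /* index by the low K PC bits: 2^K = 16 entries */
#define SLICES 8

/* ModPCk predictor: prediction for a slice is simply the carry that
   slice produced the last time an add with the same low K PC bits ran.
   PCs that share low bits alias into the same entry. */
typedef struct { uint8_t carry[1 << K][SLICES]; } modpc_table;

void modpc_init(modpc_table *t) { memset(t, 0, sizeof *t); }

unsigned modpc_predict(const modpc_table *t, uint32_t pc, int slice) {
    return t->carry[pc & ((1 << K) - 1)][slice];
}

void modpc_update(modpc_table *t, uint32_t pc, int slice, unsigned cout) {
    t->carry[pc & ((1 << K) - 1)][slice] = (uint8_t)cout;
}
```

What makes this impractical at GPU scale is not the table itself but keeping one per thread, which is exactly the problem the next two slides address.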

  29. Exploring Carry Speculation Mechanisms
  Use the global thread ID to keep per-thread history: low performance and challenging to implement (per-thread bookkeeping for 2048 threads / SM)
  [Bar chart: average thread misprediction rate, y-axis 0-40%]

  30. Exploring Carry Speculation Mechanisms
  Keep per-thread history within a warp (use the local thread ID) and share it across warps; sharing history across warps increases accuracy
  67% reduction: this is ST2
  [Bar chart: average thread misprediction rate, y-axis 0-40%]
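The winning scheme indexes history by lane (local thread ID, 0-31) rather than by global thread ID, so all warps on an SM share one small structure. A sketch of that carry register file, sized as on the later CSU and area slides (16 entries x 32 lanes x 7 carries = 3,584 bits = 448 bytes per SM); the `carry_rf` type and function names are illustrative:

```c
#include <stdint.h>
#include <string.h>

#define PC_ENTRIES 16   /* indexed by PC[3:0]                        */
#define LANES      32   /* one history per lane (local thread id)    */
#define CARRIES     7   /* speculated carries between the 8 slices   */

/* Shared-thread carry register file: history is kept per *lane*, not
   per thread, so every warp on the SM reads and writes the same table. */
typedef struct { uint8_t c[PC_ENTRIES][LANES][CARRIES]; } carry_rf;

void crf_init(carry_rf *rf) { memset(rf, 0, sizeof *rf); }

/* Read the 7 predicted carries for one lane of the issuing warp. */
void crf_predict(const carry_rf *rf, uint32_t pc, int lane,
                 uint8_t out[CARRIES]) {
    memcpy(out, rf->c[pc & 0xF][lane], CARRIES);
}

/* After the add completes, write back the carries it really produced. */
void crf_update(carry_rf *rf, uint32_t pc, int lane,
                const uint8_t cout[CARRIES]) {
    memcpy(rf->c[pc & 0xF][lane], cout, CARRIES);
}
```

Sharing by lane exploits the observation from slides 17-18: threads in the same lane position tend to compute correlated values across warps, so one lane's history predicts well for all of them.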

  31. GPU SM Pipeline Design
  [Block diagram: baseline SM pipeline: Fetch (Instruction Cache) -> Decode (Instruction Buffer) -> Issue (Warp Scheduler, Scoreboard with Ready/Release) -> Register Read (Register File - Read) -> Execute (Functional Units: ALUs; FPUs (FP32, FP64); SFUs, MULs, DIVs; LD/ST Units) -> Write-Back (Register File - Write)]

  32. ST2 GPU SM Pipeline Design
  [Block diagram: the baseline SM pipeline augmented with a Carry Register File - Read alongside register read, a Carry Register File - Write at write-back, stall signals from the ALUs and FPUs, and level-down/level-up voltage shifters around the Vdd-scaled execution units]

  33. ST2 GPU SM Pipeline Design
  [Same diagram as slide 32]

  34. ST2 Carry Speculation Unit Design
  [Diagram: Previous Carry History Table: 16 carry registers (Carry Register [0] through [15]) indexed by PC[3:0]; each register holds 224 bits, i.e., 7 speculated carries (one per slice boundary) for each of 32 threads. A mux selects the 224-bit prediction Cpred[223:0]: thread 31 reads Cpred[223:217], thread 30 reads Cpred[216:210], ..., thread 0 reads Cpred[6:0]]

  35. ST2 Carry Speculation Unit Design
  [Same diagram as slide 34]

  36. ST2 Adder Design
  [Diagram: ST2 adder for thread 0: eight 8-bit slices covering bits [7:0] through [63:56]. Slice 0 takes Cin (and SUB for subtraction); slices 1-7 take predicted carries Cpred[0] through Cpred[6] from the Previous Carry History Table, keep per-slice state bits S[1]-S[7], and raise error flags E[1]-E[7] on misprediction; the final carry-out is Cout]

  37. ST2 Adder Design
  [Same diagram as slide 36]

  38. ST2 Adder Design
  [Diagram: Slice 0, bits [7:0]: an input register supplies Op1[7:0] and Op2[7:0] to an 8-bit ADD/SUB (with Cin and SUB inputs), producing Out[7:0] into the output register and carry-out Cout[0]]

  39. ST2 Adder Design
  [Diagram: detail of Slice 5 (bits [47:40]) next to Slice 0: Slice 5 selects its carry-in Cin[5] via a mux between the predicted carry Cpred[4] and a carry derived from the operand MSbs Op1[39]/Op2[39]; its 8-bit ADD/SUB produces Out[47:40] and Cout[5]; a state DFF (S[5]) and a Cout DFF (with Clk and Reset) track the real carry and raise the error flag E[5] on misprediction]

  40. Methodology
  Synthesis: Verilog, Synopsys Design Compiler (SAED 90nm)
  Circuit simulations: Synopsys HSPICE (full analog)
  Baseline GPU architecture: NVIDIA TITAN V Volta (GV100)
  System power estimations: in-house GPUWattch-based power model validated against real HW (GV100)
  System performance simulations: GPGPU-Sim 3.x in PTX simulation mode
  Workloads: 9 from NVIDIA CUDA Samples, 6 from Rodinia, and 3 from Parboil

  41. Accurate Power Model, Validated Against Silicon
  Power model calibrated against GV100 with 123 benchmarks and a least-square-error solver
  Validated against real silicon using HW power measurements collected from chip sensors
  Average absolute relative error: 10.5%; Pearson r coefficient: 0.8 on the ST2 evaluation workloads
  [Scatter plot: modeled power vs. power measured on hardware, 0-220 W]

  42. ST2 GPU Saves GPU System Energy
  70% ALU+FPU energy savings on ALU+FPU-intensive workloads (those spending >20% of system energy in the ALU+FPU)
  [Bar chart: normalized system energy, Base vs. ST2 per kernel, broken into ALU+FPU and Others]
  21% GPU chip energy savings, 19% chip+DRAM savings

  43. We Don't Affect Performance
  [Bar chart: execution time of ST2 normalized to baseline, per kernel and on average, for the 23 evaluated kernels]

  44. Less Than 0.75% Area Overhead
  Storage: 50KB, i.e., 0.09% of on-chip caches and register file area
  - 448-byte CRF (16 x 224 bits) per SM: 35KB
  - 2 bits for state and Cout DFFs per slice: 15KB
  Voltage shifters: <0.68% of Volta GV100 chip area

  45. Conclusions
  Spatio-temporal value correlation
  Design-space exploration; winner: spatio-temporal history + history sharing + disciplined static predictions
  ST2 adders: speculative, guarantee correctness; save 75-87% of adder energy and outperform the state-of-the-art by 67%
  ST2 GPU: integrates ST2 adders into the GPU warp pipeline; reduces GPU chip energy consumption by 21% with practically no impact on performance and area

  46. Conclusions
  (Recap of slide 45)
  THANK YOU! Questions?
