Toggle-Aware Compression for GPU Systems
Data compression reduces bandwidth pressure, but it can also increase the energy cost of communication by raising the number of bit toggles. Toggle-Aware Compression addresses this with two mechanisms, Energy Control (EC) and Metadata Consolidation (MC), which cut the increase in bit toggles from 2.2X to about 1.1X while preserving most of the performance benefit. The presentation motivates the work with the energy demands of data-intensive applications such as memory caching, databases, and graphics, and with the gap between computation and communication costs. Exploiting redundancy in memory transfers through bandwidth compression raises effective bandwidth without adding wires or raising the frequency. Compression ratios measured on GPU applications, including 221 applications from NVIDIA, show that compression is effective. Compression usually brings higher effective capacity, higher effective bandwidth, and lower energy consumption, but the increased bit toggle count, the compression and decompression overhead, and the complexity of supporting variable-size data must be addressed.
Presentation Transcript
A Case for Toggle-Aware Compression for GPU Systems
Gennady Pekhimenko, Nandita Vijaykumar, Onur Mutlu, Todd C. Mowry, Evgeny Bolotin, Stephen W. Keckler
Executive Summary
- Data compression is a known technique to decrease bandwidth pressure.
- Observation: compression significantly increases the energy cost of communication by increasing the number of bit toggles (bit flips).
- Our approach: Toggle-Aware Compression
  - Energy Control (EC): send compressed data only when it is beneficial
  - Metadata Consolidation (MC): consolidate metadata bits to reduce the bit toggle count
- Key results: a 2.2X increase in bit toggles is reduced to only 1.1X, with most of the performance benefits preserved.
Performance and Energy Efficiency
- Applications today are data-intensive:
  - Memory caching
  - Databases
  - Graphics
Computation vs. Communication
- Modern memory systems are bandwidth constrained.
- Data movement is very costly:
  - Integer operation: ~1 pJ
  - Floating-point operation: ~20 pJ
  - Low-power memory access: ~1200 pJ
- Implication: transfer less data or keep data near the processing units.
Sources: Bill Dally (NVIDIA/Stanford), Kayvon Fatahalian (CMU)
Potential for Data Compression
- Significant redundancy in memory transfers, for example: 0x00000000 0x0000000B 0x00000003 0x00000004
- How can we exploit this redundancy? Bandwidth compression.
- Bandwidth compression provides a higher effective bandwidth without increasing the number of wires or raising the frequency.
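As a rough illustration of that redundancy, the sketch below checks how many bytes each of the four example words actually needs. It is not FPC, BDI, or any other algorithm named in the talk, only a narrow-value check. The names `words` and `min_bytes`, and the choice to ignore metadata, are assumptions made for this example.

```python
# The four 32-bit words shown on the slide.
words = [0x00000000, 0x0000000B, 0x00000003, 0x00000004]

def min_bytes(value: int) -> int:
    """Smallest number of whole bytes that can hold an unsigned value."""
    return max(1, (value.bit_length() + 7) // 8)

# Width needed if every word is stored with the same narrow size.
narrow_width = max(min_bytes(w) for w in words)

uncompressed_bytes = 4 * len(words)            # four 4-byte words
compressed_bytes = narrow_width * len(words)   # assumption: metadata ignored

print(narrow_width, uncompressed_bytes, compressed_bytes)
```

A real scheme also has to send metadata describing the encoding, so the achievable ratio is lower than this count suggests.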
Bandwidth Compression for GPUs
[Figure: compression ratio (y-axis, roughly 0.9 to 1.8) for FPC, BDI, BDI+FPC, Fibonacci, and C-Pack across three workload groups: Discrete and Mobile (221 apps from NVIDIA) and Open-Source (21 apps).]
- Compression is effective in providing higher bandwidth.
Common Wisdom about Compression
- Higher effective capacity
- Higher effective bandwidth
- (Usually) lower energy consumption
- Compression/decompression overhead
- Complexity and cost of supporting variable-size data
- A new problem: a significant increase in the bit toggle count (number of bit flips), despite fewer bits being sent
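What is a Bit Toggle?
- How energy is spent in data transfers: a wire that changes value between the previous data and the new data (a bit toggle) spends energy, Energy = C*V^2; a wire that keeps its value spends none.
- Example: when the 4-bit words 0011 and 0101 are sent back to back on the same wires, two of the four wires toggle.
- Energy of data transfers (e.g., NoC, DRAM) is proportional to the bit toggle count.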
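The definition above can be made concrete with a short sketch: the toggle count between two back-to-back transfers on the same wires is the number of 1 bits in their XOR. The function name `count_toggles` is hypothetical; the 4-bit values are the two shown on the slide.

```python
def count_toggles(previous: int, new: int) -> int:
    """Number of wires that change value between two consecutive transfers."""
    return bin(previous ^ new).count("1")

# The two 4-bit words from the slide, sent one after the other.
toggles = count_toggles(0b0011, 0b0101)
print(toggles)
```

Because XOR is symmetric, the count does not depend on which of the two words is sent first.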
Excessive Number of Bit Toggles
- Uncompressed cache line: 0x00003A00 0x8001D000 0x00003A01 0x8001D008
  Flit 0 XOR Flit 1: # toggles = 2
- Compressed cache line (FPC): 0x5 0x3A00 0x7 0x8001D000 0x5 0x3A01 0x7 0x8001D008
  Flit 0 XOR Flit 1: # toggles = 31
[Figure: bit-level view of the two flits in each case. In the compressed line the prefixes shift the data so that it no longer lines up across flits.]
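The effect on this slide can be sketched in a few lines. The helper names `pack` and `toggles` are hypothetical, and the packing is an assumption: 3-bit prefixes, fields written most-significant bit first into 8-byte flits, and zero padding at the end. Only the four words shown on the slide are used, so the compressed count printed here is not meant to reproduce the slide's 31.

```python
def toggles(flit_a: int, flit_b: int) -> int:
    """Bit toggles between two consecutive flits on the same wires."""
    return bin(flit_a ^ flit_b).count("1")

def pack(fields, flit_bits=64):
    """Concatenate (value, width_in_bits) fields MSB-first and cut into flits."""
    bits = "".join(format(value, f"0{width}b") for value, width in fields)
    bits += "0" * (-len(bits) % flit_bits)   # assumption: pad the tail with zeros
    return [int(bits[i:i + flit_bits], 2) for i in range(0, len(bits), flit_bits)]

line = [0x00003A00, 0x8001D000, 0x00003A01, 0x8001D008]

# Uncompressed: every 32-bit word is sent as is, two words per 8-byte flit.
raw_flits = pack([(word, 32) for word in line])

# FPC-like encoding as on the slide: prefix 0x5 + 16 data bits for the
# small words, prefix 0x7 + 32 data bits for the words sent in full.
fpc_fields = [(0x5, 3), (0x3A00, 16), (0x7, 3), (0x8001D000, 32),
              (0x5, 3), (0x3A01, 16), (0x7, 3), (0x8001D008, 32)]
fpc_flits = pack(fpc_fields)

raw_toggles = toggles(raw_flits[0], raw_flits[1])
fpc_toggles = toggles(fpc_flits[0], fpc_flits[1])
print(raw_toggles, fpc_toggles)
```

In the uncompressed line similar words sit on the same wires in consecutive flits, so only the low-order bits differ. After compression the 19-bit and 35-bit fields no longer align to flit boundaries, which is what drives the toggle count up.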
Effect of Compression on Bit Toggles
[Figure: normalized bit toggle count (y-axis, roughly 0.8 to 2.2) for FPC, BDI, BDI+FPC, Fibonacci, and C-Pack across the Discrete, Mobile, and Open-Source workload groups.]
- Compression significantly increases the bit toggle count.
Outline
- Motivation
- Key Observations
- Toggle-Aware Compression:
  - Energy Control (EC)
  - Metadata Consolidation (MC)
- Evaluation
- Conclusion
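Energy Control Decision Flow
[Figure: a cache line ($Line) is compressed, producing the compressed line (Comp. $Line) and its compression ratio (CR). A toggle counter produces the toggle counts T0 and T1 for the two candidate forms of the line. The EC decision takes T0, T1, CR, and the current bandwidth (BW) utilization, and a select stage then sends either the compressed or the uncompressed line.]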
How to Make the EC Decision?
- Candidate metrics:
  - Energy: battery life
  - Energy X Delay: balance performance and energy
  - Energy X Delay^2: fixed power with voltage scaling
- Estimates: Energy ~ toggle count, Delay ~ 1 / (compression ratio)
- When bandwidth utilization (BU) is high (>50%), use a 1 / (1 - BU) coefficient.
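A minimal sketch of such a decision function follows, under stated assumptions. The function name `send_compressed` and its parameters are hypothetical. It assumes T0 is the toggle count of the uncompressed line and T1 that of the compressed line, and it applies the 1 / (1 - BU) coefficient as a weight on the delay benefit of compression; the slide does not say exactly where the coefficient enters, so this is one possible reading rather than the authors' formula.

```python
def send_compressed(t0: int, t1: int, comp_ratio: float,
                    bw_util: float, delay_exp: int = 1) -> bool:
    """Return True when the compressed line has the lower Energy x Delay^n cost.

    t0, t1     -- toggle counts of the uncompressed and compressed line (assumed)
    comp_ratio -- compression ratio of the line (1.0 means no size reduction)
    bw_util    -- bandwidth utilization, in [0, 1)
    delay_exp  -- 1 for Energy x Delay, 2 for Energy x Delay^2
    """
    if not 0.0 <= bw_util < 1.0:
        raise ValueError("bw_util must be in [0, 1)")
    # Assumption: above 50% utilization, compression's delay benefit is
    # weighted by 1 / (1 - BU); below that, no extra weight is applied.
    weight = 1.0 / (1.0 - bw_util) if bw_util > 0.5 else 1.0
    # Energy ~ toggle count, Delay ~ 1 / compression ratio.
    cost_uncompressed = t0 * 1.0 ** delay_exp
    cost_compressed = t1 * (1.0 / (comp_ratio * weight)) ** delay_exp
    return cost_compressed < cost_uncompressed

# Same line, two bandwidth conditions: compression is throttled when the
# bus is mostly idle and allowed when the bus is heavily utilized.
print(send_compressed(10, 18, 1.5, bw_util=0.2))
print(send_compressed(10, 18, 1.5, bw_util=0.8))
```

With this reading, a line whose toggle count nearly doubles is sent uncompressed at low utilization, while the same line is sent compressed once bandwidth becomes the bottleneck.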
EC in the System
[Figure: an Energy Control block sits next to the compressor/decompressor at both ends of each link: between the streaming multiprocessor (L1D) and the LLC across the on-chip network (NoC), and between the LLC and DRAM across the off-chip bus.]
Energy Control Summary
- Compare the bit toggle count of compressed vs. uncompressed data.
- Use a heuristic (the Energy X Delay or Energy X Delay^2 metric) to estimate the trade-off.
- Take bandwidth utilization into account.
- Throttle compression when it is not beneficial.
Metadata Consolidation
- Compressed cache line with FPC, 4-byte flits: 0x5, 0x3A00, 0x5, 0x3A01, 0x5, 0x3A02, 0x5, 0x3A03
  # toggles = 18
- Toggle-aware FPC, all metadata consolidated: 0x3A00, 0x3A01, 0x3A02, 0x3A03, followed by the consolidated metadata 0x5 0x5 0x5 0x5
  # toggles = 2
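The two layouts can be compared with a short sketch. The helper names `pack` and `toggles` are hypothetical, and the packing is an assumption: 3-bit prefixes, 16-bit data fields, fields written most-significant bit first into 4-byte flits, zero padding at the end, and toggles counted between the first two flits. Under these assumptions the sketch gives 18 toggles for the interleaved layout and 2 for the consolidated one, the same counts as on the slide.

```python
def toggles(flit_a: int, flit_b: int) -> int:
    """Bit toggles between two consecutive flits on the same wires."""
    return bin(flit_a ^ flit_b).count("1")

def pack(fields, flit_bits=32):
    """Concatenate (value, width_in_bits) fields MSB-first and cut into flits."""
    bits = "".join(format(value, f"0{width}b") for value, width in fields)
    bits += "0" * (-len(bits) % flit_bits)   # assumption: pad the tail with zeros
    return [int(bits[i:i + flit_bits], 2) for i in range(0, len(bits), flit_bits)]

data = [0x3A00, 0x3A01, 0x3A02, 0x3A03]
PREFIX = (0x5, 3)   # 3-bit FPC prefix used for every word in this example

# Baseline FPC layout: each prefix travels right before its data word.
interleaved = pack([field for d in data for field in (PREFIX, (d, 16))])

# Toggle-aware layout: all data words first, all prefixes at the end.
consolidated = pack([(d, 16) for d in data] + [PREFIX] * len(data))

interleaved_toggles = toggles(interleaved[0], interleaved[1])
consolidated_toggles = toggles(consolidated[0], consolidated[1])
print(interleaved_toggles, consolidated_toggles)
```

Moving the prefixes out of the way restores the alignment of the 16-bit data words, so consecutive flits again differ only in their low-order bits. The flit that carries the consolidated metadata is not included in this comparison.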
Outline
- Motivation
- Key Observations
- Toggle-Aware Compression:
  - Energy Control (EC)
  - Metadata Consolidation (MC)
- Evaluation
- Conclusion
Methodology
- Simulator: GPGPU-Sim 3.2.x and an in-house simulator
- Workloads:
  - NVIDIA apps (discrete and mobile): 221 apps
  - Open-source (Lonestar, Rodinia, MapReduce): 21 apps
- System parameters (Fermi):
  - 15 SMs, 32 threads/warp
  - 48 warps/SM, 32768 registers, 32KB shared memory
  - Core: 1.4GHz, GTO scheduler, 2 schedulers/SM
  - Memory: 177.4GB/s BW, GDDR5
  - Cache: L1 16KB; L2 768KB
Effect of EC on Bit Toggle Count
[Figure: normalized bit toggle count (y-axis, roughly 0.8 to 2.2), without EC and with EC, for FPC, BDI, BDI+FPC, Fibonacci, and C-Pack across the Discrete, Mobile, and Open-Source workload groups.]
- EC significantly reduces the bit toggle count.
- It works for different compression algorithms.
Effect of EC on Compression Ratio
[Figure: compression ratio (y-axis, roughly 0.8 to 1.8), without EC and with EC, for FPC, BDI, BDI+FPC, Fibonacci, and C-Pack across the Discrete, Mobile, and Open-Source workload groups.]
- EC preserves most of the benefits of compression.
Bit Toggles for C-Pack Algorithm
[Figure: normalized bit toggle count (y-axis, 0 to 10), without EC and with EC, per application across the CUDA, lonestar, Mars, and rodinia groups, with the geometric mean (GeoMean).]
- Different tradeoffs for different applications.
DRAM Energy for C-Pack
[Figure: normalized DRAM energy (y-axis, 0 to 3), without EC and with EC, per application across the CUDA, lonestar, Mars, and rodinia groups, with the geometric mean (GeoMean).]
- 7% average DRAM energy reduction, up to 28% for TRA.
Effect of Metadata Consolidation (MC)
[Figure: normalized bit toggle count (y-axis, 0 to 7) with the FPC compression algorithm for four configurations (without EC, MC, EC, and MC+EC), per application across the CUDA, lonestar, Mars, and rodinia groups, with the geometric mean (GeoMean).]
- MC is effective in reducing the bit toggle count.
- But it is less effective than EC.
Other Results in the Paper
- On-chip interconnect results: bit toggles have a higher impact on interconnect energy, but a lower impact on overall energy.
- Data bus inversion (DBI): the benefits of EC and MC are independent of whether DBI encoding is used.
- Complexity estimation: energy and latency.
- Analysis of different EC decision functions: Energy X Delay vs. Energy X Delay^2.
Conclusion
- Data compression is a known technique to decrease bandwidth pressure.
- Observation: compression significantly increases the energy cost of communication by increasing the number of bit toggles (bit flips).
- Our approach: Toggle-Aware Compression
  - Energy Control (EC): send compressed data only when it is beneficial
  - Metadata Consolidation (MC): consolidate metadata bits to reduce the bit toggle count
- Key results: a 2.2X increase in bit toggles is reduced to only 1.1X, with most of the performance benefits preserved.