
Tailoring SIMD Execution Using Heterogeneous Hardware
This presentation covers the convergence of functionalities in mobile devices and the challenges facing current mobile solutions, and it motivates a unified accelerator for scientific computing, wireless communication, and image processing. It reviews traditional homogeneous SIMD designs and examines their low resource utilization as an opportunity to improve efficiency.
Presentation Transcript
Libra: Tailoring SIMD Execution using Heterogeneous Hardware and Dynamic Configurability. Yongjun Park (1), Jason Jong Kyu Park (1), Hyunchul Park (2), and Scott Mahlke (1). December 3, 2012. (1) University of Michigan, Ann Arbor, Electrical Engineering and Computer Science; (2) Programming Systems Lab, Intel Labs, Santa Clara, CA.
Convergence of Functionalities. [Figure: anatomy of an iPhone 4, with 4G wireless, audio, video, 3D, and navigation functions pointing to a flexible accelerator.] Convergence of functionalities demands a flexible solution because of design cost and programmability.
Current Mobile Solutions & Challenges. ILP-based processors (1.6 GHz ARM Cortex-A9, 1.7 GHz Krait, 1.6 GHz ARM Cortex-A9) are good for ILP; DLP-based processors (ULP GeForce, Adreno 320, ARM Mali-400 MP4) are good for DLP. Workloads named on the slide: legacy workloads, media processing, and web browsing. Scientific computing, wireless communication, and image processing show a mixture of ILP and DLP. Goal: design a unified accelerator with (1) scalability, (2) flexible execution support, and (3) energy efficiency.
Traditional Homogeneous SIMD. The standard high-performance machine for embedded systems. Industry: IBM Cell, ARM NEON, Intel MIC, etc. Research: SODA, AnySP, etc. Advantages: high throughput, low fetch-decode overhead, easy to scale. Disadvantage: hard to realize high resource utilization. Example SIMD machine: 100 MOps/mW. Advanced goal: map a broader range of applications onto SIMD.
Exploration of Low Resource Utilization. [Figure: AAC decoder application flow from input to output through Huffman decoding, inverse quantization, and IMDCT, with stages labeled non-DLP, low-DLP, and high-DLP.] [Figure: execution time breakdown at a 1-issue in-order core for Vision, Media, Game Physics, and Avg; y-axis: execution time ratio; legends: loop vs. acyclic, and high-DLP, low-DLP, non-DLP.] High-data-parallel loops take a high share of execution time (~80%). A traditional wide SIMD accelerator is frequently over-designed. Performance is limited by the non-high-DLP loops.
Additional Flexibility on SIMD. [Figure: program flow of a non-DLP loop, a DLP loop, and another non-DLP loop; the hardware, drawn as control units, RFs, and FUs, operates either as SIMD or as a distributed VLIW.]
Additional Flexibility on SIMD (Libra vs. traditional SIMD). [Figure: 16 PEs, numbered 0 to 15, running loops with DLP of 1, 2, 4, 8, and 16.] Traditional SIMD: ILP is fixed at 1, so the total parallelism equals the loop's DLP: 1, 2, 4, 8, or 16. Libra: DLP = 1 with ILP = 16, DLP = 2 with ILP = 8, DLP = 4 with ILP = 4, DLP = 8 with ILP = 2, and DLP = 16 with ILP = 1, so the total is 16 in every case. Modes: full ILP mode, hybrid mode, full DLP mode. Each logical lane has its own ILP capability. The ILP capability is decided based on the SIMD capability. The total degree of parallelism is consistent. All resources are utilized.
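The DLP/ILP trade-off on the slide above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function names `libra_modes` and `simd_parallelism` are hypothetical, and the sketch assumes a power-of-two number of PEs with logical lane counts that are also powers of two, as in the 16-PE example.

```python
def libra_modes(total_pes=16):
    """Enumerate (DLP, ILP) pairs whose product is total_pes.

    Assumption: the number of logical lanes (DLP) is a power of two and
    each logical lane is built from total_pes // DLP PEs (its ILP).
    """
    modes = []
    dlp = 1
    while dlp <= total_pes:
        modes.append((dlp, total_pes // dlp))
        dlp *= 2
    return modes


def simd_parallelism(loop_dlp, total_pes=16):
    """Traditional SIMD: ILP is fixed at 1 per lane, so only loop_dlp
    lanes (capped at the machine width) do useful work."""
    return min(loop_dlp, total_pes)


for dlp, ilp in libra_modes(16):
    print(f"DLP={dlp:2d}  Libra: ILP={ilp:2d} total={dlp * ilp:2d}  "
          f"SIMD: total={simd_parallelism(dlp):2d}")
```

In this model the Libra total stays at 16 for every mode, while the traditional SIMD total follows the loop's DLP.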
Looks Good, but Too Expensive! [Figure: eight lanes, each with its own control, RF, and FU.]
Opportunity: Resource Utilization. [Figure: loop distribution over the static ratio of multiply and memory instructions; x-axis: memory ratio (0 to 1), y-axis: multiply ratio (0 to 1); annotation: small fraction of mul/mem instructions.] Resource over-provision: lane uniformity incurs inefficiency. Each SIMD lane provides the same functionality, yet memory and multiplication account for only 32% and 16% of total dynamic instructions. The result is a more complex design and more static power consumption. High variation in the resource requirements of loops: simple sharing leads to performance degradation.
Adapting Heterogeneity (Homogeneous SIMD). Loop: high DLP, 1 multiplication (operations A0, A1, A2 are ADDs; M3 is a Mul). 4-way SIMD with 4 multipliers: IPC = 4. [Figure: cycle-by-lane schedule in which lanes 0 to 3 each execute A0, A1, A2, and M3.]
Adapting Heterogeneity (Heterogeneous SIMD). Loop: high DLP, 1 multiplication. 4-way SIMD with 1 multiplier: IPC = 2.29. [Figure: cycle-by-lane schedule in which lanes 0 to 3 execute A0, A1, and A2, then stall while the M3 operations take turns on the single multiplier.]
Adapting Heterogeneity (Heterogeneous SIMD + Flexibility). Loop: high DLP, 1 multiplication. 4-way SIMD with 1 multiplier: IPC = 4. [Figure: cycle-by-lane schedule in which lanes 0 to 3 are grouped as logical lane 0; the M3, A2, A1, and A0 operations each appear four times, with no stall.]
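The three IPC figures on the "Adapting Heterogeneity" slides (4, 2.29, and 4) can be reproduced with a simple cycle count. The sketch below is a back-of-the-envelope model, not the authors' simulator: it assumes four loop iterations of three adds plus one multiply, single-cycle operations, and hypothetical helper names. The flexible case is modeled only as a resource bound (no lane is forced to wait in lockstep), which is an assumption about how the schedule is formed.

```python
from math import ceil


def lockstep_cycles(lanes, adds, muls, mul_units):
    """Lockstep SIMD: adds issue on all lanes at once; each multiply step
    is serialized over ceil(lanes / mul_units) cycles while the other
    lanes stall."""
    return adds + muls * ceil(lanes / mul_units)


def resource_bound_cycles(lanes, adds, muls, mul_units):
    """Flexible execution (assumed model): the cycle count is bounded only
    by total issue slots and by the multiplier throughput."""
    total_ops = lanes * (adds + muls)
    total_muls = lanes * muls
    return max(ceil(total_ops / lanes), ceil(total_muls / mul_units))


def ipc(lanes, adds, muls, cycles):
    return lanes * (adds + muls) / cycles


LANES, ADDS, MULS = 4, 3, 1
homogeneous = lockstep_cycles(LANES, ADDS, MULS, mul_units=4)
heterogeneous = lockstep_cycles(LANES, ADDS, MULS, mul_units=1)
flexible = resource_bound_cycles(LANES, ADDS, MULS, mul_units=1)

for name, cycles in [("homogeneous SIMD", homogeneous),
                     ("heterogeneous SIMD", heterogeneous),
                     ("heterogeneous + flexibility", flexible)]:
    print(f"{name}: {cycles} cycles, IPC = {ipc(LANES, ADDS, MULS, cycles):.2f}")
```

Under these assumptions the heterogeneous lockstep case needs 7 cycles for 16 operations, which matches the slide's IPC of 2.29.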
Libra: Loop-adaptive SIMD Accelerator. [Figure: an application's loops mapped onto traditional SIMD and heterogeneous SIMD lanes built from Int units and expensive units, with logical lane groupings shown for high-DLP loops, low/no-DLP loops, and ExOp-intensive loops.] Region-adaptive execution strategy customization. Key insights: a heterogeneous lane structure gives less power and area; dynamic configurability changes the ILP/DLP capability. The number of logical lanes sets the DLP; the size of a logical lane sets the ILP.
Libra Hardware Implementation. [Figure: block diagram with an instruction cache, SIMD controller, thread controller, PE groups each with a loop configuration buffer, FUs and RFs (FU 0 to FU 31, RF 0 to RF 31), intra-group and inter-group configurable interconnects, a swizzle network, a crossbar, and memory banks.] PE group n holds four FU/RF pairs: FU 4n (Int), FU 4n+1 (Int+Mem), FU 4n+2 (Int+Mul), and FU 4n+3 (Int). So: (1) integer ALUs in all 4 FUs; (2) one multiplier and one memory unit per PE group. Intra-group interconnect: a dense 4x8 full crossbar between FUs, without writeback. Inter-group interconnect: each FU is connected only to the corresponding neighbors in adjacent PE groups, i.e., FU 4n+k to FU 4(n-1)+k and FU 4(n+1)+k. The loop configuration buffer provides a loop schedule. The design is fully distributed, including FUs, register files, and interconnections. No dynamic routing logic: all communication is statically generated.
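The FU numbering on the hardware slide (4n: Int, 4n+1: Int+Mem, 4n+2: Int+Mul, 4n+3: Int) maps directly to index arithmetic. The following Python sketch shows that mapping; the function names are hypothetical, 32 FUs are assumed from the FU 0 to FU 31 labels, and the inter-group links are assumed not to wrap around at the ends, which the transcript does not state.

```python
def pe_group(fu_index):
    """PE group that owns this FU (four FUs per group)."""
    return fu_index // 4


def fu_capabilities(fu_index):
    """Capabilities by position inside the group, per the slide labels:
    slot 0: Int, slot 1: Int+Mem, slot 2: Int+Mul, slot 3: Int."""
    slot = fu_index % 4
    caps = {"int"}
    if slot == 1:
        caps.add("mem")
    elif slot == 2:
        caps.add("mul")
    return caps


def inter_group_neighbors(fu_index, num_fus=32):
    """FU 4n+k links to FU 4(n-1)+k and FU 4(n+1)+k.

    Assumption: no wrap-around between the first and last groups.
    """
    candidates = (fu_index - 4, fu_index + 4)
    return [i for i in candidates if 0 <= i < num_fus]


for fu in (0, 5, 6, 31):
    print(fu, pe_group(fu), sorted(fu_capabilities(fu)), inter_group_neighbors(fu))
```

With 32 FUs this gives eight multipliers and eight memory units in total, one of each per PE group.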
Resource Sharing @ Full DLP Mode. [Figure: cycle-by-PE schedule (PE0 to PE3, cycles 1 to 10) of operations A, B, C, and D for logical lanes 0 and 1, next to a small dataflow graph with ADD and Mul operations.] 2-wide transfer and data bypass. Simple hardware sharing. The logical lanes execute with a 1-cycle offset to avoid resource contention.
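The 1-cycle offset between logical lanes can be illustrated with a small contention check. This is a toy model under stated assumptions, not the scheduler from the talk: both logical lanes run the same per-cycle operation list, they share a single multiplier, every operation takes one cycle, and the function name and the example schedule are made up for illustration.

```python
def shared_unit_conflicts(schedule, lanes, stagger, shared_op="mul"):
    """Return {cycle: [lanes]} for cycles where more than one lane
    needs the shared unit.

    schedule: operation name issued by one lane at each cycle.
    stagger:  lane i starts i * stagger cycles after lane 0.
    """
    usage = {}
    for lane in range(lanes):
        for cycle, op in enumerate(schedule):
            if op == shared_op:
                usage.setdefault(cycle + lane * stagger, []).append(lane)
    return {c: ls for c, ls in usage.items() if len(ls) > 1}


# Hypothetical loop body: two adds, one multiply, one add.
body = ["add", "add", "mul", "add"]

print(shared_unit_conflicts(body, lanes=2, stagger=0))  # both lanes hit the multiplier together
print(shared_unit_conflicts(body, lanes=2, stagger=1))  # offset removes the collision
```

In this toy case a one-cycle offset is enough because the multiply occupies a single cycle; back-to-back multiplies in the loop body would need a larger offset or a different schedule.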
Compilation Overview. Inputs: generic C program, hardware information, profile information. Steps shown in the flow: compiler front-end; determine SIMDizability; classify the loop; set SIMD mode; resource allocation; modulo scheduling; list scheduling with multi-threading; set ILP mode; code generation. Output: executable.
Experimental Setup. Target applications: vision applications from SD-VBS [Venkata, IISWC '09]; media benchmarks: AAC decoder, H.264 decoder, and 3D rendering; game physics benchmarks: line of sight, convolution, and conjugate. Target architectures: SIMD, clustered VLIW, and Libra, with 16 to 64 heterogeneous/homogeneous resources. Tools: IMPACT frontend compiler and a cycle-accurate simulator. Power measurement: IBM SOI 45nm technology at 500 MHz / 0.81 V.
Performance with Heterogeneous Hardware. [Figure: "Performance @ 32 heterogeneous datapath"; y-axis: normalized execution time; bars for SIMD, VLIW, and Libra on disparity, local, stitch, svm, tracking (Vision); AAC, 3D, H.264 (Media); lineOfS, Conv, Conj (Game Physics); and Avg; each bar split into SWPable (non-SIMDizable) and SIMDizable.] Libra is 2.04x faster than heterogeneous SIMD and 1.38x faster than heterogeneous VLIW.
Scalability with Heterogeneous Hardware. [Figure: normalized performance of SIMD, VLIW, and Libra at 16, 32, and 64 resources for Vision, Media, Game, and Average.] Libra scales when the application has enough total ILP/DLP parallelism.
Homogeneous SIMD vs. Heterogeneous Libra. [Figure, three panels: performance (relative performance vs. # of PEs: 16, 32, 64); energy consumption (relative energy vs. # of PEs: 16, 32, 64); power breakdown @ 32-PE for homogeneous SIMD and heterogeneous Libra (relative power; components: FU, RF, Control, D-mem, Extra), annotated with (+) control power overhead and (-) FU power saving.] Libra performs better than SIMD. Energy consumption shows a similar trend. Less expensive functional units can reduce the overall power overhead, e.g., a total power overhead of 11% at 32 PEs.
Mode Selection. [Figure: distribution of loop execution modes; y-axis: execution time ratio (0% to 100%); legend: logical lane size (2, 4, 8, 16, 32, 64); x-axis groups: 16, 32, and 64 for Vision, Media, Game Physics, and Avg.] All available modes are used for a considerable fraction of execution time. The mode is selected based on application characteristics.
Conclusion. Mobile applications consist of loops with a wide range of ILP and DLP levels. A heterogeneous SIMD lane structure can reduce the power overhead of over-provisioned resources. Dynamic configurability enables broader applicability. On 32-PE architectures, Libra outperforms traditional SIMD by 1.58x while consuming 29% less energy.
Questions? For more information: http://cccp.eecs.umich.edu