Exploring Overlay Architecture for Efficient Embedded Processing

This research explores an overlay architecture for embedded processing, aiming for high performance with minimal FPGA resource usage. It covers the motivations for using FPGAs in embedded systems, the challenge of balancing performance against resource utilization, and alternatives such as custom accelerators and soft processors. The focus is on overlay architectures: flexible, reusable designs that enable fast design cycles and efficient data processing. The discussion also includes the concept of hybrid vector-SIMD processing.



Presentation Transcript


  1. VENICE: A Soft Vector Processor. The University of British Columbia, VectorBlox. Aaron Severance, advised by Prof. Guy Lemieux; with Zhiduo Liu, Chris Chou, Jason Yu, Alex Brant, Maxime Perreault, Chris Eagleston.

  2. Motivation: FPGAs for embedded processing. FPGAs already exist in many embedded systems (glue logic, high-speed I/Os); ideally, data processing is done on-chip, reducing power, footprint, and bill of materials. Several goals must be balanced:
  - Performance (must be met; little or no bonus for exceeding it)
  - FPGA resource usage (more resources -> a larger FPGA is needed)
  - Human resource requirements and time (cost / time to market), including ease of programming and debugging
  - Flexibility and reusability, both short term (run-time) and long term

  3. Some options:
  - Custom accelerator: high performance and low device usage, but time-consuming to design and debug, and one hardware accelerator is needed per function.
  - Hard processor: a fixed number (currently one) of relatively simple processors, with a fixed balance for all applications (data parallelism/SIMD, ILP, MLP, speculation, etc.).
  - High-level synthesis.
  - Overlay architectures.

  4. Our focus: overlay architecture.
  - Start with a full data-parallel engine; prune unneeded functions later.
  - Fast design cycles (productivity): compile software, don't re-run synthesis/place & route.
  - Flexible and reusable; HLS techniques could be layered on top of it.
  - Restricted programming model: forces the programmer to write good code, and can be inferred from flexible C code where appropriate.

  5. Overlay options:
  - Soft processor: limited performance. Single-issue, in-order; a 2- or 4-way superscalar/VLIW register file maps inefficiently to the FPGA, and the CAMs needed for out-of-order execution are expensive to implement.
  - Multiprocessor-on-FPGA: complexity. Parallel programming and debugging, area overhead for the interconnect, cache coherence, and memory consistency.
  - Soft vector processor: a balance. Suits common embedded multimedia applications, with easily scalable data parallelism; just add more lanes (ALUs).

  6. Hybrid vector-SIMD. Example loop (the slide shows its element-by-element dataflow diagram):

    for( i = 0; i < 8; i++ ) {
        C[i] = A[i] + B[i];
        E[i] = C[i] * D[i];
    }
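
For comparison, the same loop maps naturally onto VENICE-style vector instructions, one per array operation regardless of vector length. The sketch below is illustrative only: vector_set_vl() appears verbatim on slide 15, while the vector() macro and the VVW/VADD/VMULLO mnemonics are assumed names following the pattern of that slide's vector_acc_2D(VVH, VMULLO, ...) call.

    // Sketch: the loop above as two vector instructions.
    // v_A..v_E are assumed pointers into the vector scratchpad.
    vector_set_vl( 8 );                    // 8 elements per instruction
    vector( VVW, VADD,   v_C, v_A, v_B ); // C[i] = A[i] + B[i]
    vector( VVW, VMULLO, v_E, v_C, v_D ); // E[i] = C[i] * D[i] (low word)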

  7. Previous SVP work.
  - 1st generation: VIPERS (UBC) and VESPA (UofT). Similar to the Cray-1 / VIRAM: vector data held in registers, with load/store instructions for memory accesses.
  - 2nd generation: VEGAS (UBC). Optimized for FPGAs: vector data in a flat scratchpad, scratchpad addresses stored in a separate register file, and a concurrent DMA engine.

  8. VEGAS (2nd gen) architecture. Vector core: VEGAS. Scalar core: Nios II/f. Concurrent execution with in-order dispatch and explicit syncing. The VEGAS DMA engine transfers data to/from external memory.

  9. VENICE architecture: 3rd generation, optimized for embedded multimedia applications; high frequency, low area.

  10. VENICE overview: Vector Extensions to NIOS Implemented Compactly and Elegantly.
  - Continues from VEGAS (2nd gen): scratchpad memory, asynchronous DMA transactions, sub-word SIMD (1x32-, 2x16-, and 4x8-bit operations).
  - Optimized for smaller implementations: VEGAS achieves its best performance/area at 4-8 lanes, vector programs don't scale indefinitely, and communication networks scale worse than O(N).
  - VENICE targets 1-4 lanes, at about 50-75% of the size of VEGAS. (A sketch of the resulting programming flow follows below.)
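
To make the scratchpad/DMA model concrete, the following is a minimal sketch of the typical flow: DMA data into the scratchpad, run vector instructions there, then DMA results back out. Only vector_set_vl() appears verbatim in this deck (slide 15); vector_malloc(), the vector_dma_* calls, the sync calls, and the vector() macro are assumed names patterned after that API style.

    // Minimal sketch of the VENICE programming flow (names assumed).
    int32_t *a, *b;   // source data in main memory, assumed initialized
    const int n = 1024;

    int32_t *v_a = vector_malloc( n * sizeof(int32_t) );  // scratchpad alloc
    int32_t *v_b = vector_malloc( n * sizeof(int32_t) );

    vector_dma_to_vector( v_a, a, n * sizeof(int32_t) );  // async copy in
    vector_dma_to_vector( v_b, b, n * sizeof(int32_t) );
    vector_wait_for_dma();                                // sync before compute

    vector_set_vl( n );                                   // vector length
    vector( VVW, VADD, v_a, v_a, v_b );                   // a[i] += b[i], 32-bit

    vector_instr_sync();                                  // wait for vector op
    vector_dma_to_host( a, v_a, n * sizeof(int32_t) );    // async copy out
    vector_wait_for_dma();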

  11. VENICE architecture (block diagram).

  12. Selected contributions: in-pipeline vector alignment network.
  - No need for separate move/element-shift instructions; increased performance for sliding windows; easier programming and compiling (data packing).
  - Only shifts/rotates vector elements; scales as O(N log N), so it is acceptable to use multiple networks for a few (1 to 4) lanes. (A sketch of the sliding-window usage this enables appears below.)
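
The practical payoff is that a misaligned operand can simply be addressed at an element offset into the scratchpad, and the in-pipeline network aligns it on the fly. Below is a minimal sketch for a 3-point sliding-window sum; as above, the vector() macro and mnemonics are assumed names patterned on the slide-15 API.

    // Sketch: 3-point sliding-window sum over n input elements.
    // v_in+1 and v_in+2 are misaligned by 1 and 2 elements, but no
    // separate element-shift instructions are needed on VENICE.
    vector_set_vl( n - 2 );
    vector( VVW, VADD, v_out, v_in,  v_in + 1 );  // in[i] + in[i+1]
    vector( VVW, VADD, v_out, v_out, v_in + 2 );  // ... + in[i+2]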

  13. Scratchpad alignment (diagram).

  14. Selected contributions: in-pipeline vector alignment network; 2D/3D vector operations.
  - Set the number of repeated operations and the increments on the sources and destination; increases the instruction dispatch rate for multimedia.
  - Replaces the incrementing/windowing of address registers in the 2nd generation, which had to be implemented in registers instead of BRAM because it modified 3 registers per cycle.

  15. VENICE programming: FIR using 2D vector ops.

    int num_taps, num_samples;
    int16_t *v_output, *v_coeffs, *v_input;

    // Set up 2D vector parameters:
    vector_set_vl( num_taps );                       // inner loop count
    vector_set_2D( num_samples,                      // outer loop count
                   1*sizeof(int16_t),                // dest gap: next output sample
                   (-num_taps  )*sizeof(int16_t),    // srcA gap: rewind coefficients
                   (-num_taps+1)*sizeof(int16_t) );  // srcB gap: slide input window by one
    // Execute instruction; does a 2D loop over the entire input,
    // multiplying and accumulating:
    vector_acc_2D( VVH, VMULLO, v_output, v_coeffs, v_input );

  16. Area breakdown. VENICE has lower control and overall area; the ICN (alignment network) scales faster but does not dominate.

  17. Selected contributions: in-pipeline vector alignment network; 2D/3D vector operations; single flag/condition code per instruction. Previous SVPs had separate flag registers/scratchpad; instead, VENICE encodes a flag bit with each byte/halfword/word, which can be stored in the 9th bit of 9/18/36-bit-wide BRAMs.

  18. Selected contributions: in-pipeline vector alignment network; 2D/3D vector operations; single flag/condition code per instruction; a more efficient hybrid multiplier implementation, supporting 1x32-bit, 2x16-bit, and 4x8-bit multiplies.

  19. Fracturable multipliers in Stratix III/IV DSPs (diagram of 9x9, 18x18, and 36x36 sub-multipliers and shifts producing the byte/halfword/word products). The VEGAS fracturable multiplier packs efficiently into 2 DSP blocks but requires extra logic; the VENICE fracturable multiplier saves ALMs by using no extra logic, at the cost of packing less efficiently into 3 DSP blocks.
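
As a conceptual model of what "fracturable" means, the C sketch below emulates the sub-word products a single 32-bit multiplier lane must produce in its 2x16 and 4x8 modes (keeping the low half of each product, as a VMULLO-style operation would). This is only an emulation for illustration, not the VENICE hardware design.

    #include <stdint.h>

    // 2x16-bit mode: two independent halfword products per 32-bit word.
    uint32_t mul_2x16_lo(uint32_t a, uint32_t b) {
        uint16_t a0 = (uint16_t)a,         b0 = (uint16_t)b;
        uint16_t a1 = (uint16_t)(a >> 16), b1 = (uint16_t)(b >> 16);
        uint32_t lo0 = (uint16_t)((uint32_t)a0 * b0);  // low half of a0*b0
        uint32_t lo1 = (uint16_t)((uint32_t)a1 * b1);  // low half of a1*b1
        return lo0 | (lo1 << 16);
    }

    // 4x8-bit mode: four independent byte products per 32-bit word.
    uint32_t mul_4x8_lo(uint32_t a, uint32_t b) {
        uint32_t r = 0;
        for (int i = 0; i < 4; i++) {
            uint8_t ai = (uint8_t)(a >> (8 * i));
            uint8_t bi = (uint8_t)(b >> (8 * i));
            r |= (uint32_t)(uint8_t)(ai * bi) << (8 * i);  // keep low byte
        }
        return r;
    }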

  20. Selected contributions: in-pipeline vector alignment network; 2D/3D vector operations; single flag/condition code per instruction; more efficient hybrid multiplier implementation; increased frequency (200 MHz+ vs ~125 MHz for previous SVPs), achieved through deeper pipelining and optimized circuit design.

  21. Table III: Benchmark performance and properties.

                                                      Performance (Melem/s)               Speedup vs. Nios II/f
    Benchmark   Origin   In/Out     Data set          Nios II/f    V1       V2       V4      V1     V2     V4
    autocor     EEMBC    halfword   1024 (16 taps)      0.46       5.94    11.11    18.94   12.9   24.2   41.2
    rgbcmyk     EEMBC    byte       896x606             4.56      17.68    21.41    22.72    3.9    4.7    5.0
    rgbyiq      EEMBC    byte       896x606             5.20       6.74    11.09    15.61    1.3    2.1    3.0
    imgblend    VIRAM    halfword   320x240             4.83      77.63   145.57   251.18   16.1   30.1   52.0
    filt3x3     VIRAM    byte       320x240             2.11      16.82    26.95    36.42    8.0   12.7   17.2
    median      custom   byte       128x21 (5x5)        0.10       0.74     1.45     2.69    7.3   14.4   26.6
    motest      custom   byte       32x32 (16x16)       0.09       2.37     4.18     6.29   27.4   48.2   72.4
    fir         custom   halfword   4096 (16 taps)      3.32      20.11    34.95    41.67    6.1   10.5   12.5
    matmul      custom   word       1024x1024          11.70     148.20   322.22   593.75   12.6   27.4   50.6
    geomean                                                                                 7.95   13.8   20.6

Figure 6. Speedup (geomean of 9 benchmarks) vs. area scaling: a single Nios II/f CPU, ideal scaling, VEGAS V1/V2/V4, and VENICE V1/V2/V4 (speedup relative to Nios II/f, area relative to Nios II/f in ALMs).

Comparing VEGAS and VENICE in a small V1 configuration: simple benchmarks such as rgbcmyk, rgbyiq, imgblend, and median achieve the smallest performance increase over VEGAS. These benchmarks have large vector lengths and no misaligned vectors, so the speedup comes mostly from the clock speed increase. Convolution benchmarks like fir and autocor benefit from the lack of misalignment penalty on VENICE. The 2D vectors accelerate autocor, motest, and fir. On matmul, using 3D vectors and the accumulators achieves 3.2x the performance of VEGAS.

  22. Computational density.

Figure 7. Computational density with V1 SVPs: speedup per ALM relative to Nios II/f (log scale) for VEGAS-V1 and VENICE-V1 across autocor, rgbcmyk, rgbyiq, imgblend, filt3x3, median, motest, fir, matmul, and the geomean.

For one application, rgbyiq, the computational density falls below 1.0 on VENICE, meaning Nios II/f is better; this is because the area overhead of 1.8x exceeds the speedup of 1.3x. The limited speedup is due to a combination of memory access patterns (r,g,b triplets) and wide intermediate data (32b) to prevent overflows. However, on average, VENICE-V1 offers 3.8x greater computational density than Nios II/f, and 2.3x greater density than VEGAS-V1. Comparing V4 configuration results (not shown), the computational density of VENICE is 5.2x, and that of VEGAS 2.7x, that of Nios.

Case study: DCT. VENICE was designed to exploit vector parallelism even when vectors are short. By remaining small, VENICE can be efficiently deployed in multiprocessor systems to exploit other forms of parallelism (e.g., thread-level) on top of the vector-level parallelism. Here, VENICE performs a 4x4 DCT with 16-bit elements on a total of 8192 different matrices. Each DCT is implemented as two matrix multiplies followed by shifts for normalization; vectors are short, limited to four halfwords for the matrix multiply. In Figure 8 (16-bit 4x4 DCT), the first set of bars shows the benefit of using 2D and 3D vector operations with a V2 VENICE.

  23. Future work:
  - Scatter/gather functionality: vector indexed memory accesses, which need coalescing to get good bandwidth utilization.
  - Automatic pruning of unneeded overlay functionality ("HLS in reverse").
  - Hybrid HLS overlays: start from the overlay and synthesize in extra functionality? Overlay + custom instructions?

  24. Conclusions. Soft vector processors offer scalable performance with no hardware recompiling necessary. VENICE is optimized for FPGAs at 1 to 4 lanes, delivering 5x the performance/area of Nios II/f on multimedia benchmarks.
