Overlay Architecture for Efficient Embedded Processing

 
 
VENICE: A Soft Vector Processor
Aaron Severance
Advised by Prof. Guy Lemieux

Zhiduo Liu, Chris Chou, Jason Yu, Alex Brant,
Maxime Perreault, Chris Eagleston

The University of British Columbia

1
 
Motivation
 
FPGAs for embedded processing
Already exist in many embedded systems
Glue logic, high-speed I/Os
Ideally, do data processing on-chip
Reduced power, footprint, bill of materials

Several goals to balance
Performance (must meet target; little or no bonus for exceeding it)
FPGA resource usage; more logic -> larger FPGA needed
Human resource requirements & time (cost / time to market)
Ease of programming and debugging
Flexibility & reusability
Both short term (run-time) and long term
 
 
2
 
Some Options:
 
Custom accelerator
High performance, low device usage
Time-consuming to design and debug
One hardware accelerator per function
Hard processor
Fixed number (currently one) of relatively simple processors
Fixed balance for all applications:
Data parallelism (SIMD)
ILP, MLP, speculation, etc.
High-level synthesis
Overlay architectures
 
3
 
HLS vs. Overlays
 
High-level synthesis (bottom-up)
Builds up from sequential code and makes it faster
Infers only the functions actually needed
Must re-run synthesis (minutes to hours) after an algorithm change

Overlay architectures (top-down)
Build everything up front, prune unneeded paths later
Difficult to prune down to the bare minimum of resources
Just recompile the program (seconds) after an algorithm change
 
4
 
Programming Model
 
Both HLS and overlays offer restricted or flexible options

Restricted
Fixed starting point (e.g., C with vector intrinsics)
Can the code be rewritten to fit the model?
Yes: good performance
No:  give up? (run on a soft/hard processor)
Flexible
Arbitrary starting point (e.g., sequential C code)
Can the code be rewritten to fit the possible architectures?
Yes: good performance
No:  poor performance
 
5
 
Our Focus
 
Overlay architecture
Start with full data parallel engine
Prune unneeded functions later
Fast design cycles (productivity)
Compile software, don’t re-run synthesis/place & route
Flexible, reusable
Could use HLS techniques on top of it…
 
Restricted programming model
Forces the programmer to write 'good' code
Restricted code can be inferred from flexible C code where appropriate
 
6
 
Overlay Options:
 
Soft processor: limited performance
Single issue, in-order
2- or 4-way superscalar/VLIW register files map inefficiently to FPGAs
Expensive to implement CAMs for out-of-order execution
Multiprocessor-on-FPGA: complexity
Parallel programming and debugging
Area overhead for interconnect
Cache coherence, memory consistency
Soft vector processor: balance
For common embedded multimedia applications
Easily scalable data parallelism; just add more lanes (ALUs)
 
7
 
Soft Vector Processor
 
Change algorithm -> same RTL, just recompile software
Simple programming model
Data-level parallelism, exposed to the programmer
One hardware accelerator supports many applications
Scalable performance and area
Write once, run anywhere...
Small FPGA: 1 ALU (smaller than Nios II/f)
Large FPGA: 10s to 100s of parallel ALUs
Parameterizable; remove unused functions to save area
 
8
 
Vector Processing
 
Organize data as long vectors
Replace inner loop with a vector instruction:

for( i=0; i<N; i++ )
  a[i] = b[i] * c[i]

set vl, N
vmult  a, b, c

Hardware loops until all elements are processed
May execute repeatedly (sequentially)
May process multiple elements at once (in parallel)

9
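For concreteness, here is the same loop as a complete function using VENICE-style C macros, the same family used in the FIR example later in this deck (vector_malloc, vector_set_vl, vector_acc_2D, etc.). Treat the header name and exact call signatures as a sketch of the programming model, not the definitive API:

#include <stdint.h>
#include "vector.h"   /* VENICE vector macros (assumed header name) */

void vec_multiply( int32_t *a, int32_t *b, int32_t *c, int N )
{
    const int nbytes = N * sizeof(int32_t);

    /* Vectors live in the scratchpad: any number, any length */
    int32_t *vb = (int32_t *) vector_malloc( nbytes );
    int32_t *vc = (int32_t *) vector_malloc( nbytes );

    /* Concurrent DMA engine moves data between main memory and scratchpad */
    vector_dma_to_vector( vb, b, nbytes );
    vector_dma_to_vector( vc, c, nbytes );
    vector_wait_for_dma();

    /* One vector instruction replaces the inner loop; the hardware
       iterates over all N elements, sequentially and/or in parallel */
    vector_set_vl( N );
    vector( VVW, VMULLO, vb, vb, vc );   /* word-wide multiply (low half) */

    /* Explicit sync, then DMA the result back to main memory */
    vector_instr_sync();
    vector_dma_to_host( a, vb, nbytes );
    vector_wait_for_dma();

    vector_free();
}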
Hybrid Vector-SIMD
for( i=0; i<8; i++ ) {
    C[i] = A[i] + B[i]
    E[i] = C[i] * D[i]
}

[Figure: elements 0-3 and 4-7 of vectors C and E processed as SIMD groups,
alternating the add (C) and multiply (E) operations]

10
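In the same macro style as the earlier sketch, the two-statement loop body maps to two back-to-back vector instructions over scratchpad vectors (vA through vE are hypothetical scratchpad pointers; VADD is an assumed opcode name, VMULLO appears in the FIR example later):

vector_set_vl( 8 );
vector( VVW, VADD,   vC, vA, vB );   /* C[i] = A[i] + B[i] */
vector( VVW, VMULLO, vE, vC, vD );   /* E[i] = C[i] * D[i] */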
Previous SVP Work
 
1st Gen: VIPERS (UBC) & VESPA (UofT)
Similar to Cray-1 / VIRAM
Vector data in registers
Load/store for memory accesses

2nd Gen: VEGAS (UBC)
Optimized for FPGAs
Vector data in flat scratchpad
Addresses in scratchpad stored in separate register file
Concurrent DMA engine
11
VEGAS (2nd Gen) Architecture

Scalar Core: Nios II/f
DMA Engine: to/from external memory
Vector Core: VEGAS
Concurrent execution: in-order dispatch, explicit syncing

12
 
VEGAS (2nd Gen):
Efficient On-Chip Memory Use
 
Scratchpad-based “register file” (2kB..2MB)
Vector address register file stores vector locations
Very long vectors (no maximum length)
Any number of vectors
(no set number of “vector data registers”)
Double-pumped to achieve 4 ports
2 Read, 1 Write, 1 DMA Read/Write
 
Lanes support sub-word SIMD
32-bit ALU configurable as 1x32, 2x16, or 4x8
Keeps data packed in the scratchpad for a larger working set (software analogy below)
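As the software analogy referenced above: a 4x8-bit packed add within one 32-bit word, with masking so carries never cross byte lanes. In VEGAS/VENICE the 32-bit ALU does this natively in hardware; the C model only shows why packed data needs no unpacking:

#include <stdint.h>

/* Software model of a 4x8 sub-word SIMD add (the SVP ALU does this natively). */
static uint32_t add_4x8( uint32_t a, uint32_t b )
{
    uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu); /* add low 7 bits per byte */
    uint32_t msb  = (a ^ b) & 0x80808080u;                  /* top bits, carry-free */
    return low7 ^ msb;                                      /* carries never cross bytes */
}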
 
13
 
VENICE Architecture
 
3rd Generation
Optimized for Embedded Multimedia Applications
High Frequency, Low Area
 
14
 
VENICE Overview
Vector Extensions to NIOS Implemented Compactly and Elegantly
 
Continues from VEGAS (2nd Gen)
Scratchpad memory
Asynchronous DMA transactions
Sub-word SIMD (1x32, 2x16, 4x8 bit operations)
 
Optimized for smaller implementations
VEGAS achieves best performance/area at 4-8 lanes
Vector programs don't scale indefinitely
Communication networks scale worse than O(N)
VENICE targets 1-4 lanes
About 50% .. 75% of the size of VEGAS
 
15
 
VENICE Architecture
 
16
 
Selected Contributions
 
In-pipeline vector alignment network
No need for separate move/element shift instructions
Increased performance for sliding windows
Ease of programming/compiling (data packing)
Only shift/rotate vector elements
Scales as O(N * log N); see the model below
Acceptable to use multiple networks for few (1 to 4) lanes
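The O(N * log N) figure comes from a barrel-rotate structure: log2(N) stages, each a row of N muxes that conditionally rotates by a power of two. A behavioral C model of that structure, for intuition only (not the RTL):

#include <stdint.h>

/* Behavioral model of a log-depth lane-rotate network.
   Stage k rotates by 2^k lanes when bit k of 'amount' is set, so
   N lanes need log2(N) stages of N muxes each: O(N log N) area. */
void rotate_lanes( uint32_t *lanes, int num_lanes, int amount )
{
    uint32_t tmp[64];                         /* assume num_lanes <= 64, power of 2 */
    for (int shift = 1; shift < num_lanes; shift <<= 1) {
        if (amount & shift) {                 /* is this stage enabled? */
            for (int i = 0; i < num_lanes; i++)
                tmp[i] = lanes[(i + shift) % num_lanes];
            for (int i = 0; i < num_lanes; i++)
                lanes[i] = tmp[i];
        }
    }
}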
 
17
 
Scratchpad Alignment
 
18
 
Selected Contributions
 
In-pipeline vector alignment network
2D/3D vector operations
Set # of repeated ops and the increments on sources and dest
Increased instruction dispatch rate for multimedia
Replaces the increment/window of address registers (2nd Gen),
which had to be implemented in registers instead of BRAM
(modifying 3 registers per cycle)
 
19
 
VENICE Programming
FIR using 2D Vector Ops
 
int num_taps, num_samples;
int16_t *v_output, *v_coeffs, *v_input;

// Set up 2D vector parameters:
vector_set_vl( num_taps );                      // inner loop count
vector_set_2D( num_samples,                     // outer loop count
               1*sizeof(int16_t),               // dest gap
               (-num_taps  )*sizeof(int16_t),   // srcA gap
               (-num_taps+1)*sizeof(int16_t) ); // srcB gap

// Execute one instruction; 2D loop over the entire input,
// multiplying and accumulating:
vector_acc_2D( VVH, VMULLO, v_output, v_coeffs, v_input );
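For reference, a scalar-C sketch of what that single vector_acc_2D instruction computes, given the gap values above (assuming v_input holds at least num_samples + num_taps - 1 samples; how the wide accumulator is narrowed on writeback is glossed over here):

for (int i = 0; i < num_samples; i++) {      // outer loop: vector_set_2D count
    int32_t acc = 0;
    for (int j = 0; j < num_taps; j++)       // inner loop: vector length
        acc += v_coeffs[j] * v_input[i + j]; // srcA gap rewinds coeffs; srcB gap nets +1
    v_output[i] = (int16_t) acc;             // dest gap advances one halfword
}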
 
20
 
Area Breakdown
 
VENICE: Lower control & overall area
ICN (alignment) scales faster but does not dominate
 
21
 
Selected Contributions
 
In-pipeline vector alignment network
2D/3D vector operations
Single flag/condition code per instruction
Previous SVPs had separate flag registers/scratchpad
Instead, encode a flag bit with each byte/halfword/word
Can be stored in the 9th bit of 9/18/36-bit wide BRAMs
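A small C model of the idea: the flag travels with each element (in hardware, the 9th BRAM bit per byte), so predicated operations need no separate flag register file. The struct and the conditional-move operation below are hypothetical illustrations, not the VENICE ISA:

#include <stdint.h>

/* One scratchpad byte element plus its BRAM parity bit used as a flag. */
typedef struct { uint8_t data; uint8_t flag; } elem9;   /* flag = the "9th bit" */

/* Example predicated op: copy src elements whose flag is set. */
void vcmov_bytes( elem9 *dst, const elem9 *src, int vl )
{
    for (int i = 0; i < vl; i++)
        if (src[i].flag)            /* per-element condition travels with the data */
            dst[i].data = src[i].data;
}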
 
22
 
Selected Contributions
 
In-pipeline vector alignment network
2D/3D vector operations
Single flag/condition code per instruction
More efficient hybrid multiplier implementation
To support 1x32-bit, 2x16-bit, 4x8-bit multiplies
 
23
 
Fracturable Multipliers
in Stratix III/IV DSPs
 
[Figure: two fracturable multiplier implementations mapped onto Stratix III/IV DSP blocks]
2 DSP Blocks + extra logic
2 DSP Blocks (no extra logic)
 
24
 
Selected Contributions
 
In-pipeline vector alignment network
2D/3D vector operations
Single flag/condition code per instruction
More efficient hybrid multiplier implementation
Increased frequency
200MHz+ vs ~125MHz of previous SVPs
Deeper pipelining
Optimized circuit design
 
25
 
Average Speedup vs. ALMs

[Figure 6: speedup (geomean of 9 benchmarks, relative to Nios II/f) vs. area
(ALM count, relative to Nios II/f): single Nios II/f, ideal Nios II/f scaling,
and VEGAS / VENICE at V1, V2, V4]

Benchmark performance and properties (Table III):

Benchmark   Origin   In/Out     Data set     Nios II/f      Speedup
                                             (Melem/s)    V1     V2     V4
autocor     EEMBC    halfword   1024         0.46        12.9   24.2   41.2
rgbcmyk     EEMBC    byte       896x606      4.56         3.9    4.7    5.0
rgbyiq      EEMBC    byte       896x606      5.20         1.3    2.1    3.0
imgblend    VIRAM    halfword   320x240      4.83        16.1   30.1   52.0
filt3x3     VIRAM    byte       320x240      2.11         8.0   12.7   17.2
median      custom   byte       128x21       0.10         7.3   14.4   26.6
motest      custom   byte       32x32        0.09        27.4   48.2   72.4
fir         custom   halfword   4096         3.32         6.1   10.5   12.5
matmul      custom   word       1024x1024    11.7        12.6   27.4   50.6
geomean                                                   7.95  13.8   20.6

26
 
Computational Density

[Figure 7: computational density (speedup per ALM, relative to Nios II/f, log
scale) for VEGAS-V1 vs. VENICE-V1, per benchmark plus geomean]

Simple benchmarks (rgbcmyk, rgbyiq, imgblend, median) show the smallest gain
over VEGAS: they have long vectors and no misaligned vectors, so the speedup
comes mostly from the higher clock rate. Convolution benchmarks (fir, autocor)
benefit from VENICE's lack of a misalignment penalty; 2D vectors accelerate
autocor, motest, and fir. On matmul, 3D vectors plus the accumulators reach
3.2x the performance of VEGAS.

For one application, rgbyiq, computational density falls below 1.0 on VENICE,
meaning Nios II/f is better: the 1.8x area overhead exceeds the 1.3x speedup
(memory access patterns over r,g,b triplets and wide 32b intermediates to
prevent overflow). On average, VENICE-V1 offers 3.8x the computational density
of Nios II/f and 2.3x that of VEGAS-V1; at V4, VENICE reaches 5.2x vs. 2.7x
for VEGAS (relative to Nios).

27
Speeding up the 4x4 DCT (16-bit)

VENICE exploits vector parallelism even when vectors are short, and by staying
small it can be deployed in multiprocessor systems to exploit other forms of
parallelism (e.g., thread-level) on top of vector-level parallelism.

Case study: 4x4 DCT with 16-bit elements over 8192 matrices. Each DCT is two
matrix multiplies followed by shifts for normalization; vectors are limited to
four halfwords for the matrix multiply.
[Figure 8: the first set of bars shows the benefit of 2D/3D vector operations
on a V2 VENICE]

28
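In scalar C, one such block is two 4x4 matrix multiplies with a normalizing shift, Y = (M * X * M^T) >> s. The coefficient matrix M and shift s below are placeholders, not the benchmark's exact values:

#include <stdint.h>

/* One 4x4 DCT block: Y = (M * X * M^T) >> s  (M and s are placeholders). */
void dct4x4( int16_t Y[4][4], const int16_t X[4][4],
             const int16_t M[4][4], int s )
{
    int32_t T[4][4];
    for (int i = 0; i < 4; i++)              /* T = M * X */
        for (int j = 0; j < 4; j++) {
            int32_t acc = 0;
            for (int k = 0; k < 4; k++)
                acc += M[i][k] * X[k][j];
            T[i][j] = acc;
        }
    for (int i = 0; i < 4; i++)              /* Y = (T * M^T) >> s */
        for (int j = 0; j < 4; j++) {
            int32_t acc = 0;
            for (int k = 0; k < 4; k++)
                acc += T[i][k] * M[j][k];    /* M^T indexing */
            Y[i][j] = (int16_t)(acc >> s);
        }
}

On VENICE the same computation runs as 2D/3D vector operations across all 8192 blocks, even though each vector is only four halfwords long.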
 
Future Work
 
Scatter/gather functionality
Vector indexed memory accesses
Need to coalesce to get good bandwidth utilization
 
Automatic pruning of unneeded overlay functionality
"HLS in reverse"
Hybrid HLS + overlays
Start from the overlay and synthesize in extra functionality?
Overlay + custom instructions?
 
29
 
Conclusions
 
Soft Vector Processors
Scalable performance
No hardware recompilation necessary
VENICE
Optimized for FPGAs, 1 to 4 lanes
5X performance/area of Nios II/f on multimedia benchmarks
 
30