Overlay Architecture for Efficient Embedded Processing

 
 
VENICE: A Soft Vector Processor
Aaron Severance
Advised by Prof. Guy Lemieux

Zhiduo Liu, Chris Chou, Jason Yu, Alex Brant,
Maxime Perreault, Chris Eagleston

The University of British Columbia

1
 
Motivation
 
FPGAs for embedded processing
Already exist in many embedded systems
Glue logic, high-speed I/Os
Ideally, do data processing on-chip
Reduced power, footprint, bill of materials

Several goals to balance
Performance (must meet target; little or no bonus for exceeding it)
FPGA resource usage; more logic -> larger FPGA needed
Human resource requirements & time (cost / time to market)
Ease of programming and debugging
Flexibility & reusability
Both short term (run-time) and long term
 
 
2
 
Some Options:
 
Custom accelerator
High performance, low device usage
Time-consuming to design and debug
One hardware accelerator per function
Hard processor
Fixed number (currently one) of relatively simple processors
Fixed balance for all applications:
Data parallelism (SIMD)
ILP, MLP, speculation, etc.
High-level synthesis
Overlay architectures
 
3
 
HLS vs. Overlays
 
High-level synthesis (bottom-up)
Builds up from sequential code and makes it faster
Infers only the functions actually needed
Must re-run synthesis (minutes to hours) after an algorithm change

Overlay architectures (top-down)
Build everything up front, prune unneeded paths later
Difficult to prune down to the bare minimum of resources
Just recompile the program (seconds) after an algorithm change
 
4
 
Programming Model
 
Both HLS and overlays offer restricted or flexible options

Restricted
Fixed starting point (e.g., C with vector intrinsics)
Can the code be rewritten to fit the model?
Yes: good performance
No:  give up? (run on a soft/hard processor)
Flexible
Arbitrary starting point (e.g., sequential C code)
Can the code be rewritten to fit the possible architectures?
Yes: good performance
No:  poor performance
 
5
 
Our Focus
 
Overlay architecture
Start with full data parallel engine
Prune unneeded functions later
Fast design cycles (productivity)
Compile software, don’t re-run synthesis/place & route
Flexible, reusable
Could use HLS techniques on top of it…
 
Restricted programming model
Forces the programmer to write 'good' code
Restricted code can be inferred from flexible C code where appropriate
 
6
 
Overlay Options:
 
Soft processor: limited performance
Single issue, in-order
2- or 4-way superscalar/VLIW register files map inefficiently to FPGAs
Expensive to implement CAMs for out-of-order execution
Multiprocessor-on-FPGA: complexity
Parallel programming and debugging
Area overhead for interconnect
Cache coherence, memory consistency
Soft vector processor: balance
For common embedded multimedia applications
Easily scalable data parallelism; just add more lanes (ALUs)
 
7
 
Soft Vector Processor
 
Change algorithm -> same RTL, just recompile software
Simple programming model
Data-level parallelism, exposed to the programmer
One hardware accelerator supports many applications
Scalable performance and area
Write once, run anywhere...
Small FPGA: 1 ALU (smaller than Nios II/f)
Large FPGA: 10s to 100s of parallel ALUs
Parameterizable; remove unused functions to save area
 
8
 
Vector Processing
 
Organize data as long vectors
Replace inner loop with a vector instruction:

for( i=0; i<N; i++ )
  a[i] = b[i] * c[i]

set vl, N
vmult  a, b, c

Hardware loops until all elements are processed
May execute repeatedly (sequentially)
May process multiple elements at once (in parallel)

9
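For concreteness, here is the same loop as a complete function using VENICE-style C macros, the same family used in the FIR example later in this deck (vector_malloc, vector_set_vl, vector_acc_2D, etc.). Treat the header name and exact call signatures as a sketch of the programming model, not the definitive API:

#include <stdint.h>
#include "vector.h"   /* VENICE vector macros (assumed header name) */

void vec_multiply( int32_t *a, int32_t *b, int32_t *c, int N )
{
    const int nbytes = N * sizeof(int32_t);

    /* Vectors live in the scratchpad: any number, any length */
    int32_t *vb = (int32_t *) vector_malloc( nbytes );
    int32_t *vc = (int32_t *) vector_malloc( nbytes );

    /* Concurrent DMA engine moves data between main memory and scratchpad */
    vector_dma_to_vector( vb, b, nbytes );
    vector_dma_to_vector( vc, c, nbytes );
    vector_wait_for_dma();

    /* One vector instruction replaces the inner loop; the hardware
       iterates over all N elements, sequentially and/or in parallel */
    vector_set_vl( N );
    vector( VVW, VMULLO, vb, vb, vc );   /* word-wide multiply (low half) */

    /* Explicit sync, then DMA the result back to main memory */
    vector_instr_sync();
    vector_dma_to_host( a, vb, nbytes );
    vector_wait_for_dma();

    vector_free();
}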
Hybrid Vector-SIMD
for( i=0; i<8; i++ ) {
    C[i] = A[i] + B[i]
    E[i] = C[i] * D[i]
}

[Figure: elements 0-3 and 4-7 of vectors C and E processed as SIMD groups,
alternating the add (C) and multiply (E) operations]

10
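In the same macro style as the earlier sketch, the two-statement loop body maps to two back-to-back vector instructions over scratchpad vectors (vA through vE are hypothetical scratchpad pointers; VADD is an assumed opcode name, VMULLO appears in the FIR example later):

vector_set_vl( 8 );
vector( VVW, VADD,   vC, vA, vB );   /* C[i] = A[i] + B[i] */
vector( VVW, VMULLO, vE, vC, vD );   /* E[i] = C[i] * D[i] */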
Previous SVP Work
 
1st Gen: VIPERS (UBC) & VESPA (UofT)
Similar to Cray-1 / VIRAM
Vector data in registers
Load/store for memory accesses

2nd Gen: VEGAS (UBC)
Optimized for FPGAs
Vector data in flat scratchpad
Addresses in scratchpad stored in separate register file
Concurrent DMA engine
11
VEGAS (2nd Gen) Architecture

Scalar Core: Nios II/f
DMA Engine: to/from external memory
Vector Core: VEGAS
Concurrent execution: in-order dispatch, explicit syncing

12
 
VEGAS (2nd Gen):
Efficient On-Chip Memory Use
 
Scratchpad-based “register file” (2kB..2MB)
Vector address register file stores vector locations
Very long vectors (no maximum length)
Any number of vectors
(no set number of “vector data registers”)
Double-pumped to achieve 4 ports
2 Read, 1 Write, 1 DMA Read/Write
 
Lanes support sub-word SIMD
32-bit ALU configurable as 1x32, 2x16, or 4x8
Keeps data packed in the scratchpad for a larger working set (software analogy below)
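As the software analogy referenced above: a 4x8-bit packed add within one 32-bit word, with masking so carries never cross byte lanes. In VEGAS/VENICE the 32-bit ALU does this natively in hardware; the C model only shows why packed data needs no unpacking:

#include <stdint.h>

/* Software model of a 4x8 sub-word SIMD add (the SVP ALU does this natively). */
static uint32_t add_4x8( uint32_t a, uint32_t b )
{
    uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu); /* add low 7 bits per byte */
    uint32_t msb  = (a ^ b) & 0x80808080u;                  /* top bits, carry-free */
    return low7 ^ msb;                                      /* carries never cross bytes */
}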
 
13
 
VENICE Architecture
 
3rd Generation
Optimized for Embedded Multimedia Applications
High Frequency, Low Area
 
14
 
VENICE Overview
Vector Extensions to NIOS Implemented Compactly and Elegantly
 
Continues from VEGAS (2nd Gen)
Scratchpad memory
Asynchronous DMA transactions
Sub-word SIMD (1x32, 2x16, 4x8 bit operations)
 
Optimized for smaller implementations
VEGAS achieves best performance/area at 4-8 lanes
Vector programs don't scale indefinitely
Communication networks scale worse than O(N)
VENICE targets 1-4 lanes
About 50% .. 75% of the size of VEGAS
 
15
 
VENICE Architecture
 
16
 
Selected Contributions
 
In-pipeline vector alignment network
No need for separate move/element shift instructions
Increased performance for sliding windows
Ease of programming/compiling (data packing)
Only shift/rotate vector elements
Scales as O(N * log N); see the model below
Acceptable to use multiple networks for few (1 to 4) lanes
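The O(N * log N) figure comes from a barrel-rotate structure: log2(N) stages, each a row of N muxes that conditionally rotates by a power of two. A behavioral C model of that structure, for intuition only (not the RTL):

#include <stdint.h>

/* Behavioral model of a log-depth lane-rotate network.
   Stage k rotates by 2^k lanes when bit k of 'amount' is set, so
   N lanes need log2(N) stages of N muxes each: O(N log N) area. */
void rotate_lanes( uint32_t *lanes, int num_lanes, int amount )
{
    uint32_t tmp[64];                         /* assume num_lanes <= 64, power of 2 */
    for (int shift = 1; shift < num_lanes; shift <<= 1) {
        if (amount & shift) {                 /* is this stage enabled? */
            for (int i = 0; i < num_lanes; i++)
                tmp[i] = lanes[(i + shift) % num_lanes];
            for (int i = 0; i < num_lanes; i++)
                lanes[i] = tmp[i];
        }
    }
}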
 
17
 
Scratchpad Alignment
 
18
 
Selected Contributions
 
In-pipeline vector alignment network
2D/3D vector operations
Set # of repeated ops and the increments on sources and dest
Increased instruction dispatch rate for multimedia
Replaces the increment/window of address registers (2nd Gen),
which had to be implemented in registers instead of BRAM
(modifying 3 registers per cycle)
 
19
 
VENICE Programming
FIR using 2D Vector Ops
 
int num_taps, num_samples;
int16_t *v_output, *v_coeffs, *v_input;

// Set up 2D vector parameters:
vector_set_vl( num_taps );                      // inner loop count
vector_set_2D( num_samples,                     // outer loop count
               1*sizeof(int16_t),               // dest gap
               (-num_taps  )*sizeof(int16_t),   // srcA gap
               (-num_taps+1)*sizeof(int16_t) ); // srcB gap

// Execute one instruction; 2D loop over the entire input,
// multiplying and accumulating:
vector_acc_2D( VVH, VMULLO, v_output, v_coeffs, v_input );
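For reference, a scalar-C sketch of what that single vector_acc_2D instruction computes, given the gap values above (assuming v_input holds at least num_samples + num_taps - 1 samples; how the wide accumulator is narrowed on writeback is glossed over here):

for (int i = 0; i < num_samples; i++) {      // outer loop: vector_set_2D count
    int32_t acc = 0;
    for (int j = 0; j < num_taps; j++)       // inner loop: vector length
        acc += v_coeffs[j] * v_input[i + j]; // srcA gap rewinds coeffs; srcB gap nets +1
    v_output[i] = (int16_t) acc;             // dest gap advances one halfword
}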
 
20
 
Area Breakdown
 
VENICE: Lower control & overall area
ICN (alignment) scales faster but does not dominate
 
21
 
Selected Contributions
 
In-pipeline vector alignment network
2D/3D vector operations
Single flag/condition code per instruction
Previous SVPs had separate flag registers/scratchpad
Instead, encode a flag bit with each byte/halfword/word
Can be stored in the 9th bit of 9/18/36-bit wide BRAMs
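A small C model of the idea: the flag travels with each element (in hardware, the 9th BRAM bit per byte), so predicated operations need no separate flag register file. The struct and the conditional-move operation below are hypothetical illustrations, not the VENICE ISA:

#include <stdint.h>

/* One scratchpad byte element plus its BRAM parity bit used as a flag. */
typedef struct { uint8_t data; uint8_t flag; } elem9;   /* flag = the "9th bit" */

/* Example predicated op: copy src elements whose flag is set. */
void vcmov_bytes( elem9 *dst, const elem9 *src, int vl )
{
    for (int i = 0; i < vl; i++)
        if (src[i].flag)            /* per-element condition travels with the data */
            dst[i].data = src[i].data;
}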
 
22
 
Selected Contributions
 
In-pipeline vector alignment network
2D/3D vector operations
Single flag/condition code per instruction
More efficient hybrid multiplier implementation
To support 1x32-bit, 2x16-bit, 4x8-bit multiplies
 
23
 
Fracturable Multipliers
in Stratix III/IV DSPs
 
[Figure: two fracturable multiplier implementations mapped onto Stratix III/IV DSP blocks]
2 DSP Blocks + extra logic
2 DSP Blocks (no extra logic)
 
24
 
Selected Contributions
 
In-pipeline vector alignment network
2D/3D vector operations
Single flag/condition code per instruction
More efficient hybrid multiplier implementation
Increased frequency
200MHz+ vs ~125MHz of previous SVPs
Deeper pipelining
Optimized circuit design
 
25
 
Average Speedup vs. ALMs

[Figure 6: speedup (geomean of 9 benchmarks, relative to Nios II/f) vs. area
(ALM count, relative to Nios II/f): single Nios II/f, ideal Nios II/f scaling,
and VEGAS / VENICE at V1, V2, V4]

Benchmark performance and properties (Table III):

Benchmark   Origin   In/Out     Data set     Nios II/f      Speedup
                                             (Melem/s)    V1     V2     V4
autocor     EEMBC    halfword   1024         0.46        12.9   24.2   41.2
rgbcmyk     EEMBC    byte       896x606      4.56         3.9    4.7    5.0
rgbyiq      EEMBC    byte       896x606      5.20         1.3    2.1    3.0
imgblend    VIRAM    halfword   320x240      4.83        16.1   30.1   52.0
filt3x3     VIRAM    byte       320x240      2.11         8.0   12.7   17.2
median      custom   byte       128x21       0.10         7.3   14.4   26.6
motest      custom   byte       32x32        0.09        27.4   48.2   72.4
fir         custom   halfword   4096         3.32         6.1   10.5   12.5
matmul      custom   word       1024x1024    11.7        12.6   27.4   50.6
geomean                                                   7.95  13.8   20.6

26
 
Computational Density

[Figure 7: computational density (speedup per ALM, relative to Nios II/f, log
scale) for VEGAS-V1 vs. VENICE-V1, per benchmark plus geomean]

Simple benchmarks (rgbcmyk, rgbyiq, imgblend, median) show the smallest gain
over VEGAS: they have long vectors and no misaligned vectors, so the speedup
comes mostly from the higher clock rate. Convolution benchmarks (fir, autocor)
benefit from VENICE's lack of a misalignment penalty; 2D vectors accelerate
autocor, motest, and fir. On matmul, 3D vectors plus the accumulators reach
3.2x the performance of VEGAS.

For one application, rgbyiq, computational density falls below 1.0 on VENICE,
meaning Nios II/f is better: the 1.8x area overhead exceeds the 1.3x speedup
(memory access patterns over r,g,b triplets and wide 32b intermediates to
prevent overflow). On average, VENICE-V1 offers 3.8x the computational density
of Nios II/f and 2.3x that of VEGAS-V1; at V4, VENICE reaches 5.2x vs. 2.7x
for VEGAS (relative to Nios).

27
Speeding up the 4x4 DCT (16-bit)

VENICE exploits vector parallelism even when vectors are short, and by staying
small it can be deployed in multiprocessor systems to exploit other forms of
parallelism (e.g., thread-level) on top of vector-level parallelism.

Case study: 4x4 DCT with 16-bit elements over 8192 matrices. Each DCT is two
matrix multiplies followed by shifts for normalization; vectors are limited to
four halfwords for the matrix multiply.
[Figure 8: the first set of bars shows the benefit of 2D/3D vector operations
on a V2 VENICE]

28
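In scalar C, one such block is two 4x4 matrix multiplies with a normalizing shift, Y = (M * X * M^T) >> s. The coefficient matrix M and shift s below are placeholders, not the benchmark's exact values:

#include <stdint.h>

/* One 4x4 DCT block: Y = (M * X * M^T) >> s  (M and s are placeholders). */
void dct4x4( int16_t Y[4][4], const int16_t X[4][4],
             const int16_t M[4][4], int s )
{
    int32_t T[4][4];
    for (int i = 0; i < 4; i++)              /* T = M * X */
        for (int j = 0; j < 4; j++) {
            int32_t acc = 0;
            for (int k = 0; k < 4; k++)
                acc += M[i][k] * X[k][j];
            T[i][j] = acc;
        }
    for (int i = 0; i < 4; i++)              /* Y = (T * M^T) >> s */
        for (int j = 0; j < 4; j++) {
            int32_t acc = 0;
            for (int k = 0; k < 4; k++)
                acc += T[i][k] * M[j][k];    /* M^T indexing */
            Y[i][j] = (int16_t)(acc >> s);
        }
}

On VENICE the same computation runs as 2D/3D vector operations across all 8192 blocks, even though each vector is only four halfwords long.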
 
Future Work
 
Scatter/gather functionality
Vector indexed memory accesses
Need to coalesce to get good bandwidth utilization
 
Automatic pruning of unneeded overlay functionality
"HLS in reverse"
Hybrid HLS + overlays
Start from the overlay and synthesize in extra functionality?
Overlay + custom instructions?
 
29
 
Conclusions
 
Soft Vector Processors
Scalable performance
No hardware recompilation necessary
VENICE
Optimized for FPGAs, 1 to 4 lanes
5X performance/area of Nios II/f on multimedia benchmarks
 
30