High-Level Synthesis (HLS) Process

 
L5: HLS Overview
 
Cong (Callie) Hao
callie.hao@ece.gatech.edu
 
Assistant Professor
ECE, Georgia Institute of Technology
 
Sharc-lab @ Georgia Tech 
https://sharclab.ece.gatech.edu/
 
Acknowledgement: some contents are borrowed from Prof. Zhiru Zhang at Cornell University
 
About Lab 1

Released here: https://github.com/sharc-lab/FPGA_ECE8893
  o Including Tutorial
Due: Feb. 8th, 11:59 pm
Optimization for a matrix multiplication kernel
  o Will use techniques we’re about to learn next week
 
High-Level Synthesis (HLS)
What? Why? How? Future?
 
What is HLS?

An automated design process that transforms a high-level functional specification into optimized register-transfer level (RTL) descriptions for efficient hardware implementation

Design flow: Software Specification and Program -> High-Level Synthesis -> Logic Synthesis -> Physical Synthesis -> Circuit (ASIC, FPGA) Design
Languages shown on the slide: Verilog, VHDL, C/C++, Chisel, ...
 
What is HLS?

Behavioral-level: Expressive and concise

for (int h = 0; h < H; h++)
  for (int w = 0; w < W; w++)
    for (int m = 0; m < K; m++)
      for (int n = 0; n < K; n++)
        ...

HLS Tools

Register-Transfer-Level (RTL): ???
 
High-Level Synthesis (HLS)
What? Why? How? Future?
 
Why HLS (Design at Higher Level)?

Productivity
  o Lower design complexity and faster simulation speed
  o Ease of use: C/C++/Python vs. Verilog
Portability
  o Single source -> multiple implementations (devices)
Permutability
  o Many more optimization opportunities at a higher level
  o Rapid design space exploration -> higher quality of result (QoR)
Bonus:
  o Promote device usage
  o Significant code size reduction
Shorter simulation/verification cycle
Quick / early design iterations
 
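Why HLS (Design at Higher Level)?

Values are listed from higher to lower abstraction level, as far as the slide layout can be recovered.
Abstraction levels: System level, Behavior level (C/C++), Register-transfer level (RTL, Verilog), Logic level (netlist), Transistor/Layout level
Code size: 40K lines, 300K lines, 1M gates, ?? transistors
Simulation speed: ~ 1 MHz, 10K ~ 100 KHz, 100 Hz ~ 1 KHz, 10 ~ 100 Hz
High-Level Synthesis: minutes ~ hours
Logic Synthesis: hours ~ days
Physical Synthesis: days ~ weeks
Performance improvement: 10 ~ 20X, 2 ~ 5X, 20 ~ 50%

[source: Wakabayashi, DAC’05 tutorial]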
 
High-Level Synthesis (HLS)
What? Why? How? Future?

How to design (a better) HLS tool
How to use HLS tool
 
Hardware Specialization with HLS

Where does performance gain come from? Specialization!

Data type specialization
  o Arbitrary-precision fixed-point, custom floating-point
Interface/communication specialization
  o Streaming, memory-mapped I/O, etc.
Memory specialization
  o Array partitioning, data reuse, etc.
Compute specialization
  o Unrolling, pipelining, dataflow, multithreading, etc.
Architecture specialization
  o Pipelined, recursive, hybrid, etc.

1. System-level: architecture, interface
2. Module-level: compute, memory
3. Bit-level: data type
 
HLS Pragmas

All about “pragma”s: instructions that tell your compiler how to build the hardware
This link has all the pragmas you need:
  o https://docs.xilinx.com/r/en-US/ug1399-vitis-hls/HLS-Pragmas
 
More HLS Resources

This link has everything you need to know about HLS…
  o https://docs.xilinx.com/r/en-US/ug1399-vitis-hls/Getting-Started-with-Vitis-HLS
  o I can be 100% substituted by this link
 
Hardware Specialization with HLS

Hardware is structured, hierarchical, and deterministic at compile time
So are Verilog and HLS
 
Structures and Hierarchy

Hierarchical HDL structures are achieved by defining modules (definition) and instantiating modules (instance)
  o Instantiation is the process of “calling” a module

module TOP ( port_list );        // Definition
  ALU U1 ( port_connection );    // Instance
  MEM U2 ( port_connection );    // Instance
endmodule

module ALU ( port_list );        // Definition
  FIFO S1 ( port_connection );   // Instance
endmodule
 
Module Ports

The module header starts with the module keyword and contains the I/O ports
Port declarations begin with output, input, or inout, followed by bus indices
  o Ports provide the interface by which a module communicates with its environment

Figure: a module with ports a [1:0], b [1:0], c, d [1:0], and e
 
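Basic Mapping Rule from C/C++ to RTL

C constructs        ->  RTL components
Functions           ->  Modules
Arguments           ->  I/O ports
Operators (+, *)    ->  Functional units (adder, multiplier)
Scalars             ->  Wires or registers
Arrays              ->  Memory
Control flows       ->  Control logic (finite state machine)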
 
Basic Mapping Rule from C/C++ to RTL

C Source Code:

void Foo_C() {...}
void Foo_A() {...}
void Foo_B() {
  Foo_C();
}

void top() {
  Foo_A();
  Foo_B();
  ...
  Foo_B();
}

RTL Hierarchy: top contains Foo_A and Foo_B; Foo_B contains Foo_C
Resource sharing: only one instance of Foo_B on hardware
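
The function hierarchy on this slide can be tried in plain C++. The sketch below is only an illustration: the bodies are hypothetical (the slide elides them with {...}), and the int arguments and return values are added so the call chain can be checked in software. `#pragma HLS inline off` is a Vitis HLS pragma that asks the tool not to inline a function, which keeps it as its own module; an ordinary C++ compiler ignores it.

```cpp
// Hypothetical bodies: the slide does not say what these functions compute.
int Foo_C(int x) {
#pragma HLS inline off   // ask the tool to keep Foo_C as a separate module
    return x + 1;
}

int Foo_A(int x) {
#pragma HLS inline off
    return x * 2;
}

int Foo_B(int x) {
#pragma HLS inline off
    return Foo_C(x) + 3;  // Foo_C is instantiated inside Foo_B
}

// Top-level function: maps to the top RTL module.
int top(int x) {
    int a = Foo_A(x);
    int b = Foo_B(a);     // first call to Foo_B
    return Foo_B(b);      // second call: can reuse the same Foo_B instance
}

// Software-only check of the call chain.
int demo_top() {
    return top(1);
}
```

Because the second call to Foo_B depends on the result of the first, the two calls cannot overlap in time, which is the situation where sharing a single hardware instance costs nothing.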
 
Basic Mapping Rule from C/C++ to RTL

C Source Code:

void top(int* in1, int* in2, int* out) {
  *out = *in1 + *in2;
}

RTL: module top contains a Datapath and an FSM, with data ports in1, in2, and out and valid signals in1_vld, in2_vld, and out_vld
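
A minimal sketch of how the valid signals in the figure could be requested explicitly. The interface mode is an assumption: `ap_vld` is one Vitis HLS port-level protocol that pairs a data port with a valid signal, and the `mode=` spelling follows recent Vitis HLS releases (older releases write the mode without `mode=`). The slide does not say which interface settings were used, and the generated signal names may differ from the figure.

```cpp
// Adds two scalars passed by pointer, as on the slide.
void top(int* in1, int* in2, int* out) {
// Assumed interface choice: ap_vld adds a valid signal next to each data port.
#pragma HLS interface mode=ap_vld port=in1
#pragma HLS interface mode=ap_vld port=in2
#pragma HLS interface mode=ap_vld port=out
    *out = *in1 + *in2;   // one adder in the datapath
}

// Software-only check; the pragmas have no effect in a regular C++ build.
int demo_add() {
    int a = 2, b = 3, sum = 0;
    top(&a, &b, &sum);
    return sum;
}
```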
 
Basic Mapping Rule from C/C++ to RTL

C Source Code:

for (i = 0; i < N; i++)
  A[i+x] = A[i] + i;

RTL: array A (A[0] ... A[N-1]) maps to a RAM inside top
 
Deterministic at Compile Time

On FPGA, memory maps to BRAM
Everything must be decided at compile time: your hardware cannot be changed while running!
  o Adding one more piece of memory after the circuit is built?

int mem[var];
int* mem = malloc(var * sizeof(int));
reg [0:7] mem [var:0];

Figure: a memory built from 8-bit elements whose depth depends on var. How many??
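
One common way around a runtime-sized array is to fix a maximum capacity at compile time and let the runtime length select how much of it is used. The sketch below assumes a hypothetical bound MAX_LEN and a hypothetical function sum_buffered; neither comes from the slides.

```cpp
constexpr int MAX_LEN = 16;  // hypothetical capacity, fixed at compile time

// Copies up to MAX_LEN inputs into a local buffer and returns their sum.
// The runtime length only selects how much of the fixed buffer is used.
int sum_buffered(const int* in, int len) {
    int mem[MAX_LEN];                    // static size: can map to RAM or registers
    if (len > MAX_LEN) len = MAX_LEN;    // cannot exceed what was built
    if (len < 0) len = 0;
    int total = 0;
    for (int i = 0; i < MAX_LEN; i++) {  // fixed trip count
        if (i < len) {
            mem[i] = in[i];
            total += mem[i];
        }
    }
    return total;
}

// Software-only check.
int demo_sum() {
    int data[4] = {1, 2, 3, 4};
    return sum_buffered(data, 4);
}
```

The trade-off is that the hardware always pays for MAX_LEN elements, even when fewer are used.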
 
Hardware Specialization with HLS

Where does performance gain come from? Specialization!

Data type specialization
  o Arbitrary-precision fixed-point, custom floating-point
Interface/communication specialization
  o Streaming, memory-mapped I/O, etc.
Memory specialization
  o Array partitioning, data reuse, etc.
Compute specialization
  o Unrolling, pipelining, dataflow, multithreading, etc.
Architecture specialization
  o Pipelined, recursive, hybrid, etc.

1. System-level: architecture, interface
2. Module-level: compute, memory
3. Bit-level: data type

Discuss data type next week
Remember: FPGA doesn’t like floating point! Use integer at least :D
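
To make the "use integer" advice concrete, here is a minimal sketch of fixed-point arithmetic built on plain integers. The format (8 integer bits, 8 fractional bits) and all names are assumptions chosen for illustration. In Vitis HLS the usual route is the tool's arbitrary-precision types such as ap_fixed<W, I> from ap_fixed.h, which this sketch deliberately avoids so that it compiles with any C++ compiler.

```cpp
#include <cstdint>

constexpr int FRAC_BITS = 8;       // hypothetical format: 8 fractional bits
using fixed_t = std::int16_t;      // 16-bit container for the fixed-point value

// Convert between real values and the scaled-integer representation.
fixed_t to_fixed(double x) {
    return static_cast<fixed_t>(x * (1 << FRAC_BITS));
}

double to_double(fixed_t x) {
    return static_cast<double>(x) / (1 << FRAC_BITS);
}

// Multiply in a wider type, then shift back to drop the extra fractional bits.
// Only exercised here with non-negative values.
fixed_t fixed_mul(fixed_t a, fixed_t b) {
    std::int32_t wide = static_cast<std::int32_t>(a) * static_cast<std::int32_t>(b);
    return static_cast<fixed_t>(wide >> FRAC_BITS);
}
```

For example, fixed_mul(to_fixed(1.5), to_fixed(2.0)) produces the same bits as to_fixed(3.0), using one integer multiplier and a shift instead of a floating-point unit.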
 
Hardware Specialization with HLS

Where does performance gain come from? Specialization!

Data type specialization
  o Arbitrary-precision fixed-point, custom floating-point
Interface/communication specialization
  o Streaming, memory-mapped I/O, etc.
Memory specialization
  o Array partitioning, data reuse, etc.
Compute specialization
  o Unrolling, pipelining, dataflow, multithreading, etc.
Architecture specialization
  o Pipelined, recursive, hybrid, etc.

The Three Musketeers
(i) Array partition
(ii) Loop unroll
(iii) Loop pipeline
 
Array Partition – Memory Parallelism

Initially, an array is mapped to one (or more) block(s) of RAM (or BRAM on FPGA)
  o One block of RAM has at most two ports
  o At most two read/write operations can be done in one clock cycle: parallelism is 2 (too low)
An array can be partitioned and mapped to multiple blocks of RAM

Figure: 1 RAM block holding A[0] ... A[7], versus 4 RAM blocks ({A[0], A[4]}, {A[1], A[5]}, {A[2], A[6]}, {A[3], A[7]}) that can be accessed simultaneously!
 
Array Partition – Memory Parallelism

Initially, an array is mapped to one (or more) block(s) of RAM (or BRAM on FPGA)
  o One block of RAM has at most two ports
  o At most two read/write operations can be done in one clock cycle: parallelism is 2 (too low)
An array can be partitioned and mapped to multiple blocks of RAM
  o It can also be partitioned into individual elements and mapped to registers
  o Only if your array is small; otherwise the tool will give up

Figure: 1 RAM block holding A[0] ... A[7], versus 4 RAM blocks ({A[0], A[4]}, {A[1], A[5]}, {A[2], A[6]}, {A[3], A[7]}) that can be accessed simultaneously, OR all eight elements in registers
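
The four-block layout in the figure (A[0] and A[4] together, A[1] and A[5] together, and so on) matches what is usually called cyclic partitioning. The sketch below models that address mapping in software; the struct, the helper names, and the sum_all kernel are hypothetical, and the pragma spelling follows the style used later in this deck (recent Vitis HLS releases may expect `type=cyclic` instead).

```cpp
constexpr int SIZE = 8;    // array length from the figure
constexpr int BANKS = 4;   // number of RAM blocks after partitioning

struct BankAddr {
    int bank;    // which RAM block holds the element
    int offset;  // position inside that block
};

// Cyclic partitioning: consecutive elements go to consecutive banks.
BankAddr cyclic_map(int i) {
    return BankAddr{i % BANKS, i / BANKS};
}

// True if elements i and j sit in different banks, so they do not
// compete for the same two ports.
bool different_banks(int i, int j) {
    return cyclic_map(i).bank != cyclic_map(j).bank;
}

// A kernel that benefits from the partition: four reads per group of iterations.
int sum_all(const int in[SIZE]) {
    int A[SIZE];
#pragma HLS array_partition variable=A cyclic factor=4 dim=1
    for (int i = 0; i < SIZE; i++) A[i] = in[i];
    int total = 0;
    for (int i = 0; i < SIZE; i++) {
#pragma HLS unroll factor=4
        total += A[i];   // each group of four consecutive elements spans four banks
    }
    return total;
}

// Software-only check.
int demo_sum_all() {
    int data[SIZE] = {1, 2, 3, 4, 5, 6, 7, 8};
    return sum_all(data);
}
```

With this mapping, A[5] lands in bank 1 at offset 1, next to A[1], as in the figure.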
 
Loop Unrolling

Loop unrolling exposes higher parallelism and achieves shorter latency
  o Pros
      Decrease loop overhead
      Increase parallelism for scheduling
  o Cons
      Increase operation count, which may negatively impact area, power, and timing

Original Loop (N x m cycles):

for (int i = 0; i < N; i++) {
#pragma HLS unroll
  A[i] = B[i] + C[i];
}

Unrolled Loop (m cycles):

A[0] = B[0] + C[0];
A[1] = B[1] + C[1];
A[2] = B[2] + C[2];
...

Assume A[i] = B[i] + C[i] takes m cycles
The unrolled loop takes m cycles only if A, B, and C are fully partitioned!
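
A minimal sketch of the unrolled vector add together with the partitioning it depends on, followed by a rough cycle estimate. The vector length N = 8 and the function names are assumptions. The estimate is a simplified model written for this example, not something an HLS tool reports: it assumes each array can serve `ports` accesses at a time and that each group of parallel iterations takes m cycles with no overlap between groups.

```cpp
#include <algorithm>

constexpr int N = 8;   // hypothetical vector length

// Fully unrolled vector add; complete partitioning turns each array into
// individual registers so all N additions can read their operands at once.
void vec_add(const int B[N], const int C[N], int A[N]) {
#pragma HLS array_partition variable=A complete
#pragma HLS array_partition variable=B complete
#pragma HLS array_partition variable=C complete
    for (int i = 0; i < N; i++) {
#pragma HLS unroll
        A[i] = B[i] + C[i];
    }
}

// Simplified cycle model (an assumption): at most min(unroll, ports)
// iterations run side by side, and each such group takes m cycles.
int estimated_cycles(int n, int m, int unroll, int ports) {
    int lanes = std::min(unroll, ports);
    int groups = (n + lanes - 1) / lanes;  // ceiling division
    return groups * m;
}

// Software-only check of the kernel.
int demo_vec_add() {
    int B[N], C[N], A[N];
    for (int i = 0; i < N; i++) {
        B[i] = i;
        C[i] = 10;
    }
    vec_add(B, C, A);
    return A[N - 1];
}
```

Under this model with N = 8 and m = 2, the original loop takes 16 cycles (N x m), the fully unrolled and fully partitioned loop takes 2 cycles (m), and a fully unrolled loop over unpartitioned two-port memories still takes 8 cycles, which is why unrolling alone is not enough.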
 
Loop Pipelining

Loop pipelining is one of the most important optimizations for high-level synthesis
  o Allows a new iteration to begin processing before the previous iteration is complete
  o Key metric: Initiation Interval (II) in # cycles

for (i = 0; i < N; ++i) {
#pragma HLS pipeline
  p[i] = x[i] * y[i];
}

Figure: each iteration loads x[i] and y[i] (ld), multiplies them (×), and stores p[i] (st); iterations i=0, i=1, i=2, i=3 start one cycle apart along the cycles axis, so II = 1
ld: Load, st: Store
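
The effect of II on total latency can be worked out with the usual pipeline formula: the first iteration needs the full pipeline depth, and every later iteration finishes II cycles after the one before it. The sketch below assumes a depth of 3 cycles (load, multiply, store at one cycle each), which is an illustrative reading of the figure rather than a number stated on the slide; the function names are hypothetical.

```cpp
// Cycles for n iterations when each one runs to completion before the next starts.
int sequential_cycles(int n, int depth) {
    return n * depth;
}

// Cycles for n iterations of a pipelined loop:
// depth cycles for the first iteration, then one more result every ii cycles.
int pipelined_cycles(int n, int ii, int depth) {
    if (n <= 0) return 0;
    return depth + (n - 1) * ii;
}
```

For the four iterations in the figure, with the assumed depth of 3, this gives 12 cycles without pipelining and 6 cycles with II = 1; with II = 2 it would be 9 cycles.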
 
Put-together: Pipeline + Unroll + Partition

The three techniques are frequently used together to boost computation efficiency

for (int i = 0; i < N; i++) {
  for (int j = 0; j < M; j++) {
    A[i][j] = B[i][j] * C[i][j];
  }
}

Figure: arrays A, B, and C are stored in RAM block 1, RAM block 2, and RAM block 3, respectively
 
Put-together: Pipeline + Unroll + Partition

The three techniques are frequently used together to boost computation efficiency

for (int i = 0; i < N; i++) {
  for (int j = 0; j < M; j++) {
#pragma HLS unroll
    A[i][j] = B[i][j] * C[i][j];
  }
}

Figure: with the inner loop unrolled, the elements in RAM block 1, RAM block 2, and RAM block 3 should be computed in parallel
Memory ports are limited to 2 -> need to partition
 
Put-together: Pipeline + Unroll + Partition

The three techniques are frequently used together to boost computation efficiency

#pragma HLS array_partition variable=A dim=2 complete
#pragma HLS array_partition variable=B dim=2 complete
#pragma HLS array_partition variable=C dim=2 complete

for (int i = 0; i < N; i++) {
  for (int j = 0; j < M; j++) {
#pragma HLS unroll
    A[i][j] = B[i][j] * C[i][j];
  }
}

Figure: each array is split along dimension 2 into separate blocks (Block 1, Block 2, ..., Block N)
 
Put-together: Pipeline + Unroll + Partition

The three techniques are frequently used together to boost computation efficiency

#pragma HLS array_partition variable=A dim=2 complete
#pragma HLS array_partition variable=B dim=2 complete
#pragma HLS array_partition variable=C dim=2 complete

for (int i = 0; i < N; i++) {
  for (int j = 0; j < M; j++) {
#pragma HLS unroll
    A[i][j] = B[i][j] * C[i][j];
  }
}

Figure: timeline of outer-loop iterations i=0, i=1, i=2
 
Put-together: Pipeline + Unroll + Partition

The three techniques are frequently used together to boost computation efficiency

#pragma HLS array_partition variable=A dim=2 complete
#pragma HLS array_partition variable=B dim=2 complete
#pragma HLS array_partition variable=C dim=2 complete

for (int i = 0; i < N; i++) {
#pragma HLS pipeline II=1
  for (int j = 0; j < 32; j++) {
#pragma HLS unroll factor=8
    A[i][j] = B[i][j] * C[i][j];
  }
}

Figure: two timelines of outer-loop iterations i=0, i=1, i=2
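
The final slide's pragmas can be wrapped into a complete function and checked in software before synthesis. This is a sketch under assumptions: the row count N = 4, the function name elementwise_mul, and the int element type are not given on the slides, and the partition pragmas are placed inside the function body, which is where Vitis HLS expects them.

```cpp
constexpr int N = 4;    // hypothetical row count
constexpr int M = 32;   // inner trip count used on the slide

// Element-wise multiply with all three techniques applied together.
void elementwise_mul(int B[N][M], int C[N][M], int A[N][M]) {
#pragma HLS array_partition variable=A dim=2 complete
#pragma HLS array_partition variable=B dim=2 complete
#pragma HLS array_partition variable=C dim=2 complete
    for (int i = 0; i < N; i++) {
#pragma HLS pipeline II=1
        for (int j = 0; j < M; j++) {
#pragma HLS unroll factor=8
            A[i][j] = B[i][j] * C[i][j];
        }
    }
}

// Software-only check: the pragmas are ignored by a regular C++ compiler,
// so this verifies the arithmetic, not the hardware schedule.
int demo_mul() {
    static int A[N][M], B[N][M], C[N][M];
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
            B[i][j] = i + j;
            C[i][j] = 2;
        }
    }
    elementwise_mul(B, C, A);
    return A[N - 1][M - 1];
}
```

One point worth checking in the synthesis report: the Vitis HLS documentation states that pipelining a loop unrolls the loops nested inside it, so the tool may treat the inner loop as fully unrolled regardless of the factor given.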
 
Hardware Specialization with HLS

Where does performance gain come from? Specialization!

Data type specialization
  o Arbitrary-precision fixed-point, custom floating-point
Interface/communication specialization
  o Streaming, memory-mapped I/O, etc.
Memory specialization
  o Array partitioning, data reuse, etc.
Compute specialization
  o Unrolling, pipelining, dataflow, multithreading, etc.
Architecture specialization
  o Pipelined, recursive, hybrid, etc.

1. System-level: architecture, interface
2. Module-level: compute, memory
3. Bit-level: data type

Talk more in the future
 
Summary

HLS is good…!
HLS is all about pragmas
Optimization starting point:
  o Memory partition + loop unrolling + loop pipelining
Next lecture:
  o More about loop optimization

High-Level Synthesis (HLS) is an automated design process that converts high-level functional specifications into optimized Register-Transfer Level (RTL) descriptions for hardware implementation. HLS tools take a behavioral description, typically written in C/C++, and produce RTL, which then goes through logic synthesis and physical synthesis to become an ASIC or FPGA circuit design.


