Understanding High-Level Synthesis (HLS) Process

High-Level Synthesis (HLS) is an automated design process that converts a high-level functional specification into an optimized Register-Transfer Level (RTL) description for efficient hardware implementation. It lets designers start from a software-style specification and program, with logic synthesis and physical synthesis following the HLS step. Circuit designs for ASICs and FPGAs can be written in languages such as Verilog, VHDL, C/C++, and Chisel; HLS tools cover the step from behavioral code such as C/C++ down to RTL.


Uploaded on Oct 08, 2024


Presentation Transcript


  1. L5: HLS Overview
     Cong (Callie) Hao, callie.hao@ece.gatech.edu
     Assistant Professor, ECE, Georgia Institute of Technology
     Sharc-lab @ Georgia Tech, https://sharclab.ece.gatech.edu/
     Acknowledgement: some content is borrowed from Prof. Zhiru Zhang at Cornell University.

  2. About Lab 1
     Released here: https://github.com/sharc-lab/FPGA_ECE8893
       o Including a tutorial
     Due: Feb. 8th, 11:59 pm
     Optimization for a matrix multiplication kernel
       o Will use techniques we're about to learn next week
     Callie Hao | Sharc-lab @ Georgia Institute of Technology https://sharclab.ece.gatech.edu/

  3. High-Level Synthesis (HLS): What? Why? How? Future?

  4. What is HLS?
     An automated design process that transforms a high-level functional specification into an optimized register-transfer level (RTL) description for efficient hardware implementation.
     [Figure: design flow in which a software specification and program passes through high-level synthesis (HLS tools), logic synthesis, and physical synthesis on the way to a circuit (ASIC, FPGA) design. Languages shown include Verilog, VHDL, C/C++, and Chisel.]

  5. What is HLS?
     Behavioral level: expressive and concise
         for (int h = 0; h < H; h++)
           for (int w = 0; w < W; w++)
             for (int m = 0; m < K; m++)
               for (int n = 0; n < K; n++)
                 ...
     HLS tools translate this into the register-transfer level (RTL): ???
         `include "xxx.v"
         module top ();

         reg clk;
         reg rst;
         reg req3;
         reg req2;
         reg req1;
         reg req0;
         wire gnt3;
         wire gnt2;
         wire gnt1;
         wire gnt0;
         ...
         // Clock generator
         always #1 clk = ~clk;

         initial begin
           $dumpfile ("arbiter.vcd");
           $dumpvars();
           clk = 0;
           rst = 1;
           ...
           repeat (1) @ (posedge clk);
           req0 <= 1;
           req1 <= 1;
           repeat (1) @ (posedge clk);
           req2 <= 1;
           req1 <= 0;
           repeat (1) @ (posedge clk);
           req3 <= 1;
           ...
           req0 <= 0;
           repeat (1) @ (posedge clk);
           #10 $finish;
         end

         // Connect the DUT
         arbiter U (
           clk,
           rst,
           ...

  6. High-Level Synthesis (HLS): What? Why? How? Future?

  7. Why HLS (Design at Higher Level)?
     Productivity
       o Lower design complexity and faster simulation speed
       o Ease of use: C/C++/Python vs. Verilog
     Portability
       o Single source -> multiple implementations (devices)
     Permutability
       o Many more optimization opportunities at a higher level
       o Rapid design space exploration -> higher quality of result (QoR)
     Bonus:
       o Promote device usage
       o Significant code size reduction
     Callouts: shorter simulation/verification cycle; quick / early design iterations

  8. Why HLS (Design at Higher Level)?
     Level                        Code size      Simulation speed                  Performance improvement
     System level                                ~ 1MHz                            10 ~ 20X
     Behavior level (C/C++)       40K lines      10K ~ 100KHz (minutes ~ hours)
       -- High-Level Synthesis --
     Register-transfer level      300K lines     100 ~ 1KHz (hours ~ days)         2 ~ 5X
       (RTL, Verilog)
       -- Logic Synthesis --
     Logic level (netlist)        1M gates       10 ~ 100Hz (days ~ weeks)         20 ~ 50%
       -- Physical Synthesis --
     Transistor / layout level    ?? transistors
     [source: Wakabayashi, DAC 05 tutorial]

  9. High-Level Synthesis (HLS): What? Why? How? Future?
     How:
       o How to design (a better) HLS tool
       o How to use an HLS tool

  10. Hardware Specialization with HLS
      Where does performance gain come from? Specialization!
      Data type specialization
        o Arbitrary-precision fixed-point, custom floating-point
      Interface/communication specialization
        o Streaming, memory-mapped I/O, etc.
      Memory specialization
        o Array partitioning, data reuse, etc.
      Compute specialization
        o Unrolling, pipelining, dataflow, multithreading, etc.
      Architecture specialization
        o Pipelined, recursive, hybrid, etc.
      Levels:
        1. System-level: architecture, interface
        2. Module-level: compute, memory
        3. Bit-level: data type

  11. HLS Pragmas
      All about pragmas: instructions that tell your compiler how to build the hardware
      This link has all the pragmas you need:
        o https://docs.xilinx.com/r/en-US/ug1399-vitis-hls/HLS-Pragmas

  12. More HLS Resources
      This link has everything you need to know about HLS:
        o https://docs.xilinx.com/r/en-US/ug1399-vitis-hls/Getting-Started-with-Vitis-HLS
        o I can be 100% substituted by this link

  13. Hardware Specialization with HLS
      Hardware is structured, hierarchical, and deterministic at compile time.
      So are Verilog and HLS.

  14. Structures and Hierarchy
      Hierarchical HDL structures are achieved by defining modules (definition) and instantiating modules (instance)
        o Instantiation is the process of calling a module
          module TOP ( port_list );
            ALU U1 ( port_connection );
            MEM U2 ( port_connection );
          endmodule

          module ALU ( port_list );
            FIFO S1 ( port_connection );
          endmodule
      Labels: "ALU U1 ( ... )" inside TOP is an instance; "module ALU ( ... )" is the definition.

  15. Module Ports
      The module header starts with the module keyword and contains the I/O ports.
      Port declarations begin with output, input, or inout, followed by bus indices.
        o Ports provide the interface by which a module communicates with its environment
      [Figure: a module with ports a[1:0], b[1:0], c, d[1:0], and e]

  16. Basic Mapping Rule from C/C++ to RTL
      C constructs        ->  RTL components
      Functions           ->  Modules
      Arguments           ->  I/O ports
      Operators (+, *)    ->  Functional units (adder, multiplier)
      Scalars             ->  Wires or registers
      Arrays              ->  Memory
      Control flows       ->  Control logic (finite state machine)

  17. Basic Mapping Rule from C/C++ to RTL: functions map to modules
      C source code:
          void Foo_C() {...}
          void Foo_A() {...}
          void Foo_B() { Foo_C(); }
          void top() { Foo_A(); Foo_B(); ... Foo_B(); }
      RTL hierarchy: top contains Foo_A and Foo_B; Foo_B contains Foo_C.
      Resource sharing: only one instance of Foo_B on hardware, even though top calls it more than once.

  18. Basic Mapping Rule from C/C++ to RTL: arguments map to I/O ports
      C source code:
          void top(int* in1, int* in2, int* out) { *out = *in1 + *in2; }
      [Figure: module top with a datapath connecting ports in1 and in2 to out, and an FSM handling in1_vld, in2_vld, and out_vld]

  19. Basic Mapping Rule from C/C++ to RTL: arrays map to memory
      C source code:
          for (i = 0; i < N; i++) A[i+x] = A[i] + i;
      [Figure: module top containing a RAM that holds A[0], A[1], ..., A[N-2], A[N-1]]

  20. Deterministic at Compile Time
      On FPGA, memory maps to BRAM.
      Everything must be decided at compile time: your hardware cannot be changed while running!
        o Adding one more piece of memory after the circuit is built?
      Sizes that depend on a runtime variable leave the memory size undecided:
          int mem[var];
          int *mem = malloc(var * sizeof(int));
          reg [0:7] mem [var:0];
      [Figure: a memory made of 8-bit elements. How many??]
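
A minimal sketch of one common way to live with the compile-time rule above: give the buffer a fixed capacity and let only the trip count vary at run time. The names `MAX_LEN` and `sum_buffer` and the capacity of 64 are assumptions made for illustration; they do not come from the slides.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical upper bound, fixed at compile time so the memory size is known.
constexpr std::size_t MAX_LEN = 64;

// Sums the first `len` entries of `in` after staging them in a local buffer.
// The buffer capacity is a compile-time constant; only the trip count varies.
int sum_buffer(const int in[MAX_LEN], std::size_t len) {
    assert(len <= MAX_LEN);  // the runtime length must fit the static capacity
    int mem[MAX_LEN];        // statically sized, unlike int mem[var] or malloc
    for (std::size_t i = 0; i < len; ++i) {
        mem[i] = in[i];
    }
    int total = 0;
    for (std::size_t i = 0; i < len; ++i) {
        total += mem[i];
    }
    return total;
}
```

The trade-off is that the capacity is paid for even when `len` is small, which is the price of hardware that cannot grow after it is built.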

  21. Hardware Specialization with HLS
      Where does performance gain come from? Specialization!
      Data type specialization
        o Arbitrary-precision fixed-point, custom floating-point
      Interface/communication specialization
        o Streaming, memory-mapped I/O, etc.
      Memory specialization
        o Array partitioning, data reuse, etc.
      Compute specialization
        o Unrolling, pipelining, dataflow, multithreading, etc.
      Architecture specialization
        o Pipelined, recursive, hybrid, etc.
      Levels:
        1. System-level: architecture, interface
        2. Module-level: compute, memory
        3. Bit-level: data type
      Discuss data type next week. Remember: FPGA doesn't like floating point! Use integer at least :D

  22. Hardware Specialization with HLS
      Where does performance gain come from? Specialization!
      Data type specialization
        o Arbitrary-precision fixed-point, custom floating-point
      Interface/communication specialization
        o Streaming, memory-mapped I/O, etc.
      Memory specialization
        o Array partitioning, data reuse, etc.
      Compute specialization
        o Unrolling, pipelining, dataflow, multithreading, etc.
      Architecture specialization
        o Pipelined, recursive, hybrid, etc.
      The Three Musketeers: (i) array partition, (ii) loop unroll, (iii) loop pipeline

  23. Array Partition: Memory Parallelism
      Initially, an array is mapped to one (or more) block(s) of RAM (or BRAM on FPGA)
        o One block of RAM has at most two ports
        o At most two read/write operations can be done in one clock cycle, so parallelism is 2 (too low)
      An array can be partitioned and mapped to multiple blocks of RAM
      [Figure: one RAM block holding A[0] ... A[7], versus four RAM blocks holding (A[0], A[4]), (A[1], A[5]), (A[2], A[6]), and (A[3], A[7]). The 4 RAM blocks can be accessed simultaneously!]

  24. Array Partition: Memory Parallelism
      Initially, an array is mapped to one (or more) block(s) of RAM (or BRAM on FPGA)
        o One block of RAM has at most two ports
        o At most two read/write operations can be done in one clock cycle, so parallelism is 2 (too low)
      An array can be partitioned and mapped to multiple blocks of RAM
        o It can also be partitioned into individual elements and mapped to registers
        o Only if your array is small; otherwise the tool will give up
      [Figure: one RAM block holding A[0] ... A[7]; the same array split across four RAM blocks holding (A[0], A[4]), (A[1], A[5]), (A[2], A[6]), and (A[3], A[7]), which can be accessed simultaneously; OR all eight elements held in registers]
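
The four-block layout in the figure, with A[0] and A[4] sharing a block, matches what is usually called cyclic partitioning. As a rough software model, assuming four banks with two ports each, the sketch below computes where each index lands and whether a set of accesses could be served in one cycle. `FACTOR`, `BankSlot`, `cyclic_slot`, and `fits_in_one_cycle` are hypothetical names for illustration, not part of any HLS tool.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t FACTOR = 4;  // number of banks, as in the 4-RAM-block figure

struct BankSlot {
    std::size_t bank;    // which RAM block holds the element
    std::size_t offset;  // address inside that block
};

// Cyclic layout: consecutive indices rotate through the banks,
// so A[0] and A[4] share bank 0, A[1] and A[5] share bank 1, and so on.
BankSlot cyclic_slot(std::size_t index) {
    return BankSlot{index % FACTOR, index / FACTOR};
}

// Returns true when no bank receives more than `ports` accesses,
// i.e. the whole group could be served in the same clock cycle.
bool fits_in_one_cycle(const std::size_t* indices, std::size_t count, std::size_t ports) {
    assert(indices != nullptr || count == 0);
    std::size_t per_bank[FACTOR] = {};
    for (std::size_t k = 0; k < count; ++k) {
        per_bank[cyclic_slot(indices[k]).bank] += 1;
    }
    for (std::size_t b = 0; b < FACTOR; ++b) {
        if (per_bank[b] > ports) {
            return false;
        }
    }
    return true;
}
```

With two ports per bank, all eight elements A[0] through A[7] fit in one cycle, whereas A[0], A[4], and A[8] collide on bank 0 and do not.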

  25. Loop Unrolling
      Loop unrolling exposes higher parallelism and achieves shorter latency
        o Pros: decreases loop overhead; increases parallelism for scheduling
        o Cons: increases operation count, which may negatively impact area, power, and timing
      Original loop, with the unroll pragma (N x m cycles without unrolling):
          for (int i = 0; i < N; i++)
          #pragma HLS unroll
            A[i] = B[i] + C[i];
      Unrolled loop (m cycles):
          A[0] = B[0] + C[0];
          A[1] = B[1] + C[1];
          A[2] = B[2] + C[2];
          ...
      Assume A[i] = B[i] + C[i] takes m cycles.
      The m-cycle result holds only if A, B, and C are fully partitioned!
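
To make the transformation concrete, here is a small sketch that writes a factor-4 unroll by hand and checks that it computes the same result as the rolled loop. The trip count `N = 8`, the factor of 4, and the function names are assumptions for illustration. A regular C++ compiler ignores the `#pragma HLS` line (possibly with an unknown-pragma warning), so this only checks functional equivalence, not cycle counts.

```cpp
#include <cassert>

constexpr int N = 8;  // assumed trip count, a multiple of the unroll factor

// Rolled form with the pragma as a hint for an HLS tool.
void vadd_rolled(int A[N], const int B[N], const int C[N]) {
    for (int i = 0; i < N; i++) {
#pragma HLS unroll factor=4
        A[i] = B[i] + C[i];
    }
}

// What a factor-4 unroll amounts to: four independent adds per iteration.
void vadd_unrolled4(int A[N], const int B[N], const int C[N]) {
    static_assert(N % 4 == 0, "sketch assumes N is a multiple of 4");
    for (int i = 0; i < N; i += 4) {
        A[i]     = B[i]     + C[i];
        A[i + 1] = B[i + 1] + C[i + 1];
        A[i + 2] = B[i + 2] + C[i + 2];
        A[i + 3] = B[i + 3] + C[i + 3];
    }
}

// Runs both versions on the same inputs and asserts they agree.
void check_unroll_equivalence() {
    int B[N], C[N], rolled[N], unrolled[N];
    for (int i = 0; i < N; i++) {
        B[i] = i;
        C[i] = 10 * i;
    }
    vadd_rolled(rolled, B, C);
    vadd_unrolled4(unrolled, B, C);
    for (int i = 0; i < N; i++) {
        assert(rolled[i] == unrolled[i]);
    }
}
```

The four adds in one iteration have no dependence on each other, which is what lets a scheduler run them in parallel, provided the arrays offer enough memory ports.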

  26. Loop Pipelining
      Loop pipelining is one of the most important optimizations for high-level synthesis
        o Allows a new iteration to begin processing before the previous iteration is complete
        o Key metric: Initiation Interval (II), in number of cycles
          for (i = 0; i < N; ++i)
          #pragma HLS pipeline
            p[i] = x[i] * y[i];
      [Figure: schedule over cycles for iterations i=0 to i=3, each loading x[i] and y[i] (ld), multiplying, and storing p[i] (st); a new iteration starts every cycle, so II = 1. Legend: ld = Load, st = Store]
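
The effect of II on total latency can be estimated with the usual back-of-the-envelope model: the first iteration takes the full iteration latency, and each later iteration starts II cycles after the previous one. The sketch below encodes that model; the function names and the `depth` parameter (cycles for one iteration, for example 3 for load, multiply, store) are assumptions for illustration, and real tool reports may add a few cycles of overhead.

```cpp
#include <cassert>
#include <cstdint>

// Cycles for a loop with no overlap: each iteration runs to completion
// before the next one starts.
std::uint64_t sequential_cycles(std::uint64_t trips, std::uint64_t depth) {
    return trips * depth;
}

// Cycles for a pipelined loop: the first iteration takes `depth` cycles,
// and every later iteration starts `ii` cycles after the previous one.
std::uint64_t pipelined_cycles(std::uint64_t trips, std::uint64_t depth, std::uint64_t ii) {
    assert(ii >= 1 && depth >= 1);
    if (trips == 0) {
        return 0;
    }
    return depth + ii * (trips - 1);
}
```

For four iterations of depth 3, this model gives 12 cycles without pipelining and 6 cycles at II = 1; for long loops the total approaches II cycles per iteration.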

  27. Put-together: Pipeline + Unroll + Partition
      The three techniques are frequently used together to boost computation efficiency
          for (int i = 0; i < N; i++) {
            for (int j = 0; j < M; j++) {
              A[i][j] = B[i][j] * C[i][j];
            }
          }
      [Figure: arrays A, B, and C each stored in a single RAM block (RAM block 1, 2, and 3)]

  28. Put-together: Pipeline + Unroll + Partition
      The three techniques are frequently used together to boost computation efficiency
          for (int i = 0; i < N; i++) {
            for (int j = 0; j < M; j++) {
          #pragma HLS unroll
              A[i][j] = B[i][j] * C[i][j];
            }
          }
      [Figure: arrays A, B, and C each stored in a single RAM block, with the unrolled inner-loop operations marked "compute in parallel"]
      Memory ports are limited to 2, so the arrays need to be partitioned.

  29. Put-together: Pipeline + Unroll + Partition
      The three techniques are frequently used together to boost computation efficiency
          #pragma HLS array_partition variable=A dim=2 complete
          #pragma HLS array_partition variable=B dim=2 complete
          #pragma HLS array_partition variable=C dim=2 complete
          for (int i = 0; i < N; i++) {
            for (int j = 0; j < M; j++) {
          #pragma HLS unroll
              A[i][j] = B[i][j] * C[i][j];
            }
          }
      [Figure: array A split into Block 1 (A[0][0], A[1][0], ..., A[M-1][0]), Block 2 (A[0][1], A[1][1], ..., A[M-1][1]), through Block N (A[0][N-1], A[1][N-1], ..., A[M-1][N-1]); B is partitioned the same way]

  30. Put-together: Pipeline + Unroll + Partition
      The three techniques are frequently used together to boost computation efficiency
          #pragma HLS array_partition variable=A dim=2 complete
          #pragma HLS array_partition variable=B dim=2 complete
          #pragma HLS array_partition variable=C dim=2 complete
          for (int i = 0; i < N; i++) {
            for (int j = 0; j < M; j++) {
          #pragma HLS unroll
              A[i][j] = B[i][j] * C[i][j];
            }
          }
      [Figure: timeline of outer-loop iterations i=0, i=1, i=2]

  31. Put-together: Pipeline + Unroll + Partition
      The three techniques are frequently used together to boost computation efficiency
          #pragma HLS array_partition variable=A dim=2 complete
          #pragma HLS array_partition variable=B dim=2 complete
          #pragma HLS array_partition variable=C dim=2 complete
          for (int i = 0; i < N; i++) {
          #pragma HLS pipeline II=1
            for (int j = 0; j < 32; j++) {
          #pragma HLS unroll factor=8
              A[i][j] = B[i][j] * C[i][j];
            }
          }
      [Figure: two timelines of outer-loop iterations i=0, i=1, i=2, one before and one after adding the pipeline pragma]
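
As a self-contained sketch, the slide's loop nest can be wrapped in a function and checked in plain C++ before it is handed to an HLS tool. The pragmas are copied from the slide; the function name `elementwise_mul`, the constant `ROWS = 4` standing in for N, and the checking helper are assumptions for illustration. A regular compiler ignores the `#pragma HLS` lines (possibly with a warning), so this verifies only the arithmetic, not the schedule.

```cpp
#include <cassert>

constexpr int ROWS = 4;   // assumed value standing in for the slide's N
constexpr int COLS = 32;  // inner trip count taken from the slide

// Element-wise product with the three optimizations applied together.
void elementwise_mul(int A[ROWS][COLS], int B[ROWS][COLS], int C[ROWS][COLS]) {
#pragma HLS array_partition variable=A dim=2 complete
#pragma HLS array_partition variable=B dim=2 complete
#pragma HLS array_partition variable=C dim=2 complete
    for (int i = 0; i < ROWS; i++) {
#pragma HLS pipeline II=1
        for (int j = 0; j < COLS; j++) {
#pragma HLS unroll factor=8
            A[i][j] = B[i][j] * C[i][j];
        }
    }
}

// Software-only check of the kernel against directly computed products.
void check_elementwise_mul() {
    static int A[ROWS][COLS], B[ROWS][COLS], C[ROWS][COLS];
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            B[i][j] = i + j;
            C[i][j] = j - i;
            A[i][j] = 0;
        }
    }
    elementwise_mul(A, B, C);
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            assert(A[i][j] == (i + j) * (j - i));
        }
    }
}
```

Running the same source through a software testbench first is a common workflow, since functional bugs are much cheaper to find there than in a synthesis report.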

  32. Hardware Specialization with HLS
      Where does performance gain come from? Specialization!
      Data type specialization
        o Arbitrary-precision fixed-point, custom floating-point
      Interface/communication specialization
        o Streaming, memory-mapped I/O, etc.
      Memory specialization
        o Array partitioning, data reuse, etc.
      Compute specialization
        o Unrolling, pipelining, dataflow, multithreading, etc.
      Architecture specialization
        o Pipelined, recursive, hybrid, etc.
      Levels:
        1. System-level: architecture, interface
        2. Module-level: compute, memory
        3. Bit-level: data type
      Talk more in the future

  33. Summary
      HLS is good!
      HLS is all about pragmas
      Optimization starting point:
        o Memory partition + loop unrolling + loop pipelining
      Next lecture:
        o More about loop optimization
