Memory and Data Movement Optimization Seminar by Callie Hao at Georgia Tech
Callie Hao, Assistant Professor at Georgia Tech, hosts this seminar on Memory and Data Movement Optimization. It covers Burst Mode, Wide Bus Transactions, double buffering, streaming, and C/RTL co-simulation, and shares important dates for proposal and paper presentations along with a discussion of formal projects.
Presentation Transcript
L10: Memory and Data Movement Optimization
Cong (Callie) Hao (callie.hao@ece.gatech.edu), Assistant Professor, ECE, Georgia Institute of Technology
Sharc-lab @ Georgia Tech: https://sharclab.ece.gatech.edu/
Logistics
I'll be away from this Thursday till next Saturday, so I'll have to miss three classes, sorry!
This Thursday (2/9): o Convolution, by Akshay. He's the one who developed lab2 last year, so he knows everything.
Next Tuesday (2/14): o Convolution continued + Lab1 review
Next Thursday (2/16): o Informal project discussion. We will have several project leaders coming to the class to discuss their projects. Start to team up.
Callie Hao | Sharc-lab @ Georgia Institute of Technology https://sharclab.ece.gatech.edu/
Important Dates
Proposal presentation: o March 2, 7, 9, 14, 16 (will finish before spring break) o ~10 min, max 13 min (hard cut) o Presentation 5 pt, report 5 pt
Final presentation: o May 3, 2:40-5:30 pm (can do it earlier if needed) o ~20 min o 10 pt
Paper presentation: o March 28, March 31, Apr. 5 o ~15 min, max 17 min (hard cut) o Related to your proposal (hopefully, though not required) o 10 pt
Will release a sign-up sheet. In person. Please specify each individual team member's contribution; it is not necessarily true that all members get the same score.
Outline: Burst mode. Wide bus transaction. Double buffer. Streaming. C/RTL co-simulation.
Burst Mode for Continuous Memory Access

void test( FIX_TYPE A[100][100], ...) {   // FIX_TYPE is 32 bit
#pragma HLS interface m_axi port=A offset=slave bundle=mem
    FIX_TYPE A_local[100][100];
    for(int i = 0; i < 100; i++) {
        for(int j = 0; j < 100; j++) {
            A_local[i][j] = A[i][j];
        }
    }
    ...
}

Tool report: "Multiple burst reads of length 10000 and bit width 32 in loop xxx has been inferred on port 'mem'". One data element, one cycle.
Bad Practice

for(int i = 0; i < 100; i++) {
    for(int j = 0; j < 100; j+=2) {
        A_local[i][j] = A[i][j];
    }
}
> 12000 cycles (expected 5000): does not burst because the memory access is not contiguous.

for(int j = 0; j < 100; j++) {
    for(int i = 0; i < 100; i++) {
        A_local[i][j] = A[i][j];
    }
}
> 20000 cycles (expected 10000)
Wider Bus Transaction

void test( FIX_TYPE A[100][100], ...) {   // FIX_TYPE is 32 bit
#pragma HLS interface m_axi port=A offset=slave bundle=mem
    FIX_TYPE A_local[100][100];
    for(int i = 0; i < 100; i++) {
        for(int j = 0; j < 100; j++) {
            A_local[i][j] = A[i][j];
        }
    }
    ...
}

Tool report: "Multiple burst reads of length 10000 and bit width 32 in loop xxx has been inferred on port 'mem'". One 32-bit data element per cycle. But the AXI bus is 512 bits wide, so 15/16 of the bandwidth is wasted!
Wider Bus Transaction: create a customized wide (320-bit) data type.

typedef ap_uint<320> MEM_TYPE;

void test( MEM_TYPE A[100*10], ...) {   // reorganized into a 1D array
#pragma HLS interface m_axi port=A offset=slave bundle=mem
    FIX_TYPE A_local[100*100];
#pragma HLS array_partition variable=A_local cyclic factor=10
    for(int i = 0; i < 100*10; i++) {
#pragma HLS pipeline
        MEM_TYPE data = A[i];
        for(int ii = 0; ii < 10; ii++) {
            // note: ap_uint::range takes (high, low)
            A_local[i*10 + ii] = data.range(31 + ii*32, ii*32);
        }
    }
    ...
}

Reads 320 bits each cycle. [INFO] Multiple burst reads of length 1000 and bit width 512
Wider Bus Transaction
(Figure: the 320-bit customized datatype packs ten 32-bit elements per wide word; each wide word is scattered across the ten partitioned local arrays.)
Single Buffer Limitation
Read-Execute-Write creates a dependency: DRAM -> Read PE -> Input Buffer (BRAM) -> Execution PE -> Output Buffer (BRAM) -> Write PE -> DRAM. (*PE: processing element)
With a single buffer, the Read, Execute, and Write phases must run strictly one after another.
Overlap Read and Write (Pre-fetching)

Before:
for(int i = 0; i < N; i++) {
    read(buf_A, i);
    execute(buf_A, buf_B);
    write(buf_B, i);
}

After:
read(buf_A, 0);
for(int i = 0; i < N; i++) {
    execute(buf_A, buf_B);
    write(buf_B, i);
    if(i < N-1) read(buf_A, i+1);
}

The read of the next tile is hidden behind the current execute and write. But the next read targets the same buf_A that execute is still using. How do we resolve this dependency?
Double Buffer (Ping-Pong Buffer)
Two input buffers alternate: while one (ping) feeds Execute, the other (pong) is being filled by Read, so Read, Execute, and Write can overlap.
Double Buffer (Ping-Pong Buffer)

read(buf_A_Ping, 0);
for(int i = 0; i < N; i++) {
    if( i % 2 == 0 ) {
        execute(buf_A_Ping, buf_B);
        write(buf_B, i);
        if( i < N-1 ) read(buf_A_Pong, i+1);
    } else {
        execute(buf_A_Pong, buf_B);
        write(buf_B, i);
        if( i < N-1 ) read(buf_A_Ping, i+1);
    }
}

The read of the next tile is hidden behind the execute and write of the current one.
Double Buffer (Ping-Pong Buffer), with the output double-buffered as well:

read(buf_A_Ping, 0);
for(int i = 0; i < N; i++) {
    if( i % 2 == 0 ) {
        execute(buf_A_Ping, buf_B_Ping);
        if( i > 0 ) write(buf_B_Pong, i-1);
        if( i < N-1 ) read(buf_A_Pong, i+1);
    } else {
        execute(buf_A_Pong, buf_B_Pong);
        if( i > 0 ) write(buf_B_Ping, i-1);
        if( i < N-1 ) read(buf_A_Ping, i+1);
    }
}
// epilogue: flush the last output tile
if( xxx ) write(buf_B_Ping, i); else write(buf_B_Pong, i);
Ping-Pong Buffer Pros and Cons
Pros: o Overlapped execution, shorter latency.
Cons: o Memory overhead (4x) o Programmer's effort.
Producer-Consumer Paradigm: another way to explain overlapped execution.
Data Streaming for Producer-Consumer
A stream is an unbounded, continuously updating data set. o Unbounded means of unknown or unlimited size. o A sequence of data flowing unidirectionally between a source (producer) process and a destination (consumer) process.
Examples: real-time video, audio, etc. Processing can start early; there is no need to wait for the entire data set.
Data Streaming for Producer-Consumer
Enabled by FIFO (first-in first-out) buffers. o The consumer process can start accessing the data inside the FIFO buffer as soon as the producer inserts the data into the buffer. o If the buffer is full (or empty), the producer (or consumer) automatically stalls.
Data Streaming in HLS
Step 1: create FIFOs using hls::stream<type>. o Specify a depth (how large the FIFO is).
Step 2: organize the design into two functions or loops. o One writes to the FIFO, one reads from it.
Step 3: apply the dataflow pragma.
Data Streaming in HLS Example

void test( FIX_TYPE* A, FIX_TYPE* sum ) {
#pragma HLS dataflow
    hls::stream<FIX_TYPE> buffer;
#pragma HLS STREAM variable=buffer depth=10
    READ_A: for(int i = 0; i < 100; i++) {
        buffer.write(A[i]);
    }
    COMPUTE_RES: for(int i = 0; i < 100; i++) {
        FIX_TYPE d = buffer.read();
        FIX_TYPE res = d * d + i;
        sum[i] = res;
    }
}
Pitfalls of Dataflow and Streaming
Inside a dataflow region: o Single producer, single consumer: everything must be streamlined, no bypass. o No feedback between tasks. o No conditional execution of tasks. o No loops with multiple exit conditions.
Canonical Forms

void dataflow_top(Input0, Input1, Output0, Output1) {
    for (int i = 0; i < N; i++) {
#pragma HLS dataflow
        Streaming_Buffer C0, C1, C2;
        func1(read_Input0, read_Input1, write_C0, write_C1);
        func2(read_C0, read_C1, write_C2);
        func3(read_C2, write_Output0, write_Output1);
    }
}

(Diagram: func1 writes C0 and C1; func2 reads C0 and C1 and writes C2; func3 reads C2.)
Resource Utilization
Resource utilization (DSP, LUT, BRAM) and latency values. o These are only estimates by Vitis HLS, not reliable! I'm surprised by how inaccurate the estimates can be (they used to be fine, at least for DSPs).
What to do?! o C/RTL co-simulation (to be introduced later). o Run implementation: really map the generated RTL onto the FPGA.
Synthesis vs. Implementation
This is what you get from Synthesis: performance and resource estimation.
Synthesis vs. Implementation
To run Implementation. If you run into errors: o Try last year's vitis_hls.
Synthesis vs. Implementation
(Figure: Implementation results side by side with Synthesis estimates.)
Summary
Widening the memory port. Double buffering to hide the data loading latency. A more advanced (challenging) technique: streaming and dataflow. BUT! The dataflow architecture is really efficient and will be very useful in your design!