Efficient Training of Dense Linear Models on FPGA with Low-Precision Data


Training dense linear models on an FPGA with low-precision data increases hardware efficiency while maintaining statistical efficiency. The approach relies on stochastic rounding to preserve the convergence quality of Stochastic Gradient Descent (SGD) and opens up a multivariate trade-off space between data properties, FPGA implementation, and SGD parameters. The combination of low-precision data and an FPGA implementation is a promising direction for accelerating machine learning workloads.





Presentation Transcript


  1. FPGA-accelerated Dense Linear Machine Learning: A Precision-Convergence Trade-off Kaan Kara, Dan Alistarh, Gustavo Alonso, Onur Mutlu, Ce Zhang 1

  2. What is the most efficient way of training dense linear models on an FPGA? FPGAs can handle floating-point, but they are certainly not the best choice for it: we have tried, and the result is on par with a 10-core Xeon, not a clear win. Yet! How about some recent developments in machine learning research? In theory, low-precision data leads to EXACTLY the same end result. 2

  3. Stochastic Gradient Descent (SGD). Data set A: M samples, N features; model x; labels b. Objective: min_x sum_i (a_i . x - b_i)^2. The algorithm:
  while not converged do
    for i from 1 to M do
      x_t = x_{t-1} - gamma * (a_i . x_{t-1} - b_i) * a_i
    end
  end
  Gradient for sample a_i: (dot(a_i, x) - b_i) * a_i. System view: Data Source (DRAM, SSD, Sensor) -> Memory (CPU cache, DRAM, FPGA BRAM) holding the data a_i and the model x -> Computation (CPU, GPU, FPGA). 3
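To make the loop concrete, here is a minimal NumPy sketch of this SGD recurrence for least-squares linear regression; the problem sizes, step size gamma and helper name sgd_dense_linear are illustrative assumptions, not the original implementation.

```python
# Minimal sketch of the SGD loop on the slide, on a synthetic least-squares problem.
import numpy as np

def sgd_dense_linear(A, b, gamma=0.01, epochs=10):
    M, N = A.shape
    x = np.zeros(N)                      # model, initialized to zero
    for _ in range(epochs):              # "while not converged"
        for i in range(M):               # one pass over the M samples
            a_i = A[i]
            residual = a_i @ x - b[i]    # dot(a_i, x) - b_i
            x -= gamma * residual * a_i  # gradient step
    return x

# Tiny usage example on synthetic data.
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 20))
x_true = rng.standard_normal(20)
b = A @ x_true
x_hat = sgd_dense_linear(A, b, gamma=0.01, epochs=20)
print(np.linalg.norm(x_hat - x_true))
```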

  4. Advantage of Low-Precision Data + FPGA. The SGD loop is unchanged, but the data set A (M samples, N features) is compressed: each sample a_i is stored and streamed in low precision, so less data moves from the Data Source (DRAM, SSD, Sensor) through Memory (CPU cache, DRAM, FPGA BRAM) to the Computation (CPU, GPU, FPGA). The gradient is still (dot(a_i, x) - b_i) * a_i. 4

  5. Key Takeaway. Using an FPGA, we can increase the hardware efficiency while maintaining the statistical efficiency of SGD for dense linear models. (Figures: training loss vs. iterations shows stochastic rounding reaching the same quality as full precision while naive rounding does not; training loss vs. execution time shows the low-precision FPGA version converging faster.) In practice, the system opens up a multivariate trade-off space: data properties, FPGA implementation, SGD parameters. 5

  6. Outline for the Rest: 1. Stochastic quantization and data layout. 2. Target platform: Intel Xeon+FPGA. 3. Implementation: from 32-bit to 1-bit SGD on FPGA. 4. Evaluation: main results and side effects. 5. Conclusion. 6

  7. Stochastic Quantization [1]. Example: the value 0.7 with quantization levels 0, 1, 2. Naive solution, nearest rounding (0.7 -> 1): the expectation changes, so SGD converges to a different solution. Stochastic rounding: round 0.7 to 0 with probability 0.3 and to 1 with probability 0.7: the expectation remains the same, so SGD converges to the same solution. For the objective min_x sum_i (a_i . x - b_i)^2, the gradient g_i = (a_i . x - b_i) * a_i is quadratic in a_i, so the quantized gradient is computed as g_i = (Q_1(a_i) . x - b_i) * Q_2(a_i) using two independent quantized samples Q_1(a_i) and Q_2(a_i); reusing a single quantized copy would bias the expectation. Each iteration of the algorithm needs fresh samples. [1] H. Zhang, K. Kara, J. Li, D. Alistarh, J. Liu, C. Zhang: ZipML: An End-to-end Bitwise Framework for Dense Generalized Linear Models, arXiv:1611.05402. 7
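As a rough illustration of the quantized gradient above, the following sketch implements stochastic rounding and the two independent quantizations; the helper names, the [-1, 1] range and the 4-bit setting are assumptions for the example.

```python
# Stochastic rounding and the doubly-quantized gradient (sketch).
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round(a, low, high, bits):
    """Quantize values in [low, high] to 2**bits levels, rounding up with
    probability equal to the fractional distance to the lower level."""
    levels = 2**bits - 1
    scaled = (a - low) / (high - low) * levels      # position on the integer grid
    floor = np.floor(scaled)
    up = rng.random(a.shape) < (scaled - floor)     # P(round up) = fractional part
    q = floor + up
    return low + q / levels * (high - low)          # back to the original range

def quantized_gradient(a_i, b_i, x, low=-1.0, high=1.0, bits=4):
    # Two *independent* quantized copies of the same sample, as required
    # because the gradient is quadratic in a_i.
    q1 = stochastic_round(a_i, low, high, bits)
    q2 = stochastic_round(a_i, low, high, bits)
    return (q1 @ x - b_i) * q2
```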

  8. Before the FPGA: The Data Layout. Originally, every feature of every sample (m samples, n features) is a 32-bit float value, and streaming this full-precision data is what is currently slow. Quantize to 4-bit: for each quantization index, every value is stored as two independent 4-bit quantizations (4-bit | 4-bit), which still gives 4x compression over 32-bit floats. Each iteration of SGD reads the data with a new quantization index, so that fresh stochastic samples are used. 8
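A possible software view of this layout, assuming values normalized to [-1, 1]: each 32-bit float becomes two independent 4-bit stochastic quantizations packed into a single byte. The packing order and function names are illustrative.

```python
# Packing two independent 4-bit quantizations per value into one byte (sketch).
import numpy as np

rng = np.random.default_rng(1)

def quantize_4bit(a, low, high):
    """Stochastically round values in [low, high] to integer levels 0..15."""
    scaled = (a - low) / (high - low) * 15
    floor = np.floor(scaled)
    return (floor + (rng.random(a.shape) < (scaled - floor))).astype(np.uint8)

def pack_sample(a, low=-1.0, high=1.0):
    """Pack two independent 4-bit quantizations of a sample, one byte per value."""
    q1 = quantize_4bit(a, low, high)      # lower nibble
    q2 = quantize_4bit(a, low, high)      # upper nibble (independent sample)
    return (q2 << 4) | q1                 # uint8 array: 4-bit | 4-bit

def unpack_sample(packed):
    return packed & 0x0F, packed >> 4     # (q1, q2) integer levels

a = rng.uniform(-1.0, 1.0, size=8).astype(np.float32)
packed = pack_sample(a)
print(packed.nbytes, "bytes instead of", a.nbytes)   # 4x compression
```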

  9. Target Platform: Intel Xeon+FPGA. An Intel Xeon CPU (Ivy Bridge, Xeon E5-2680 v2, 10 cores, 25 MB L3) accesses 96 GB of main memory through its memory controller at ~30 GB/s. The accelerator, an Altera Stratix V FPGA with a QPI endpoint and caches, is attached to the CPU over QPI at 6.5 GB/s. 9

  10. Full-Precision SGD on FPGA. The circuit consumes one 64 B cache line (16 32-bit floating-point values) per cycle, for a processing rate of 12.8 GB/s. (Figure: datapath with 16 float multipliers and 16 float-to-fixed converters feeding a 16-input fixed adder tree that computes the dot product a.x; a fixed-to-float converter, a float adder and a float multiplier compute the gradient scale (a.x - b); 16 float multipliers form the gradient (a.x - b)a; float-to-fixed converters and 16 fixed adders accumulate it until the batch size is reached, and the model x, stored in FPGA BRAM, is then updated as x - (a.x - b)a. Data source: DRAM, SSD, sensor; storage: FPGA BRAM; computation: custom logic.) 10
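The following is a minimal software model of the datapath sketched above (dot product, gradient scale, batch accumulation, model update); it mirrors the stages but is not the FPGA implementation itself, and the batch size and names are illustrative.

```python
# Software model of the batched SGD datapath (sketch).
import numpy as np

def fpga_style_sgd(A, b, gamma=0.01, batch_size=16, epochs=10):
    M, N = A.shape
    x = np.zeros(N)
    for _ in range(epochs):
        grad_acc = np.zeros(N)          # accumulator in front of the model update
        in_batch = 0
        for i in range(M):
            a_i = A[i]
            dot = a_i @ x               # adder-tree dot product a.x
            scale = dot - b[i]          # (a.x - b): the scalar gradient factor
            grad_acc += scale * a_i     # (a.x - b)a, accumulated
            in_batch += 1
            if in_batch == batch_size:  # "batch size is reached?"
                x -= gamma * grad_acc / batch_size
                grad_acc[:] = 0.0
                in_batch = 0
        if in_batch:                    # flush a partial batch at epoch end
            x -= gamma * grad_acc / in_batch
    return x
```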

  11. Scale out the design for low-precision data! How can we increase the internal parallelism while maintaining the processing rate? 11

  12. Challenge + Solution. With low-precision data, one 64 B cache line carries far more values than 16 floats: 32 for Q8, 64 for Q4, 128 for Q2, 256 for Q1, so the internal parallelism of the circuit has to grow accordingly; scaling out the floating-point design is not trivial. Two observations help: 1) we can get rid of floating-point arithmetic, and 2) we can further simplify the integer arithmetic for the lowest-precision data. 12

  13. Trick 1: Selection of Quantization Levels. 1. Select the lower and upper bound [L, U] of the data. 2. Select the size of the quantization interval. With this choice, all quantized values are integers, so the datapath can work with integer arithmetic only. 13
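A small sketch of Trick 1, under the assumption that the data is clipped to a known range [L, U]: after choosing the bounds and the interval size, every quantized value is a small integer level, and the dot product can be evaluated in integer arithmetic with a single final rescale. The helper names are illustrative.

```python
# Integer quantization levels and an integer-only dot product (sketch).
import numpy as np

rng = np.random.default_rng(2)

def to_integer_levels(a, low, high, bits):
    """Stochastically round a (clipped to [low, high]) to integer levels 0..2**bits - 1."""
    levels = 2**bits - 1
    scaled = (np.clip(a, low, high) - low) / (high - low) * levels
    floor = np.floor(scaled)
    return (floor + (rng.random(a.shape) < (scaled - floor))).astype(np.int32)

def integer_dot(q, x, low, high, bits):
    """Estimate dot(a, x) from integer levels q: a ~ low + q * delta."""
    delta = (high - low) / (2**bits - 1)
    return low * x.sum() + delta * (q @ x)

a = rng.uniform(-1.0, 1.0, 1000)
x = rng.standard_normal(1000)
q = to_integer_levels(a, -1.0, 1.0, bits=4)
print(a @ x, integer_dot(q, x, -1.0, 1.0, bits=4))   # exact vs. quantized estimate
```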

  14. This is enough for Q4 and Q8. (Figure: the same pipeline as the floating-point circuit, but operating on K = 32, 64 or 128 quantized values per 64 B cache line (Q8, Q4, Q2): K fixed multipliers and a K-input fixed adder tree compute the dot product Q(a).x; one fixed adder and a bit-shift produce the gradient scale (Q(a).x - b); K fixed multipliers form (Q(a).x - b)Q(a); fixed adders update the model once the batch size is reached.) 14

  15. Trick 2: Implement Multiplication Using a Multiplexer. Q2 values are normalized to [0, 2] (levels 0, 1, 2) or [-1, 1] (levels -1, 0, 1), and Q1 values to {0, 1}. Multiplying a 32-bit operand by such a value needs no real multiplier: the Q2 multiplier is a multiplexer driven by Q2value[1:0] that selects 0, In[31:0] or In[31:0] << 1 (respectively -In[31:0], 0 or In[31:0]); the Q1 multiplier is a multiplexer driven by Q1value[0] that selects 0 or In[31:0]. 15
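To illustrate the trick, the following sketch emulates the Q1 and Q2 multipliers in software: the "multiplication" is just a selection, plus a bit-shift for the level 2. The {0, 1, 2} encoding and function names are assumptions for this example.

```python
# Multiplexer-style "multiplication" by 1-bit and 2-bit quantized values (sketch).

def q1_multiply(value_bit: int, operand: int) -> int:
    """Q1 'multiplier': select 0 or the operand."""
    return operand if value_bit else 0

def q2_multiply(level: int, operand: int) -> int:
    """Q2 'multiplier' for levels {0, 1, 2}: select 0, operand, or operand << 1."""
    if level == 0:
        return 0
    if level == 1:
        return operand
    return operand << 1          # multiply by 2 is a single bit-shift

# Example: multiplying 13 by each level without a multiplier.
print([q2_multiply(l, 13) for l in (0, 1, 2)])   # [0, 13, 26]
print([q1_multiply(b, 13) for b in (0, 1)])      # [0, 13]
```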

  16. Support for all Qx. The circuit for Q8, Q4 and Q2 SGD consumes one 64 B cache line per cycle (K = 32, 64 or 128 quantized values) using K fixed multipliers (or multiplexers) and a K-input adder tree, for a processing rate of 12.8 GB/s. For Q1, a cache line holds 256 values, so it is split into two parts of 128 values that are processed on a 128-wide datapath; the Q1 circuit therefore runs at 6.4 GB/s. 16

  17. Support for all Qx: resource usage on the Stratix V (same circuits as the previous slide).
  Data type | Logic | DSP | BRAM
  float     | 38%   | 12% | 7%
  Q8        | 35%   | 25% | 6%
  Q4        | 36%   | 50% | 6%
  Q2, Q1    | 43%   | 1%  | 6%
  Consistent with Trick 2, the Q2/Q1 circuit needs almost no DSPs because its multiplications are multiplexers. 17

  18. Data sets for evaluation.
  Name    | Size   | # Features | # Classes
  MNIST   | 60,000 | 780        | 10
  GISETTE | 6,000  | 5,000      | 2
  EPSILON | 10,000 | 2,000      | 2
  SYN100  | 10,000 | 100        | Regression
  SYN1000 | 10,000 | 1,000      | Regression
  Following the Intel legal guidelines on publishing performance numbers, we would like to make the reader aware that the results in this publication were generated using preproduction hardware and software, and may not reflect the performance of production or future systems. 18

  19. SGD Performance Improvement. (Figures: training loss vs. time on a log scale, comparing float CPU 1-thread, float CPU 10-threads, float FPGA and quantized FPGA; speedups of 2x and 7.2x are annotated against the floating-point baselines.) Left: SGD on SYN100, where 4-bit (Q4 FPGA) works. Right: SGD on SYN1000, where 8-bit (Q8 FPGA) works. 19

  20. 1-bit also works on some data sets. (Figure: training loss vs. time for SGD on MNIST, digit 7, comparing float CPU 1-thread, float CPU 10-threads, float FPGA and Q1 FPGA; the Q1 FPGA version is 10.6x faster.) MNIST multi-classification accuracy:
  Precision      | Accuracy | Training Time
  float CPU-SGD  | 85.82%   | 19.04 s
  1-bit FPGA-SGD | 85.87%   | 2.41 s
  20

  21. Naive Rounding vs. Stochastic Rounding. (Figure: training loss vs. number of epochs for float, for stochastically rounded Q2, Q4, Q8, and for naively rounded F2, F4, F8.) Stochastic quantization results in unbiased convergence: the stochastically rounded runs reach the same training loss as full precision, while naive rounding converges with a visible bias. 21
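A quick numeric check of this claim, reusing the 0.7 example from slide 7 (the sample count and names are illustrative): stochastic rounding matches the original value in expectation, nearest rounding does not.

```python
# Unbiasedness of stochastic rounding vs. nearest rounding (sketch).
import numpy as np

rng = np.random.default_rng(3)
value = 0.7
n = 100_000

# Stochastic rounding between the levels 0 and 1.
stochastic = (rng.random(n) < value).astype(float)
print("nearest rounding:   ", round(value))       # 1 -> biased by +0.3
print("stochastic, average:", stochastic.mean())  # ~0.7 -> unbiased
```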

  22. Effect of Step Size. (Figure: training loss vs. number of epochs, log scale, for float with step sizes 1/2^9, 1/2^12, 1/2^15 and for Q2 with step sizes 2/2^9, 2/2^12, 2/2^15.) The comparison: large step size + full precision vs. small step size + low precision. 22

  23. Conclusion. Highly scalable, parametrizable FPGA-based stochastic gradient descent implementations for linear model training. Open source: www.systems.ethz.ch/fpga/ZipML_SGD. Key takeaways: 1. The way to train linear models on an FPGA should be through stochastically rounded, low-precision data. 2. Multivariate trade-off space: precision vs. end-to-end runtime, convergence quality, design complexity, data and system properties. 23


  25. Backup: Why do dense linear models matter? They are the workhorse algorithm for regression and classification, and sparse, high-dimensional data sets can be converted into dense ones (random features, transfer learning). Why is training speed important? Training is often a human-in-the-loop train/evaluate cycle: configuring parameters, selecting features, examining data dependencies, etc. 25

  26. Effect of Reusing Indexes. (Figure: training loss vs. number of epochs for float and for Q2 with 64, 32, 16, 8, 4 and 2 quantization indexes.) Storing a fresh quantization index for every epoch is not necessary: in practice, reusing 8 indexes is enough. 26
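A minimal sketch of index reuse, assuming the data is quantized offline into a small pool of independently rounded copies ("indexes") that are cycled through across epochs instead of re-quantizing every epoch; the pool size of 8, the Q2 setting and all names are illustrative.

```python
# Reusing a pool of precomputed quantization indexes across epochs (sketch).
import numpy as np

def quantize(A, low, high, bits, rng):
    """One stochastic quantization of the whole data set."""
    levels = 2**bits - 1
    scaled = (np.clip(A, low, high) - low) / (high - low) * levels
    floor = np.floor(scaled)
    q = floor + (rng.random(A.shape) < (scaled - floor))
    return low + q / levels * (high - low)

def sgd_with_index_reuse(A, b, gamma=0.01, epochs=64, num_indexes=8, bits=2):
    rng = np.random.default_rng(4)
    # Each stored "index" holds the two independent quantizations that the
    # gradient needs (see slides 7 and 8).
    pool = [(quantize(A, -1.0, 1.0, bits, rng), quantize(A, -1.0, 1.0, bits, rng))
            for _ in range(num_indexes)]
    M, N = A.shape
    x = np.zeros(N)
    for epoch in range(epochs):
        Q1, Q2 = pool[epoch % num_indexes]   # cycle through the stored indexes
        for i in range(M):
            x -= gamma * (Q1[i] @ x - b[i]) * Q2[i]
    return x
```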
