Efficient Training of Dense Linear Models on FPGA with Low-Precision Data


Training dense linear models on an FPGA with low-precision data increases hardware efficiency while maintaining statistical efficiency. The approach relies on stochastic rounding to preserve the convergence quality of Stochastic Gradient Descent (SGD) and opens up a multivariate trade-off space between data properties, FPGA implementation, and SGD parameters. The combination of low-precision data and an FPGA implementation is a promising direction for accelerating machine learning workloads.





Presentation Transcript


  1. FPGA-accelerated Dense Linear Machine Learning: A Precision-Convergence Trade-off Kaan Kara, Dan Alistarh, Gustavo Alonso, Onur Mutlu, Ce Zhang 1

  2. What is the most efficient way of training dense linear models on an FPGA? FPGAs can handle floating-point, but they are certainly not the best choice for it: we have tried, and the result is on par with a 10-core Xeon, not a clear win. Yet! How about some recent developments in machine learning research? In theory, low-precision data leads to EXACTLY the same end result. 2

  3. Stochastic Gradient Descent (SGD). Data set A: M samples, N features; model x; labels b. Objective: min_x sum_i (a_i . x - b_i)^2. The algorithm:
  while not converged do
    for i from 1 to M do
      x_t = x_{t-1} - gamma * (a_i . x_{t-1} - b_i) * a_i
    end
  end
  Gradient for sample a_i: (dot(a_i, x) - b_i) * a_i. System view: Data Source (DRAM, SSD, Sensor) -> Memory (CPU cache, DRAM, FPGA BRAM) holding the data a_i and the model x -> Computation (CPU, GPU, FPGA). 3
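To make the loop concrete, here is a minimal NumPy sketch of this SGD recurrence for least-squares linear regression; the problem sizes, step size gamma and helper name sgd_dense_linear are illustrative assumptions, not the original implementation.

```python
# Minimal sketch of the SGD loop on the slide, on a synthetic least-squares problem.
import numpy as np

def sgd_dense_linear(A, b, gamma=0.01, epochs=10):
    M, N = A.shape
    x = np.zeros(N)                      # model, initialized to zero
    for _ in range(epochs):              # "while not converged"
        for i in range(M):               # one pass over the M samples
            a_i = A[i]
            residual = a_i @ x - b[i]    # dot(a_i, x) - b_i
            x -= gamma * residual * a_i  # gradient step
    return x

# Tiny usage example on synthetic data.
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 20))
x_true = rng.standard_normal(20)
b = A @ x_true
x_hat = sgd_dense_linear(A, b, gamma=0.01, epochs=20)
print(np.linalg.norm(x_hat - x_true))
```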

  4. Advantage of Low-Precision Data + FPGA. The SGD loop is unchanged, but the data set A (M samples, N features) is compressed: each sample a_i is stored and streamed in low precision, so less data moves from the Data Source (DRAM, SSD, Sensor) through Memory (CPU cache, DRAM, FPGA BRAM) to the Computation (CPU, GPU, FPGA). The gradient is still (dot(a_i, x) - b_i) * a_i. 4

  5. Key Takeaway. Using an FPGA, we can increase the hardware efficiency while maintaining the statistical efficiency of SGD for dense linear models. (Figures: training loss vs. iterations shows stochastic rounding reaching the same quality as full precision while naive rounding does not; training loss vs. execution time shows the low-precision FPGA version converging faster.) In practice, the system opens up a multivariate trade-off space: data properties, FPGA implementation, SGD parameters. 5

  6. Outline for the Rest: 1. Stochastic quantization and data layout. 2. Target platform: Intel Xeon+FPGA. 3. Implementation: from 32-bit to 1-bit SGD on FPGA. 4. Evaluation: main results and side effects. 5. Conclusion. 6

  7. Stochastic Quantization [1]. Example: the value 0.7 with quantization levels 0, 1, 2. Naive solution, nearest rounding (0.7 -> 1): the expectation changes, so SGD converges to a different solution. Stochastic rounding: round 0.7 to 0 with probability 0.3 and to 1 with probability 0.7: the expectation remains the same, so SGD converges to the same solution. For the objective min_x sum_i (a_i . x - b_i)^2, the gradient g_i = (a_i . x - b_i) * a_i is quadratic in a_i, so the quantized gradient is computed as g_i = (Q_1(a_i) . x - b_i) * Q_2(a_i) using two independent quantized samples Q_1(a_i) and Q_2(a_i); reusing a single quantized copy would bias the expectation. Each iteration of the algorithm needs fresh samples. [1] H. Zhang, K. Kara, J. Li, D. Alistarh, J. Liu, C. Zhang: ZipML: An End-to-end Bitwise Framework for Dense Generalized Linear Models, arXiv:1611.05402. 7
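As a rough illustration of the quantized gradient above, the following sketch implements stochastic rounding and the two independent quantizations; the helper names, the [-1, 1] range and the 4-bit setting are assumptions for the example.

```python
# Stochastic rounding and the doubly-quantized gradient (sketch).
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round(a, low, high, bits):
    """Quantize values in [low, high] to 2**bits levels, rounding up with
    probability equal to the fractional distance to the lower level."""
    levels = 2**bits - 1
    scaled = (a - low) / (high - low) * levels      # position on the integer grid
    floor = np.floor(scaled)
    up = rng.random(a.shape) < (scaled - floor)     # P(round up) = fractional part
    q = floor + up
    return low + q / levels * (high - low)          # back to the original range

def quantized_gradient(a_i, b_i, x, low=-1.0, high=1.0, bits=4):
    # Two *independent* quantized copies of the same sample, as required
    # because the gradient is quadratic in a_i.
    q1 = stochastic_round(a_i, low, high, bits)
    q2 = stochastic_round(a_i, low, high, bits)
    return (q1 @ x - b_i) * q2
```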

  8. Before the FPGA: The Data Layout. Originally, every feature of every sample (m samples, n features) is a 32-bit float value, and streaming this full-precision data is what is currently slow. Quantize to 4-bit: for each quantization index, every value is stored as two independent 4-bit quantizations (4-bit | 4-bit), which still gives 4x compression over 32-bit floats. Each iteration of SGD reads the data with a new quantization index, so that fresh stochastic samples are used. 8
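A possible software view of this layout, assuming values normalized to [-1, 1]: each 32-bit float becomes two independent 4-bit stochastic quantizations packed into a single byte. The packing order and function names are illustrative.

```python
# Packing two independent 4-bit quantizations per value into one byte (sketch).
import numpy as np

rng = np.random.default_rng(1)

def quantize_4bit(a, low, high):
    """Stochastically round values in [low, high] to integer levels 0..15."""
    scaled = (a - low) / (high - low) * 15
    floor = np.floor(scaled)
    return (floor + (rng.random(a.shape) < (scaled - floor))).astype(np.uint8)

def pack_sample(a, low=-1.0, high=1.0):
    """Pack two independent 4-bit quantizations of a sample, one byte per value."""
    q1 = quantize_4bit(a, low, high)      # lower nibble
    q2 = quantize_4bit(a, low, high)      # upper nibble (independent sample)
    return (q2 << 4) | q1                 # uint8 array: 4-bit | 4-bit

def unpack_sample(packed):
    return packed & 0x0F, packed >> 4     # (q1, q2) integer levels

a = rng.uniform(-1.0, 1.0, size=8).astype(np.float32)
packed = pack_sample(a)
print(packed.nbytes, "bytes instead of", a.nbytes)   # 4x compression
```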

  9. Target Platform: Intel Xeon+FPGA. An Intel Xeon CPU (Ivy Bridge, Xeon E5-2680 v2, 10 cores, 25 MB L3) accesses 96 GB of main memory through its memory controller at ~30 GB/s. The accelerator, an Altera Stratix V FPGA with a QPI endpoint and caches, is attached to the CPU over QPI at 6.5 GB/s. 9

  10. Full-Precision SGD on FPGA. The circuit consumes one 64 B cache line (16 32-bit floating-point values) per cycle, for a processing rate of 12.8 GB/s. (Figure: datapath with 16 float multipliers and 16 float-to-fixed converters feeding a 16-input fixed adder tree that computes the dot product a.x; a fixed-to-float converter, a float adder and a float multiplier compute the gradient scale (a.x - b); 16 float multipliers form the gradient (a.x - b)a; float-to-fixed converters and 16 fixed adders accumulate it until the batch size is reached, and the model x, stored in FPGA BRAM, is then updated as x - (a.x - b)a. Data source: DRAM, SSD, sensor; storage: FPGA BRAM; computation: custom logic.) 10
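The following is a minimal software model of the datapath sketched above (dot product, gradient scale, batch accumulation, model update); it mirrors the stages but is not the FPGA implementation itself, and the batch size and names are illustrative.

```python
# Software model of the batched SGD datapath (sketch).
import numpy as np

def fpga_style_sgd(A, b, gamma=0.01, batch_size=16, epochs=10):
    M, N = A.shape
    x = np.zeros(N)
    for _ in range(epochs):
        grad_acc = np.zeros(N)          # accumulator in front of the model update
        in_batch = 0
        for i in range(M):
            a_i = A[i]
            dot = a_i @ x               # adder-tree dot product a.x
            scale = dot - b[i]          # (a.x - b): the scalar gradient factor
            grad_acc += scale * a_i     # (a.x - b)a, accumulated
            in_batch += 1
            if in_batch == batch_size:  # "batch size is reached?"
                x -= gamma * grad_acc / batch_size
                grad_acc[:] = 0.0
                in_batch = 0
        if in_batch:                    # flush a partial batch at epoch end
            x -= gamma * grad_acc / in_batch
    return x
```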

  11. Scale out the design for low-precision data! How can we increase the internal parallelism while maintaining the processing rate? 11

  12. Challenge + Solution. With low-precision data, one 64 B cache line carries far more values than 16 floats: 32 for Q8, 64 for Q4, 128 for Q2, 256 for Q1, so the internal parallelism of the circuit has to grow accordingly; scaling out the floating-point design is not trivial. Two observations help: 1) we can get rid of floating-point arithmetic, and 2) we can further simplify the integer arithmetic for the lowest-precision data. 12

  13. Trick 1: Selection of Quantization Levels. 1. Select the lower and upper bound [L, U] of the data. 2. Select the size of the quantization interval. With this choice, all quantized values are integers, so the datapath can work with integer arithmetic only. 13
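A small sketch of Trick 1, under the assumption that the data is clipped to a known range [L, U]: after choosing the bounds and the interval size, every quantized value is a small integer level, and the dot product can be evaluated in integer arithmetic with a single final rescale. The helper names are illustrative.

```python
# Integer quantization levels and an integer-only dot product (sketch).
import numpy as np

rng = np.random.default_rng(2)

def to_integer_levels(a, low, high, bits):
    """Stochastically round a (clipped to [low, high]) to integer levels 0..2**bits - 1."""
    levels = 2**bits - 1
    scaled = (np.clip(a, low, high) - low) / (high - low) * levels
    floor = np.floor(scaled)
    return (floor + (rng.random(a.shape) < (scaled - floor))).astype(np.int32)

def integer_dot(q, x, low, high, bits):
    """Estimate dot(a, x) from integer levels q: a ~ low + q * delta."""
    delta = (high - low) / (2**bits - 1)
    return low * x.sum() + delta * (q @ x)

a = rng.uniform(-1.0, 1.0, 1000)
x = rng.standard_normal(1000)
q = to_integer_levels(a, -1.0, 1.0, bits=4)
print(a @ x, integer_dot(q, x, -1.0, 1.0, bits=4))   # exact vs. quantized estimate
```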

  14. This is enough for Q4 and Q8. (Figure: the same pipeline as the floating-point circuit, but operating on K = 32, 64 or 128 quantized values per 64 B cache line (Q8, Q4, Q2): K fixed multipliers and a K-input fixed adder tree compute the dot product Q(a).x; one fixed adder and a bit-shift produce the gradient scale (Q(a).x - b); K fixed multipliers form (Q(a).x - b)Q(a); fixed adders update the model once the batch size is reached.) 14

  15. Trick 2: Implement Multiplication Using a Multiplexer. Q2 values are normalized to [0, 2] (levels 0, 1, 2) or [-1, 1] (levels -1, 0, 1), and Q1 values to {0, 1}. Multiplying a 32-bit operand by such a value needs no real multiplier: the Q2 multiplier is a multiplexer driven by Q2value[1:0] that selects 0, In[31:0] or In[31:0] << 1 (respectively -In[31:0], 0 or In[31:0]); the Q1 multiplier is a multiplexer driven by Q1value[0] that selects 0 or In[31:0]. 15
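To illustrate the trick, the following sketch emulates the Q1 and Q2 multipliers in software: the "multiplication" is just a selection, plus a bit-shift for the level 2. The {0, 1, 2} encoding and function names are assumptions for this example.

```python
# Multiplexer-style "multiplication" by 1-bit and 2-bit quantized values (sketch).

def q1_multiply(value_bit: int, operand: int) -> int:
    """Q1 'multiplier': select 0 or the operand."""
    return operand if value_bit else 0

def q2_multiply(level: int, operand: int) -> int:
    """Q2 'multiplier' for levels {0, 1, 2}: select 0, operand, or operand << 1."""
    if level == 0:
        return 0
    if level == 1:
        return operand
    return operand << 1          # multiply by 2 is a single bit-shift

# Example: multiplying 13 by each level without a multiplier.
print([q2_multiply(l, 13) for l in (0, 1, 2)])   # [0, 13, 26]
print([q1_multiply(b, 13) for b in (0, 1)])      # [0, 13]
```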

  16. Support for all Qx. The circuit for Q8, Q4 and Q2 SGD consumes one 64 B cache line per cycle (K = 32, 64 or 128 quantized values) using K fixed multipliers (or multiplexers) and a K-input adder tree, for a processing rate of 12.8 GB/s. For Q1, a cache line holds 256 values, so it is split into two parts of 128 values that are processed on a 128-wide datapath; the Q1 circuit therefore runs at 6.4 GB/s. 16

  17. Support for all Qx: resource usage on the Stratix V (same circuits as the previous slide).
  Data type | Logic | DSP | BRAM
  float     | 38%   | 12% | 7%
  Q8        | 35%   | 25% | 6%
  Q4        | 36%   | 50% | 6%
  Q2, Q1    | 43%   | 1%  | 6%
  Consistent with Trick 2, the Q2/Q1 circuit needs almost no DSPs because its multiplications are multiplexers. 17

  18. Data sets for evaluation.
  Name    | Size   | # Features | # Classes
  MNIST   | 60,000 | 780        | 10
  GISETTE | 6,000  | 5,000      | 2
  EPSILON | 10,000 | 2,000      | 2
  SYN100  | 10,000 | 100        | Regression
  SYN1000 | 10,000 | 1,000      | Regression
  Following the Intel legal guidelines on publishing performance numbers, we would like to make the reader aware that the results in this publication were generated using preproduction hardware and software, and may not reflect the performance of production or future systems. 18

  19. SGD Performance Improvement. (Figures: training loss vs. time on a log scale, comparing float CPU 1-thread, float CPU 10-threads, float FPGA and quantized FPGA; speedups of 2x and 7.2x are annotated against the floating-point baselines.) Left: SGD on SYN100, where 4-bit (Q4 FPGA) works. Right: SGD on SYN1000, where 8-bit (Q8 FPGA) works. 19

  20. 1-bit also works on some data sets. (Figure: training loss vs. time for SGD on MNIST, digit 7, comparing float CPU 1-thread, float CPU 10-threads, float FPGA and Q1 FPGA; the Q1 FPGA version is 10.6x faster.) MNIST multi-classification accuracy:
  Precision      | Accuracy | Training Time
  float CPU-SGD  | 85.82%   | 19.04 s
  1-bit FPGA-SGD | 85.87%   | 2.41 s
  20

  21. Naive Rounding vs. Stochastic Rounding. (Figure: training loss vs. number of epochs for float, for stochastically rounded Q2, Q4, Q8, and for naively rounded F2, F4, F8.) Stochastic quantization results in unbiased convergence: the stochastically rounded runs reach the same training loss as full precision, while naive rounding converges with a visible bias. 21
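A quick numeric check of this claim, reusing the 0.7 example from slide 7 (the sample count and names are illustrative): stochastic rounding matches the original value in expectation, nearest rounding does not.

```python
# Unbiasedness of stochastic rounding vs. nearest rounding (sketch).
import numpy as np

rng = np.random.default_rng(3)
value = 0.7
n = 100_000

# Stochastic rounding between the levels 0 and 1.
stochastic = (rng.random(n) < value).astype(float)
print("nearest rounding:   ", round(value))       # 1 -> biased by +0.3
print("stochastic, average:", stochastic.mean())  # ~0.7 -> unbiased
```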

  22. Effect of Step Size. (Figure: training loss vs. number of epochs, log scale, for float with step sizes 1/2^9, 1/2^12, 1/2^15 and for Q2 with step sizes 2/2^9, 2/2^12, 2/2^15.) The comparison: large step size + full precision vs. small step size + low precision. 22

  23. Conclusion. Highly scalable, parametrizable FPGA-based stochastic gradient descent implementations for linear model training. Open source: www.systems.ethz.ch/fpga/ZipML_SGD. Key takeaways: 1. The way to train linear models on an FPGA should be through stochastically rounded, low-precision data. 2. Multivariate trade-off space: precision vs. end-to-end runtime, convergence quality, design complexity, data and system properties. 23


  25. Backup: Why do dense linear models matter? They are the workhorse algorithm for regression and classification, and sparse, high-dimensional data sets can be converted into dense ones (random features, transfer learning). Why is training speed important? Training is often a human-in-the-loop train/evaluate cycle: configuring parameters, selecting features, examining data dependencies, etc. 25

  26. Effect of Reusing Indexes. (Figure: training loss vs. number of epochs for float and for Q2 with 64, 32, 16, 8, 4 and 2 quantization indexes.) Storing a fresh quantization index for every epoch is not necessary: in practice, reusing 8 indexes is enough. 26
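A minimal sketch of index reuse, assuming the data is quantized offline into a small pool of independently rounded copies ("indexes") that are cycled through across epochs instead of re-quantizing every epoch; the pool size of 8, the Q2 setting and all names are illustrative.

```python
# Reusing a pool of precomputed quantization indexes across epochs (sketch).
import numpy as np

def quantize(A, low, high, bits, rng):
    """One stochastic quantization of the whole data set."""
    levels = 2**bits - 1
    scaled = (np.clip(A, low, high) - low) / (high - low) * levels
    floor = np.floor(scaled)
    q = floor + (rng.random(A.shape) < (scaled - floor))
    return low + q / levels * (high - low)

def sgd_with_index_reuse(A, b, gamma=0.01, epochs=64, num_indexes=8, bits=2):
    rng = np.random.default_rng(4)
    # Each stored "index" holds the two independent quantizations that the
    # gradient needs (see slides 7 and 8).
    pool = [(quantize(A, -1.0, 1.0, bits, rng), quantize(A, -1.0, 1.0, bits, rng))
            for _ in range(num_indexes)]
    M, N = A.shape
    x = np.zeros(N)
    for epoch in range(epochs):
        Q1, Q2 = pool[epoch % num_indexes]   # cycle through the stored indexes
        for i in range(M):
            x -= gamma * (Q1[i] @ x - b[i]) * Q2[i]
    return x
```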
