Integrated Reduction Technique for Double Precision Accumulator

undefined

An Integrated Reduction Technique

for a Double Precision Accumulator

Krishna Nagar, Yan Zhang, Jason Bakos

Dept. of Computer Science and Engineering

University of South Carolina

Double Precision Accumulation

•

Many kernels targeted for acceleration include

•

For large datasets, values delivered serially to an

accumulator

HPRCTA ’09

A,

set 1

Σ

B,

set 1

C,

set 1

D,

set 2

E,

set 2

F,

set 2

G,

set 3

A+B

+C,

set 1

D+E

+F,

set 2

H,

set 3

I,

set 3

G+H

+I,

set 3

The Reduction Problem

HPRCTA ’09

Partial

sums

Reduction-Based Accumulator:  Previous Work

HPRCTA ’09

Approach

•

Reduction complexity scales with the latency of the core operation

–

Reduce  latency of double precision add?

•

IEEE 754 adder pipeline (assume 4-bit significand):

HPRCTA ’09

Compare

exponents

Add 53-bit

mantissas

De-

normalize

smaller

value

Round

Re-

normalize

1.1011 x 2

1.1110 x 2

1.1011   x 2

0.01111 x 2

10.00101 x 2

10.0011 x 2

1.00011 x 2

Round

1.0010 x 2

Adder Pipeline

HPRCTA ’09

•

Mantissa addition

–

Cascaded, pipelined DSP48 adders

–

Scales well, operates fast

•

De-normalize

–

Exponent comparison and a variable shift of one

significand

–

Xilinx IP uses a DSP48 for the 11-bit

comparison (waste)

Base Conversion

•

Previous work in s.p. MAC designs base conversion

–

Idea:

•

Shift both inputs to the left by amout specified in low-order bits of

exponents

•

Reduces size of exponent, requires wider adder

•

Example:

–

Base-8 conversion:

•

1.01011101, exp=10

 (1.36328125 x 2

 => ~5.7 million)

•

Shift to the left by 6 bits…

•

1010111.01, exp=10 (87.25 x 2

8*2

 = > ~5.7 million)

HPRCTA ’09

Exponent Compare vs. Adder Width

HPRCTA ’09

denorm

DSP48

DSP48

DSP48

renorm

Accumulator Design

HPRCTA ’09

Three-Stage Reduction Architecture

HPRCTA ’09

“Adder” pipeline

Input

buffer

Output

buffer

Input

Three-Stage Reduction Architecture

HPRCTA ’09

“Adder” pipeline

Input

buffer

Output

buffer







B1

Input

Three-Stage Reduction Architecture

HPRCTA ’09

“Adder” pipeline

Input

buffer

Output

buffer







B2

Input

B1

Three-Stage Reduction Architecture

HPRCTA ’09

“Adder” pipeline

Input

buffer

Output

buffer

B1



B2

Input



B3

Three-Stage Reduction Architecture

HPRCTA ’09

“Adder” pipeline

Input

buffer

Output

buffer

B1



Input



B4

B2+B3

Three-Stage Reduction Architecture

HPRCTA ’09

“Adder” pipeline

Input

buffer

Output

buffer



Input



B5

B2+B3

B1+B4

Three-Stage Reduction Architecture

HPRCTA ’09

“Adder” pipeline

Input

buffer

Output

buffer

Input





B6

B2+B3

B1+B4

B5

Three-Stage Reduction Architecture

HPRCTA ’09

“Adder” pipeline

Input

buffer

Output

buffer

Input





B7

B2+B3

+B6

B1+B4

B5

Three-Stage Reduction Architecture

HPRCTA ’09

“Adder” pipeline

Input

buffer

Output

buffer

Input





B8

B2+B3

+B6

B1+B4

+B7

B5

Three-Stage Reduction Architecture

HPRCTA ’09

“Adder” pipeline

Input

buffer

Output

buffer

Input

C1

B2+B3

+B6

B1+B4

+B7

B5+B8

Minimum Set Size

•

Four “configurations”:

•

Deterministic control sequence, triggered by set change:

–

D, A, C, B, A, B, B, C, B/D

•

Minimum set size is 8

HPRCTA ’09

Use Case:  Sparse Matrix-Vector Multiply

HPRCTA ’09

val

col

ptr

…

•

  Group vol/col

•

  Zero-terminate

SpMV Architecture

HPRCTA ’09

•

Enough memory bandwidth to

read:

–

5 val/col pairs (80 x 5 bits) per

cycle

–

~15-20 GB/s

•

Requires minimum number of

entries per row:

–

5 x 8 = 40

–

Many sparse matrices don’t have

this many values per row

–

Zero padding will degrade

performance for many matrices

New SpMV Architecture

HPRCTA ’09

•

Delete tree, replicate accumulator, schedule

matrix data:

Performance Results

HPRCTA ’09

Conclusions

•

Developed serially-delivered accumulator using base-

conversion technique

•

Limited to shallow pipelines

–

Deeper pipelines require large minimum set size

•

4 -> 11, 5 -> 19, 6 -> 23

•

Goal:  new reduction circuit to support deeper pipelines

with no minimum set size

•

Acknowledgements:

–

NSF awards CCF-0844951, CCF-0915608

HPRCTA ’09

Slide Note

Embed Share

Download

An integrated reduction technique for a double precision accumulator is discussed in the research paper by Krishna Nagar, Yan Zhang, and Jason Bakos from the University of South Carolina. The paper addresses issues related to double precision accumulation in kernels targeted for acceleration, presenting solutions for reduction problems and controlling partial sums. It explores reduction-based accumulator approaches and the complexity scaling with core operation latency, as well as optimization techniques for adder pipelines and base conversion in MAC designs.

fellisha Follow

Uploaded on Dec 08, 2024 | 1 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript

An Integrated Reduction Technique for a Double Precision Accumulator Krishna Nagar, Yan Zhang, Jason Bakos Dept. of Computer Science and Engineering University of South Carolina

Double Precision Accumulation n = i Many kernels targeted for acceleration include ( ) f i 1 For large datasets, values delivered serially to an accumulator G+H +I, set 3 D+E +F, set 2 A+B +C, set 1 I, H, G, F, E, D, C, B, A, set 3 set 3 set 3 set 2 set 2 set 2 set 1 set 1 set 1 HPRCTA 09 2

The Reduction Problem + Partial sums Control Mem Mem HPRCTA 09 3

Reduction-Based Accumulator: Previous Work Paper # d.p. adder IP (~1000 slices/ea) Reduc n Logic Reduc n BRAM # DSP48 D.p. adder speed Accumulator speed Out-of- order outputs Prasanna DSA 07 (Virtex 2P) 2 2215 slices 3 n/a 170 MHz 142 MHz Yes Prasanna SSA 07 (Virtex 2P) 1 1804 slices 6 n/a 170 MHz 165 MHz Yes Gerards 08 (Virtex 4) 1 2722 slices 9 3 (from d.p. adder) 324 MHz 200 MHz No This work (Virtex 5) 0 < 1000 slices 0 3 355 MHz 300+ MHz No HPRCTA 09 4

Approach Reduction complexity scales with the latency of the core operation Reduce latency of double precision add? IEEE 754 adder pipeline (assume 4-bit significand): De- Compare exponents normalize smaller value Add 53-bit mantissas Re- Round Round normalize 1.1011 x 223 1.1110 x 221 1.1011 x 223 0.01111 x 223 10.00101 x 223 10.0011 x 223 1.00011 x 224 1.0010 x 224 HPRCTA 09 5

Adder Pipeline Mantissa addition Cascaded, pipelined DSP48 adders Scales well, operates fast De-normalize Exponent comparison and a variable shift of one significand Xilinx IP uses a DSP48 for the 11-bit comparison (waste) HPRCTA 09 6

Base Conversion Previous work in s.p. MAC designs base conversion Idea: Shift both inputs to the left by amout specified in low-order bits of exponents Reduces size of exponent, requires wider adder Example: Base-8 conversion: 1.01011101, exp=10110 (1.36328125 x 222 => ~5.7 million) Shift to the left by 6 bits 1010111.01, exp=10 (87.25 x 28*2 = > ~5.7 million) HPRCTA 09 7

Exponent Compare vs. Adder Width Exponent Width 7 6 5 4 3 Denormalize speed 119 MHz 246 MHz 368 MHz 372 MHz 494 MHz Adder Width 54 86 118 182 310 Base 16 32 64 128 256 #DSP48s 2 2 3 4 7 denorm DSP48 DSP48 DSP48 renorm HPRCTA 09 8

Accumulator Design HPRCTA 09 9

Three-Stage Reduction Architecture Input Output buffer Adder pipeline Input buffer HPRCTA 09 10

Three-Stage Reduction Architecture Input Output buffer Adder pipeline B1 3 2 1 0 Input buffer HPRCTA 09 11

Three-Stage Reduction Architecture Input Output buffer Adder pipeline B2 3 2 1 B1 Input buffer HPRCTA 09 12

Three-Stage Reduction Architecture Input Output buffer Adder pipeline B3 3 + 2 B1 Input buffer B2 HPRCTA 09 13

Three-Stage Reduction Architecture Input Output buffer Adder pipeline B4 3 + 2 B2+B3 B1 Input buffer HPRCTA 09 14

Three-Stage Reduction Architecture Input Output buffer Adder pipeline B5 3 + 2 B1+B4 B2+B3 Input buffer HPRCTA 09 15

Three-Stage Reduction Architecture Input Output buffer Adder pipeline B6 + 2 + 3 B1+B4 B2+B3 Input buffer B5 HPRCTA 09 16

Three-Stage Reduction Architecture Input Output buffer Adder pipeline B7 + 2 + 3 B2+B3 +B6 B1+B4 Input buffer B5 HPRCTA 09 17

Three-Stage Reduction Architecture Input Output buffer Adder pipeline B8 + 2 + 3 B1+B4 +B7 B2+B3 +B6 Input buffer B5 HPRCTA 09 18

Three-Stage Reduction Architecture Input Output buffer Adder pipeline C1 B1+B4 +B7 B2+B3 +B6 B5+B8 0 Input buffer HPRCTA 09 19

Minimum Set Size Four configurations : Deterministic control sequence, triggered by set change: D, A, C, B, A, B, B, C, B/D Minimum set size is 8 HPRCTA 09 20

Use Case: Sparse Matrix-Vector Multiply 0 1 2 3 4 5 6 7 8 9 10 A 0 0 0 B 0 0 0 0 C 0 D E 0 0 0 F G H 0 0 0 0 0 0 0 I 0 0 0 K 0 0 val A B C D E F G H I J K col 0 4 3 5 0 4 5 0 2 4 3 0 J 0 0 2 4 7 8 10 11 ptr Group vol/col Zero-terminate (A,0) (B,4) (0,0) (C,3) (D,4) (0,0) HPRCTA 09 21

SpMV Architecture Enough memory bandwidth to read: 5 val/col pairs (80 x 5 bits) per cycle ~15-20 GB/s Requires minimum number of entries per row: 5 x 8 = 40 Many sparse matrices don t have this many values per row Zero padding will degrade performance for many matrices HPRCTA 09 22

New SpMV Architecture Delete tree, replicate accumulator, schedule matrix data: 400 bits val0,0 col0,0 val1,0 col1,0 val2,0 col2,0 val3,0 col3,0 val4,0 col4,0 val0,1 col0,1 val1,1 col1,1 val2,1 col2,1 val3,1 col3,1 val4,1 col4,1 val0,2 col0,2 val1,2 col1,2 val2,2 col2,2 val3,2 col3,2 val4,2 col4,2 val0,3 col0,3 val1,3 col1,3 val2,3 col2,3 val3,3 col3,3 val4,3 col4,3 val0,4 col0,4 val1,4 col1,4 val2,4 col2,4 val3,4 col3,4 val4,4 col4,4 val0,5 col0,5 val1,5 col1,5 val2,5 col2,5 val3,5 col3,5 val4,5 col4,5 val0,6 col0,6 0.0 0.0 val2,6 col2,6 val3,6 col3,6 val4,6 col4,6 val0,7 col0,7 0.0 5 val2,7 col2,7 val3,7 col3,7 val4,7 col4,7 val0,8 col0,8 val5,0 col5,0 val2,8 col2,8 val3,8 col3,8 val4,8 col4,8 HPRCTA 09 23

Performance Results HPRCTA 09 24

Conclusions Developed serially-delivered accumulator using base- conversion technique Limited to shallow pipelines Deeper pipelines require large minimum set size 4 -> 11, 5 -> 19, 6 -> 23 1 1 + lg Goal: new reduction circuit to support deeper pipelines with no minimum set size Acknowledgements: NSF awards CCF-0844951, CCF-0915608 HPRCTA 09 25

Integrated Reduction Technique for Double Precision Accumulator

Download Presentation

Presentation Transcript

Related

More Related Content