SpZip: Architectural Support for Effective Data Compression in Irregular Applications
Yifan Yang, Joel S. Emer, Daniel Sanchez
ISCA 2021
Session 12B (June 16, 2021 at 8 PM EDT)
Irregular applications are memory bound
Irregular applications, such as graph analytics and sparse linear algebra, are an increasingly important workload domain (e.g., social network analysis, navigation, scientific computing).
Irregular applications are often memory bound.
Data compression is an attractive approach to accelerate irregular applications.
Existing data compression solutions are not tailored for irregular applications
Hardware compression units target sequentially accessed long streams (e.g., IBM z15 [ISCA'20]).
Compressed memory hierarchies support random accesses (e.g., VSC [ISCA'04]).
This work is optimized for indirect, data-dependent accesses to short streams.
Understanding access patterns in irregular applications
[Figure: a graph and its adjacency list. A vertices array holds per-vertex offsets into a neighbors array; traversing a vertex's neighbor list (e.g., v0, v2, v3) requires indirect, data-dependent accesses to short streams.]
Existing data compression solutions are not tailored for irregular applications
Hardware compression units for sequentially accessed long streams (e.g., IBM z15 [ISCA'20]) and compressed memory hierarchies that support random accesses (e.g., VSC [ISCA'04]) fall short on these access patterns:
Limited compression gain on short streams
Data decompression increases critical-path latency
This work is optimized for indirect, data-dependent accesses to short streams.
Compressing data structures in irregular applications is hard
Challenge 1: Access and decompression are interleaved.
[Figure: the core traverses the graph adjacency list through a chain of dependent accesses: vertices (offsets) → neighbors → vertex data.]
Compressing data structures in irregular applications is hard
Challenge 1: Access and decompression are interleaved.
Insight 1: Specialized hardware to accelerate data access and decompression. Exploit decoupled execution to hide memory access and decompression latencies.
[Figure: a fetcher placed in front of the core accesses the vertices array and the compressed neighbors array, decompresses the neighbors, and accesses the vertex data on the core's behalf.]
Compressing data structures in irregular applications is hard
Challenge 2: Need to support various access patterns and compression formats.
Insight 2: Programmable hardware. A pipeline consists of a set of composable operators expressing the traversal and decompression of data structures, specified in a Dataflow Configuration Language (DCL).
[Figure: the fetcher runs an Access → Access → Decompress operator pipeline that traverses the graph adjacency list (vertices, neighbors) for the core.]
Agenda
Motivation
SpZip Dataflow Configuration Language (DCL)
SpZip Design
Evaluation
Dataflow Configuration Language (DCL) overview
A DCL program expresses the traversal, decompression, and compression of data structures in irregular applications.
A DCL program is an acyclic graph of composable operators.
Operators are connected by queues to exploit pipeline parallelism.
Operators include memory access operators (Range, Indir) and a Decompress operator.
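To make the operator-graph idea concrete, here is a minimal Python sketch (an illustration only; DCL's actual syntax and operator semantics are not shown on this slide). Each operator is modeled as a generator, and chaining generators stands in for the queues that connect operators; range_op, indir_op, and decompress_op are assumed names.

def range_op(array, bounds):
    # Range: for each (lo, hi) request, stream the elements array[lo:hi].
    for lo, hi in bounds:
        yield from array[lo:hi]

def indir_op(array, indices):
    # Indir: gather array[i] for each index i arriving on the input queue.
    for i in indices:
        yield array[i]

def decompress_op(chunks, decode):
    # Decompress: expand each compressed chunk into its decoded elements.
    for chunk in chunks:
        yield from decode(chunk)

# Composing operators: the requests (0,2) and (2,4) select two ranges of data.
data = ['a', 'b', 'c', 'd']
print(list(range_op(data, [(0, 2), (2, 4)])))   # ['a', 'b', 'c', 'd']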
Traversing a sparse matrix in DCL
Compressed Sparse Row (CSR) format: an offsets array [0 2 4 5 7] and a rows array of {coordinate, value} pairs {1,a} {2,b} {0,c} {2,d} {3,e} {1,f} {2,g}.
In Core:
for i in range(numRows):
  for {col, val} in rows[offsets[i]: offsets[i+1]]:
    visit({col, val})
In DCL, the two loops map to two chained Range operators. The first Range operator (fed from InputQ) walks the offsets array (offsets[0]=0, offsets[1]=2, offsets[2]=4, ...) and pushes each offset into OffsetsQ; the second Range operator uses consecutive offset pairs to stream the corresponding row elements from memory (rows[0]={1,a}, rows[1]={2,b}, rows[2]={0,c}, rows[3]={2,d}, ...) into RowsQ. The core dequeues the {coordinate, value} pairs from RowsQ and applies visit().
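As a sanity check, the loop above can be run directly in Python on this slide's example arrays; visit() is a stand-in that just records each {coordinate, value} pair it sees.

offsets = [0, 2, 4, 5, 7]
rows = [(1, 'a'), (2, 'b'), (0, 'c'), (2, 'd'), (3, 'e'), (1, 'f'), (2, 'g')]
num_rows = len(offsets) - 1          # 4 rows in this example

visited = []
def visit(col, val):
    visited.append((col, val))       # placeholder for the per-nonzero work

for i in range(num_rows):
    for col, val in rows[offsets[i]:offsets[i + 1]]:
        visit(col, val)

# Row 0 yields (1,a),(2,b); row 1 yields (0,c),(2,d); row 2 yields (3,e); row 3 yields (1,f),(2,g).
assert visited == [(1, 'a'), (2, 'b'), (0, 'c'), (2, 'd'), (3, 'e'), (1, 'f'), (2, 'g')]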
Traversing a sparse matrix in DCL
With the CSR rows (row0-row3) stored uncompressed, the same pipeline applies: Range → OffsetsQ → Range → RowsQ → core.
In Core:
for i in range(numRows):
  for {col, val} in rows[offsets[i]: offsets[i+1]]:
    visit({col, val})
Data decompression support in DCL
CSR with individually compressed rows: the offsets now point into compressed row data (row0-row3).
In Core:
for i in range(numRows):
  for {col, val} in decompress(rows[offsets[i]: offsets[i+1]]):
    visit({col, val})
In DCL, a Decompress operator is inserted after the row-accessing Range operator: Range → OffsetsQ → Range → Compressed RowsQ → Decompress → RowsQ → core.
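The slide does not specify how each row is compressed. As one plausible illustration only (not necessarily SpZip's format), the sketch below delta-encodes a row's sorted column indices and shows the matching decompress step; compress_row and decompress_row are assumed helper names.

def compress_row(cols):
    # Delta-encode sorted column indices: store differences instead of raw values.
    out, prev = [], 0
    for c in cols:
        out.append(c - prev)
        prev = c
    return out

def decompress_row(deltas):
    # Invert compress_row: prefix-sum the deltas back into column indices.
    cols, prev = [], 0
    for d in deltas:
        prev += d
        cols.append(prev)
    return cols

row = [3, 7, 8, 15, 40]
assert compress_row(row) == [3, 4, 1, 7, 25]
assert decompress_row(compress_row(row)) == row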
Using DCL in PageRank: traversing multiple data structures
def PageRankIter(Graph g, Array scores, Array contribs):
  for src in range(g.numVertices):
    for dst in decompress(g.neighs[g.offsets[src]: g.offsets[src+1]]):
      scores[dst] += contribs[src]
The DCL pipeline traverses the compressed graph adjacency matrix (Range → OffsetsQ → Range → Compressed NeighsQ → Decompress → NeighsQ), uses a Range operator to stream the source vertex data (contribs), and uses an Indir operator on the decompressed neighbor IDs to prefetch the destination vertex data (scores) for the core.
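A minimal executable rendering of PageRankIter on a toy graph follows; it is a sketch only, with decompress() reduced to the identity because the neighbor-list encoding is not shown on this slide.

class Graph:
    def __init__(self, offsets, neighs):
        self.offsets, self.neighs = offsets, neighs
        self.numVertices = len(offsets) - 1

def decompress(neigh_slice):
    return neigh_slice                    # placeholder for the Decompress operator

def PageRankIter(g, scores, contribs):
    for src in range(g.numVertices):
        for dst in decompress(g.neighs[g.offsets[src]:g.offsets[src + 1]]):
            scores[dst] += contribs[src]  # scatter src's contribution to each neighbor

g = Graph(offsets=[0, 2, 4, 5, 7], neighs=[1, 2, 0, 2, 3, 1, 2])
scores, contribs = [0.0] * 4, [0.25] * 4
PageRankIter(g, scores, contribs)
print(scores)                             # [0.25, 0.5, 0.75, 0.25]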
Agenda
Motivation
SpZip Dataflow Configuration Language (DCL)
SpZip Design
Evaluation
SpZip Overview
SpZip augments each CPU core with a programmable fetcher and compressor.
The fetcher accelerates data structure traversal and decompression.
The compressor compresses newly generated data before storing it off-chip.
The fetcher and compressor issue conventional cache line requests.
SpZip exploits decoupled execution
The fetcher and compressor communicate with the core through queues to exploit decoupled execution.
The fetcher runs ahead of the core to traverse and decompress data, hiding memory access and decompression latencies.
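A software analogy for this decoupling (an illustration, not the hardware design): a fetcher thread runs ahead, traversing the graph and filling a bounded queue, while the consumer drains ready elements instead of stalling on each memory access.

import queue, threading

def fetcher(offsets, neighs, out_q):
    # Runs ahead of the consumer: traverse (and, in SpZip, decompress) the
    # neighbor lists, pushing (src, dst) pairs into a bounded queue.
    for src in range(len(offsets) - 1):
        for dst in neighs[offsets[src]:offsets[src + 1]]:
            out_q.put((src, dst))         # blocks only when the queue is full
    out_q.put(None)                       # end-of-stream marker

def core(out_q, scores, contribs):
    # Consumes ready elements from the queue instead of waiting on memory.
    while (item := out_q.get()) is not None:
        src, dst = item
        scores[dst] += contribs[src]

offsets, neighs = [0, 2, 4, 5, 7], [1, 2, 0, 2, 3, 1, 2]
scores, contribs = [0.0] * 4, [0.25] * 4
q = queue.Queue(maxsize=8)                # bounded FIFO models a hardware queue
t = threading.Thread(target=fetcher, args=(offsets, neighs, q))
t.start(); core(q, scores, contribs); t.join()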
SpZip fetcher microarchitecture
The Access Unit and Decompression Unit implement DCL operators.
A scratchpad holds the queues between operators.
Queues between operators allow pipeline parallelism.
[Figure: the fetcher sits between the core and the L2. It contains an Access Unit (Range, Indir), a Decompression Unit (Delta, BPC), a scheduler with operator contexts (ctxt 0 ... ctxt k), and a scratchpad holding queues q0 ... qn.]
SpZip fetcher is programmable
The scratchpad is configurable to support variable numbers and sizes of queues.
DCL operators are time-multiplexed on the same physical unit.
The scheduler holds operator contexts and chooses which operator to fire each cycle.
[Figure: the Range, Range, and Decompress operators of the adjacency-matrix pipeline are mapped onto operator contexts (ctxt 0, ctxt 1), with their queues (InputQ, OffsetsQ, Compressed NeighsQ) allocated in the scratchpad.]
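A simplified software model of this scheduling policy (the details here are assumptions; the real scheduler is hardware): each cycle, fire one operator context whose input queue has data and whose output queue has room.

from collections import deque

class Ctxt:
    # One operator context: a step() function plus its input and output queues.
    def __init__(self, step, in_q, out_q, capacity=4):
        self.step, self.in_q, self.out_q, self.capacity = step, in_q, out_q, capacity
    def ready(self):
        return bool(self.in_q) and len(self.out_q) < self.capacity
    def fire(self):
        self.out_q.append(self.step(self.in_q.popleft()))

def schedule(ctxts, max_cycles=1000):
    # Each cycle, fire the first ready context; stop when nothing can fire.
    for cycle in range(max_cycles):
        for c in ctxts:
            if c.ready():
                c.fire()
                break
        else:
            return cycle

# Two operators time-multiplexed under one scheduler: double, then add one.
q0, q1, q2 = deque([1, 2, 3]), deque(), deque()
schedule([Ctxt(lambda x: 2 * x, q0, q1), Ctxt(lambda x: x + 1, q1, q2)])
print(list(q2))   # [3, 5, 7]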
SpZip compressor overview
Compression uses a different set of DCL operators (e.g., Compress, StreamWr, MemQ).
The compressor has a similar decoupled and programmable design as the fetcher.
See paper for more details.
[Figure: the compressor sits between the core and the LLC, with the Compress, StreamWr, and MemQ operators mapped onto its Compress Unit, SWU, and MQU.]
Agenda
Motivation
SpZip Dataflow Configuration Language (DCL)
SpZip Design
Evaluation
Evaluation Methodology
Event-driven simulation using ZSim.
SpZip system: 16 Haswell-like OOO cores, 32 MB L3 cache, 4 memory controllers (51.2 GB/s). The per-core fetcher and compressor add 0.2% area overhead.
Irregular applications: PageRank, PageRank Delta, Connected Components, Radii Estimation, BFS, Degree Counting, SpMV.
Large real-world inputs: up to 100 million vertices and up to 1 billion edges.
SpZip improves performance and reduces traffic
See paper for
DCL support for compressing data structures
Programmable compressor design
Additional evaluation results: impact of preprocessing, benefits over compressed memory hierarchies, and impact of decoupled fetching vs. data compression
Conclusions
Irregular applications have indirect, data-dependent memory access patterns that make compression challenging.
SpZip makes data compression practical for irregular applications:
Decoupled execution hides memory access and decompression latencies.
DCL and the programmable design support a wide range of data structures and compression formats.
SpZip achieves significant speedups and memory traffic reductions on irregular applications.
Thanks for your attention!
ISCA 2021
Session 12B (June 16, 2021 at 8 PM EDT)
Slide Note

Hi! My name is Yifan Yang. It is my great pleasure to present SpZip: Architectural Support for Effective Data Compression in Irregular Applications. This work is done in collaboration with Professor Joel Emer and Professor Daniel Sanchez at MIT. Please attend our live session on June 16th.




