Enhancing Multi-Node Systems with Coherent DRAM Caches

Chiachen Chou, Georgia Tech

Aamer Jaleel, NVIDIA

Moinuddin K. Qureshi, Georgia Tech

MICRO 2016

Taipei, Taiwan

Oct 18, 2016

3D-DRAM: High Bandwidth Memory (HBM)

courtesy: Micron, AMD, Intel, NVIDIA

Compared to DDR, 3D DRAM as a cache (DRAM

Cache) transparently provides 4-8X bandwidth

Prior studies focus on single-node systems

DRAM$

DRAM$

Node 0

Node 1

Implicitly coherent, easy to implement

Memory-Side Cache is implicitly

coherent and simple to implement

L3

L3

long-latency

interconnect

local

local

DRAM$

DRAM$

Node 0

Node 1

Implicitly coherent, easy to implement

Cache only local data, long latency of remote data

A L3 cache miss of remote data incurs

a long latency in Memory-Side Cache

L3

L3

long-latency

interconnect

DRAM$

DRAM$

Node 0

Node 1

Cache local/remote data, save remote miss latency

Need coherence support

Coherent DRAM Cache saves the L3 miss latency

of remote data but needs coherence support

L3

L3

AVG: 1.3X

4-node system, each node has 1GB DRAM$

Ideal-CDC outperforms Memory-Side Cache by 30%

•

The Need for Coherent DRAM Cache

•

Challenge 1: A Large Coherence Directory

•

Challenge 2: A Slow Request-For-Data Operation

•

Summary

LLC

Node 0

Node 1

LLC

Coherence Directory (CDir): Sparse Directory*, tracking

cached data in the system

On a cache miss, the home node

accesses the CDir information

Coherence

Directory

Coherence

Directory

*Standalone inclusive directory with recalls

Memory

Memory

CDir

CDir

For giga-scale DRAM cache, the 64MB coherence

directory incurs storage and latency overheads

L3

L3

On-Die CDir

for L3

Coherence

Directory

DRAM$

Coherence directory size must be proportional to cache size

8MB

1MB

1GB

64MB

Memory-Side Cache

Coherent DRAM Cache

Memory

Memory

DRAM$

L3

DRAM$

Options:

1. SRAM-CDir: place the 64MB CDir on die (SRAM)

2. Embedded-CDir: embed the 64MB CDir in 3D-DRAM

CDir 64MB

Embedding CDir avoids the SRAM storage,

but incurs long access latency to CDir

3D-DRAM

L4$

DRAM$

miss

CDir Entry

DRAM$

SRAM

L3

SRAM-CDir

Embedded-CDir

1MB

3D-DRAM

64MB

CDir

Coherent DRAM Cache

Caching the recently used CDir entries for future references

in the unused on-die CDir of L3 coherence

DCB mitigates the latency to access embedded CDir

DRAM$

3D-DRAM

CDir

DCB

SRAM

64B

(16 CDir entries)

miss

One access to CDir in 3D-DRAM returns 16 CDir entries.

The hit rate of DCB is 80% with the co-

optimization of DCB and embedded-CDir

4-node system, each node has 1GB DRAM$

DRAM-cache Coherence Buffer (DCB): 21%

•

The Need for Coherent DRAM Cache

•

Challenge 1: A Large Coherence Directory

•

Challenge 2: A Slow Request-For-Data Operation

•

Summary

In Coherent DRAM Cache, Request-For-

Data incurs a slow 3D-DRAM access

RFD (fwd-getS): read the data from a remote cache

Home Node

MSC:

SRAM$

Read

Request-For-Data accesses only read-write shared data

Home Node

Read-write shared data bypass DRAM

caches and are stored only in L3 caches

Request-For-Data

DRAM

Ca

che for Multi-

e S

stems (CANDY)

AVG: 1.25X

CANDY: 25% improvement (5% within Ideal-CDC)

•

Coherent DRAM Cache faces two key challenges:

–

Large coherence directory

–

Slow Request-For-Data

•

DRAM

Ca

che for Multi-

e S

stems (CANDY)

–

DRAM-cache Coherence Buffer with embedded coherence directory

–

Sharing-Aware Bypass

•

CANDY outperforms Memory-Side Cache by 25% (5% within

Ideal Coherent DRAM Cache)

MICRO 2016

Taipei, Taiwan

Oct 18, 2016

Computer Architecture and Emerging Technologies Lab, Georgia Tech

Chiachen Chou, Georgia Tech

Aamer Jaleel, NVIDIA

Moinuddin K. Qureshi, Georgia Tech

Backup Slides

Compare to MSC, CANDY reduces 65% of the traffic

(1) Detecting read-write shared data

(2) Enforcing R/W shared data to bypass caches

Home Node

Sharing-Aware Bypass detects read-write shared

data at run-time based on coherence operations

CDir Entry

(1) Detecting read-write shared data

(2) Enforcing R/W shared data to bypass caches

•

L4 cache miss and L3 dirty eviction

Home Node

L3

DRAM$

Requester

Sharing-Aware Bypass enforces R/W

shared data to be stored only in L3 caches

DRAM$

Off-chip DRAM

4-Node

•

Evaluation in Sniper simulator, baseline: Memory-Side Cache

•

13 parallel benchmarks from NAS, SPLASH2, PARSEC, NU-bench

Slide Note

Embed Share

Download

Exploring the integration of Coherent DRAM Caches in multi-node systems to improve memory performance. Discusses the benefits, challenges, and potential performance improvements compared to existing memory-side cache solutions.

weld_les Follow

Uploaded on Sep 07, 2024 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript

CANDY: ENABLING COHERENT DRAM CACHES FOR MULTI-NODE SYSTEMS MICRO 2016 Taipei, Taiwan Oct 18, 2016 Chiachen Chou, Georgia Tech Aamer Jaleel, NVIDIA Moinuddin K. Qureshi, Georgia Tech

3D-DRAM HELPS MITIGATE BANDWIDTH WALL 3D-DRAM: High Bandwidth Memory (HBM) P L1 P L1 L3$ 3D-DRAM DRAM Cache Off-chip DRAM Compared to DDR, 3D DRAM as a cache (DRAM Cache) transparently provides 4-8X bandwidth NVIDIA PASCAL AMD Zen Intel Xeon Phi courtesy: Micron, AMD, Intel, NVIDIA 2

DRAM CACHES FOR MULTI-NODE SYSTEMS Prior studies focus on single-node systems Node 0 Node 1 P P P P P P P P P P P P L3$ L3$ L3$ DRAM$ DRAM$ long-latency inter-node DRAM$ Off-chip DRAM Off-chip DRAM network Off-chip DRAM We study DRAM caches for Multi-Node systems 3

MEMORY-SIDE CACHE (MSC) Implicitly coherent, easy to implement Node 0 Node 1 P P P P P P P P L3 L3 DRAM$ DRAM$ local local long-latency interconnect Memory-Side Cache is implicitly coherent and simple to implement 4

SHORTCOMINGS OF MEMORY-SIDE CACHE Implicitly coherent, easy to implement Cache only local data, long latency of remote data Node 0 Node 1 P P P P P P P P L3 ~4MB L3 DRAM$ DRAM$ remote ~1GB long-latency interconnect A L3 cache miss of remote data incurs a long latency in Memory-Side Cache 5

COHERENT DRAM CACHES (CDC) Cache local/remote data, save remote miss latency Need coherence support Node 0 Node 1 P P P P P P P P L3 L3 DRAM$ DRAM$ remote Coherent DRAM Cache saves the L3 miss latency of remote data but needs coherence support 6

POTENTIAL PERFORMANCE IMPROVEMENT 4-node system, each node has 1GB DRAM$ MSC Ideal Coherent DRAM Cache 3.0 2.5 AVG: 1.3X 2.0 Speedup Ideal-CDC outperforms Memory-Side Cache by 30% 1.5 1.0 0.5 0.0 water.n fmm facesim fluid radiosity stream mg ocean.c dedup vips barnes AVG kmeans ocean.nc 7

AGENDA The Need for Coherent DRAM Cache Challenge 1: A Large Coherence Directory Challenge 2: A Slow Request-For-Data Operation Summary 8

DIRECTORY-BASED COHERENCE PROTOCOL Coherence Directory (CDir): Sparse Directory*, tracking cached data in the system Node 0 Node 1 P P P P P P P P LLC LLC Coherence Directory Coherence Directory Memory CDir CDir Memory On a cache miss, the home node accesses the CDir information *Standalone inclusive directory with recalls 9

LARGE COHERENCE DIRECTORY Coherence directory size must be proportional to cache size Coherent DRAM Cache Memory-Side Cache P P P P P P P P L3 L3 8MB On-Die CDir for L3 1GB DRAM$ 1MB Coherence Directory Memory DRAM$ 64MB Memory For giga-scale DRAM cache, the 64MB coherence directory incurs storage and latency overheads 10

WHERE TO PLACE COHERENCE DIRECTORY? Options: 1. SRAM-CDir: place the 64MB CDir on die (SRAM) 2. Embedded-CDir: embed the 64MB CDir in 3D-DRAM DRAM$ miss P P P P P P P CDir 64MB P SRAM L3 L3 L3 3D-DRAM L4$ DRAM$ DRAM$ CDir 64MB Embedded-CDir SRAM-CDir CDir Entry Embedding CDir avoids the SRAM storage, but incurs long access latency to CDir 11

DRAM-CACHE COHERENCE BUFFER (DCB) Caching the recently used CDir entries for future references in the unused on-die CDir of L3 coherence Memory-Side Cache Coherent DRAM Cache DRAM$ miss P P P P P P P P DRAM$ L3 On-Die CDir for L3 On-Die CDir for L3 DRAM-cache Coherence Buffer 1MB 1MB Unused Hit Miss CDir 3D-DRAM CDir Entry 64MB DCB mitigates the latency to access embedded CDir 12

DESIGN OF DRAM-CACHE COHERENCE BUFFER One access to CDir in 3D-DRAM returns 16 CDir entries. SRAM insert DCB miss = = = = 3D-DRAM Demand CDir 64B (16 CDir entries) Set S 4-way set-associative S+1 S+2 S+3 The hit rate of DCB is 80% with the co- optimization of DCB and embedded-CDir 13

EFFECTIVENESS OF DCB 4-node system, each node has 1GB DRAM$ Embedded-CDir DRAM-cache Coherence Buffer (DCB) 3.0 2.5 2.0 Speedup DRAM-cache Coherence Buffer (DCB): 21% 1.5 1.0 0.5 0.0 ocean.c water.n fmm facesim fluid radiosity stream dedup mg vips barnes AVG kmeans ocean.nc 14

AGENDA The Need for Coherent DRAM Cache Challenge 1: A Large Coherence Directory Challenge 2: A Slow Request-For-Data Operation Summary 15

SLOW REQUEST-FOR-DATA (RFD) RFD (fwd-getS): read the data from a remote cache cache miss Remote Home Node Coherence Directory DRAM$ L3 Request-For-Data MSC: SRAM$ Read CDC: DRAM$ Read extra latency In Coherent DRAM Cache, Request-For- Data incurs a slow 3D-DRAM access 16

SHARING-AWARE BYPASS Request-For-Data accesses only read-write shared data cache miss Owner Home Node Coherence Directory P P P P M L3 M DRAM$ Request-For-Data Read-write shared data bypass DRAM caches and are stored only in L3 caches 17

PERFORMANCE IMPROVEMENT OF CANDY DRAM Cache for Multi-Node Systems (CANDY) DCB CANDY Ideal-CDC (64MB SRAM+idealized RFD) 3.0 2.5 AVG: 1.25X 2.0 CANDY: 25% improvement (5% within Ideal-CDC) Speedup 1.5 1.0 0.5 0.0 kmeans water.n fmm facesim radiosity stream fluid ocean.c barnes mg dedup vips AVG ocean.nc 18

SUMMARY Coherent DRAM Cache faces two key challenges: Large coherence directory Slow Request-For-Data DRAM Cache for Multi-Node Systems (CANDY) DRAM-cache Coherence Buffer with embedded coherence directory Sharing-Aware Bypass CANDY outperforms Memory-Side Cache by 25% (5% within Ideal Coherent DRAM Cache) 19

THANK YOU CANDY: ENABLING COHERENT DRAM CACHES FOR MULTI-NODE SYSTEMS MICRO 2016 Taipei, Taiwan Oct 18, 2016 Chiachen Chou, Georgia Tech Aamer Jaleel, NVIDIA Moinuddin K. Qureshi, Georgia Tech Computer Architecture and Emerging Technologies Lab, Georgia Tech

Backup Slides 21

DCB HIT RATE DCB? Hit? Rate DCB-Demand DCB-SpatLoc 100 DCB? Hit? Rate? (%) 80 60 40 20 0 radiosity ocean.c barnes facesim ocean.nc vips stream mg dedup kmeans AVG fluid water.n fmm NPB SPLASH2 PARSEC OTHERS 22

OPERATION BREAKDOWN Consequent? Operation? of? L3? Misses? in? CDC RFD INV Mem Hit 100.00 80.00 Percentage 60.00 40.00 20.00 0.00 radiosity ocean.c barnes ocean.nc facesim vips mg stream dedup AVG kmeans fluid water.n fmm NPB SPLASH2 PARSEC OTHERS 23

INTER-NODE NETWORK TRAFFIC REDUCTION CANDY 100.00 Traffic? Reduction? (%) 80.00 60.00 40.00 20.00 0.00 fmm AVG mg dedup ocean.ncont ocean.cont water.nsq kmeans barnes vips fluidanimate facesim radiosity streamcluster NPB SPLASH2 PARSEC OTHERS Compare to MSC, CANDY reduces 65% of the traffic 24

PERFORMANCE (NUMA-AWARE SYSTEMS) Performance? NUMA-Aware SW-Opt 3.00 Speedup?(w.r.t.? MSC) 2.50 2.00 1.50 1.00 0.50 0.00 radiosity streamcluster barnes kmeans facesim water.nsq fluidanimate dedup fmm vips ocean.cont mg AVG ocean.ncont NPB SPLASH2 PARSEC OTHERS 25

SHARING-AWARE BYPASS (1) (1) Detecting read-write shared data (2) Enforcing R/W shared data to bypass caches cache request Home Node Coherence Directory CDir Entry 1 M RWS Memory Read Invalidate Request-For-Data Flush Read-write shared data, set Read-Write Shared (RWS) bit Sharing-Aware Bypass detects read-write shared data at run-time based on coherence operations 26

SHARING-AWARE BYPASS (2) (1) Detecting read-write shared data (2) Enforcing R/W shared data to bypass caches L4 cache miss and L3 dirty eviction BypL4 bit BypL4 bit + Data cache miss Requester 1 Home Node RWS L3 M Dirty Eviction 1 1 M Yes BypL4? BypL4 No Yes BypL4? DRAM$ Data No Sharing-Aware Bypass enforces R/W shared data to be stored only in L3 caches 27

METHODOLOGY P P P P 4-Node DRAM$ Off-chip DRAM DRAM Cache 1GB DDR3.2GHz, 64-bit 8 channels, 16 banks/ch DRAM Memory 16GB DDR1.6GHz, 64-bit 2 channels 8 banks/ch 4-Node, each node: 4 cores 3.2 GHz 2-wide OOO 4MB 16-way L3 shared cache Capacity Bus Channel Evaluation in Sniper simulator, baseline: Memory-Side Cache 13 parallel benchmarks from NAS, SPLASH2, PARSEC, NU-bench 28

Enhancing Multi-Node Systems with Coherent DRAM Caches

Download Presentation

Presentation Transcript

Related

More Related Content