Enhancing Data Movement Efficiency in DRAM with Low-Cost Inter-Linked Subarrays (LISA)
This presentation addresses bulk data movement efficiency within DRAM by introducing Low-Cost Inter-Linked Subarrays (LISA). By providing wide connectivity between adjacent subarrays, LISA enables fast inter-subarray data transfers, reducing both latency and energy consumption. Key applications include rapid in-DRAM data copying, in-DRAM caching of frequently accessed data, and accelerated precharging. The slides cover the motivation, key ideas, relevant DRAM background, the LISA substrate and its new DRAM command, and a system-level evaluation.
Presentation Transcript
Low-Cost Inter-Linked Subarrays (LISA): Enabling Fast Inter-Subarray Data Movement in DRAM. Kevin Chang, Prashant Nair, Donghyuk Lee, Saugata Ghose, Moinuddin Qureshi, and Onur Mutlu.
Problem: Inefficient Bulk Data Movement. Bulk data movement is a key operation in many applications; memmove & memcpy account for 5% of cycles in Google's datacenter [Kanev+ ISCA'15]. Today, copying data requires moving it from DRAM across the narrow 64-bit memory channel, through the memory controller and the CPU's cache hierarchy (LLC), and back to the destination in memory, incurring long latency and high energy.
Moving Data Inside DRAM? A DRAM chip contains multiple banks, and each bank contains many subarrays (Subarray 1 ... Subarray N), each with 512 rows of 8Kb cells, connected only by a narrow 64-bit internal data bus. Low connectivity in DRAM is the fundamental bottleneck for bulk data movement. Goal: Provide a new substrate to enable wide connectivity between subarrays.
Key Idea and Applications. Low-Cost Inter-Linked Subarrays (LISA): fast bulk data movement between subarrays over a wide datapath built from isolation transistors, at 0.8% DRAM chip area overhead. LISA is a versatile substrate that enables new applications:
- Fast bulk data copy: copy latency 1.363µs → 0.148µs (9.2x); 66% speedup, -55% DRAM energy.
- In-DRAM caching: hot-data access latency 48.7ns → 21.5ns (2.2x); 5% speedup.
- Fast precharge: precharge latency 13.1ns → 5.0ns (2.6x); 8% speedup.
Outline: Motivation and Key Idea; DRAM Background; LISA Substrate; New DRAM Command to Use LISA; Applications of LISA.
DRAM Internals. A DRAM chip has 8~16 banks, and each bank contains 16~64 subarrays connected to the I/O by a narrow 64-bit internal data bus. Each subarray holds 512 rows of 8Kb; a row is selected by the row decoder driving a wordline, and its cells connect through bitlines to a row buffer made of sense amplifiers (S) and precharge units (P).
DRAM Operation. (1) ACTIVATE: store the selected row into the row buffer; the bitline voltage is driven from Vdd/2 toward Vdd. (2) READ: select the target column and drive it to the bank I/O. (3) PRECHARGE: reset the bitlines back to Vdd/2 to prepare for a new ACTIVATE.
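To make this command flow concrete, here is a minimal, hypothetical memory-controller-style sketch of the ACTIVATE → READ → PRECHARGE sequence. The command names mirror the slide; the Bank struct and issue() helper are illustrative assumptions, not the paper's simulator code, and timing constraints are omitted.

```cpp
#include <cstdio>

// Hypothetical sketch of the basic DRAM access sequence described above.
enum class Cmd { ACTIVATE, READ, PRECHARGE };

struct Bank {
    int open_row = -1;  // -1 means all bitlines are precharged (at Vdd/2)
};

void issue(Bank& bank, Cmd cmd, int row = -1, int col = -1) {
    switch (cmd) {
    case Cmd::ACTIVATE:
        // Drive the selected row onto the bitlines; the sense amplifiers
        // latch it into the row buffer (bitlines move from Vdd/2 toward Vdd).
        bank.open_row = row;
        std::printf("ACTIVATE  row %d\n", row);
        break;
    case Cmd::READ:
        // Select the target column of the open row and drive it to the I/O.
        std::printf("READ      row %d, col %d\n", bank.open_row, col);
        break;
    case Cmd::PRECHARGE:
        // Reset the bitlines to Vdd/2 so a different row can be activated.
        bank.open_row = -1;
        std::printf("PRECHARGE\n");
        break;
    }
}

int main() {
    Bank bank;
    issue(bank, Cmd::ACTIVATE, /*row=*/42);
    issue(bank, Cmd::READ, /*row=*/-1, /*col=*/7);
    issue(bank, Cmd::PRECHARGE);
}
```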
Outline: Motivation and Key Idea; DRAM Background; LISA Substrate; New DRAM Command to Use LISA; Applications of LISA.
Observations. (1) Within a subarray, the bitlines serve as a bus that is as wide as a row, far wider than the 64-bit internal data bus. (2) The bitlines of adjacent subarrays are physically close to each other, but electrically disconnected.
Low-Cost Inter-Linked Subarrays (LISA). Key idea: interconnect the bitlines of adjacent subarrays in a bank using isolation transistors (links). Turning a link on joins the neighboring subarrays' bitlines, so LISA forms a wide (8Kb) datapath between subarrays, compared to the 64-bit internal data bus.
New DRAM Command to Use LISA. Row Buffer Movement (RBM): move a row of data from an activated row buffer to a precharged one (e.g., RBM: SA1 → SA2). Subarray 1 is activated (bitlines near Vdd) while Subarray 2 is precharged (bitlines at Vdd/2). Turning the links on triggers charge sharing between the two sets of bitlines, and Subarray 2's sense amplifiers then amplify the shared charge, activating Subarray 2 with the data. RBM transfers an entire row between subarrays.
RBM Analysis. The range of a single RBM depends on the DRAM design; moving data across more than 3 subarrays requires multiple RBMs. RBM is validated with SPICE using worst-case cells (NCSU FreePDK 45nm library): it moves 4KB of data in 8ns (with a 60% guardband), i.e., about 500 GB/s, 26x the bandwidth of a DDR4-2400 channel, at 0.8% DRAM chip area overhead [O+ ISCA'14].
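As a quick sanity check on these bandwidth figures, assuming a standard 64-bit (8-byte) DDR4-2400 channel:

```latex
\[
\text{RBM bandwidth} \approx \frac{4\,\text{KB}}{8\,\text{ns}}
  = \frac{4096\,\text{B}}{8\,\text{ns}} = 512\,\text{GB/s} \approx 500\,\text{GB/s}
\]
\[
\text{DDR4-2400 channel peak} = 2400\,\text{MT/s} \times 8\,\text{B} = 19.2\,\text{GB/s},
\qquad
\frac{512\,\text{GB/s}}{19.2\,\text{GB/s}} \approx 26.7\times
\]
```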
Outline: Motivation and Key Idea; DRAM Background; LISA Substrate; New DRAM Command to Use LISA; Applications of LISA: 1. Rapid Inter-Subarray Copying (RISC), 2. Variable Latency DRAM (VILLA), 3. Linked Precharge (LIP).
1. Rapid Inter-Subarray Copying (RISC). Goal: efficiently copy a row across subarrays. Key idea: use RBM to form a new command sequence: (1) activate the src row in Subarray 1, (2) RBM: SA1 → SA2, (3) activate the dst row in Subarray 2, which writes the row buffer contents into the dst row. RISC reduces row-copy latency by 9.2x and DRAM energy by 48.1x.
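A minimal sketch of this three-step command sequence, in memory-controller-style pseudocode. The helper functions, subarray/row addressing, and the one-hop-per-adjacent-subarray loop are illustrative assumptions for clarity (the slides note a single RBM can actually span up to a few subarrays, and the real command encoding is defined by the DRAM interface, not reproduced here).

```cpp
#include <cstdio>

// Illustrative helpers standing in for DRAM commands issued by the controller.
void activate(int subarray, int row)  { std::printf("ACTIVATE  SA%d row %d\n", subarray, row); }
void rbm(int src_sa, int dst_sa)      { std::printf("RBM       SA%d -> SA%d\n", src_sa, dst_sa); }
void precharge(int subarray)          { std::printf("PRECHARGE SA%d\n", subarray); }

// Copy one row from (src_sa, src_row) to (dst_sa, dst_row) using RISC.
void risc_copy(int src_sa, int src_row, int dst_sa, int dst_row) {
    activate(src_sa, src_row);             // 1. bring the src row into its row buffer
    // 2. hop the row buffer contents toward the destination subarray.
    //    Simplification: one RBM per adjacent subarray pair.
    int step = (dst_sa > src_sa) ? 1 : -1;
    for (int sa = src_sa; sa != dst_sa; sa += step)
        rbm(sa, sa + step);
    activate(dst_sa, dst_row);             // 3. write the row buffer into the dst row
    precharge(dst_sa);                     // close the row buffers afterward
}

int main() {
    risc_copy(/*src_sa=*/1, /*src_row=*/10, /*dst_sa=*/2, /*dst_row=*/300);
}
```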
Methodology. Cycle-level simulator: Ramulator [CAL'15], https://github.com/CMU-SAFARI/ramulator. CPU: 4 out-of-order cores at 4GHz; L1: 64KB/core, L2: 512KB/core, L3: 4MB shared. DRAM: DDR3-1600, 2 channels. Benchmarks: memory-intensive (TPC, STREAM, SPEC2006, DynoGraph, random) and copy-intensive (bootup, forkbench, shell script); 50 workloads mixing memory- and copy-intensive benchmarks. Performance metric: Weighted Speedup (WS).
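For reference, weighted speedup for an n-application workload is conventionally defined as below; the slide does not spell out the formula, so the standard definition is assumed here, where IPC_i^shared is application i's IPC when co-running and IPC_i^alone is its IPC when running alone on the same system.

```latex
\[
\mathrm{WS} \;=\; \sum_{i=1}^{n} \frac{\mathrm{IPC}_i^{\text{shared}}}{\mathrm{IPC}_i^{\text{alone}}}
\]
```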
Comparison Points. Baseline: copy data through the CPU over the memory channel (as in existing systems). RowClone [Seshadri+ MICRO'13]: an in-DRAM bulk copy scheme with fast intra-subarray copying via bitlines, but slow inter-subarray copying over the narrow internal data bus.
System Evaluation: RISC. Rapid Inter-Subarray Copying using LISA improves weighted speedup by 66% and reduces DRAM energy by 55% over the baseline. RowClone fares much worse (a 24% weighted speedup degradation, with only a 5% benefit on the other metric) because its slow inter-subarray copies degrade bank-level parallelism. [Chart: WS improvement and DRAM energy reduction over baseline (%) for RowClone and RISC.]
2. Variable Latency DRAM (VILLA). Goal: reduce DRAM latency with low area overhead. Motivation: there is a trade-off between area and latency. Long bitlines (commodity DDRx) are dense but slow; short bitlines (e.g., RLDRAM) make activate and precharge faster, but at high area overhead (>40%).
2. Variable Latency DRAM (VILLA). Key idea: reduce the access latency of hot data via a heterogeneous DRAM design [Lee+ HPCA'13, Son+ ISCA'13]. VILLA adds fast subarrays (32 rows each) as a cache in each bank of slow subarrays (512 rows each). Challenge: the VILLA cache requires frequent movement of data rows. LISA caches rows rapidly from slow to fast subarrays, reducing hot-data access latency by 2.2x at only 1.6% area overhead.
System Evaluation: VILLA. Across 50 quad-core, memory-intensive workloads, VILLA improves normalized speedup by 5% on average (up to 16%), with gains tracking the VILLA cache hit rate. Caching hot data in DRAM using LISA improves system performance. [Chart: normalized speedup and VILLA cache hit rate (%) across the 50 workloads.]
3. Linked Precharge (LIP). Problem: the precharge time is limited by the drive strength of a single precharge unit. Key idea: with LISA's links turned on, a subarray is precharged using multiple precharge units, including those of a neighboring subarray, rather than its own alone (conventional DRAM vs. LISA DRAM). LIP reduces precharge latency by 2.6x (with a 43% guardband).
System Evaluation: LIP. Across 50 quad-core, memory-intensive workloads, LIP improves normalized speedup by 8% on average (up to 13%). Accelerating precharge using LISA improves system performance. [Chart: normalized speedup across the 50 workloads.]
Other Results in Paper: combined applications; single-core results; sensitivity results (LLC size, number of channels, copy distance); qualitative comparison to other heterogeneous DRAM designs; detailed quantitative comparison to RowClone.
Summary. Bulk data movement is inefficient in today's systems; low connectivity between subarrays is the bottleneck. Low-Cost Inter-Linked Subarrays (LISA) bridge the bitlines of adjacent subarrays via isolation transistors, creating a wide datapath at 0.8% DRAM chip area overhead. LISA is a versatile substrate that enables new applications: fast bulk data copy (66% speedup, -55% DRAM energy), in-DRAM caching (5% speedup), and fast precharge (8% speedup). LISA can enable other applications as well. Source code will be available in April at https://github.com/CMU-SAFARI.
Low-Cost Inter-Linked Subarrays (LISA): Enabling Fast Inter-Subarray Data Movement in DRAM. Kevin Chang, Prashant Nair, Donghyuk Lee, Saugata Ghose, Moinuddin Qureshi, and Onur Mutlu.
Backup
SPICE on RBM
Comparison to Prior Works: Heterogeneous DRAM Designs. TL-DRAM (Lee+ HPCA'13): intra-subarray heterogeneity. CHARM (Son+ ISCA'13): inter-bank heterogeneity. VILLA: inter-subarray heterogeneity. The designs are further contrasted on caching latency and cache utilization.
Comparison to Prior Works: DAS-DRAM (Lu+ MICRO'15) vs. LISA. Goal: DAS-DRAM is a heterogeneous DRAM design, whereas LISA is a substrate for bulk data movement that can also enable other applications. Movement mechanism: DAS-DRAM uses migration cells; LISA uses low-cost links. The two are further contrasted on whether copy latency scales.
LISA vs. Samsung Patent. S.-Y. Seo, "Methods of Copying a Page in a Memory Device and Methods of Managing Pages in a Memory System," U.S. Patent Application 20140185395, 2014. The patent covers only data copying; it is vague and lacks implementation detail (how does the data get moved? what are the steps?), provides no latency or energy analysis, and has no system evaluation.
RBM Across 3 Subarrays. [Figure: RBM: SA1 → SA3, spanning Subarrays 1, 2, and 3 and their row buffers.]
Comparison of Inter-Subarray Row Copying. DRAM energy (µJ) vs. latency (ns) for RISC, RowClone [MICRO'13], and memcpy (baseline) across copy distances of 1 to 15 hops: RISC achieves a 9x latency and 48x energy reduction. [Chart: energy vs. latency; latency axis 0 to 1400 ns.]
RISC: Cache Coherence. The data in DRAM may not be up-to-date with respect to the on-chip caches. Before a copy, the memory controller flushes dirty cache lines belonging to the source and invalidates cache lines belonging to the destination. Techniques such as the Dirty-Block Index [Seshadri+ ISCA'14] can accelerate this coherence work; other in-DRAM copy proposals handle the same issue [Seshadri+ MICRO'13, CAL'15].
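A rough sketch of the order of operations the memory controller would follow under the steps above. The cache-interface functions and the 4KB size are hypothetical placeholders; real hardware performs these as coherence/flush transactions, not software calls.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical cache-interface hooks (placeholders, not a real controller API).
void writeback_dirty_lines(unsigned long long base, std::size_t bytes) {
    std::printf("flush dirty lines   [%#llx, +%zu)\n", base, bytes);
}
void invalidate_lines(unsigned long long base, std::size_t bytes) {
    std::printf("invalidate lines    [%#llx, +%zu)\n", base, bytes);
}
void issue_in_dram_copy(unsigned long long src, unsigned long long dst, std::size_t bytes) {
    std::printf("in-DRAM copy (RISC) %#llx -> %#llx (%zu bytes)\n", src, dst, bytes);
}

// 1. Make DRAM's copy of src up-to-date, 2. drop stale cached copies of dst,
// 3. only then perform the copy inside DRAM.
void coherent_risc_copy(unsigned long long src, unsigned long long dst, std::size_t bytes) {
    writeback_dirty_lines(src, bytes);   // src data in DRAM must be current
    invalidate_lines(dst, bytes);        // cached dst data becomes stale
    issue_in_dram_copy(src, dst, bytes);
}

int main() {
    coherent_risc_copy(0x10000, 0x20000, 4096);  // e.g., copy a 4KB page (size illustrative)
}
```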
RISC vs. RowClone: 4-core results. [Chart: comparison of RISC and RowClone across 4-core workloads.]
Sensitivity to Cache Size. Single core: RISC vs. baseline as the LLC size changes. The baseline suffers more cache pollution as the LLC size decreases. For forkbench, RISC's hit rate drops from 67% (4MB) to 10% (256KB), while the baseline's stays around 20% to 19%.
Combined Applications. [Chart: performance of the combined LISA applications, with labeled gains of 59%, +8%, and +16%.]
VILLA Caching Policy. Benefit-based caching policy [HPCA'13]: a benefit counter tracks the number of accesses per cached row; any caching policy can be applied to VILLA. Configuration: 32 rows per fast subarray, 4 fast subarrays per bank, 1.6% area overhead.
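A toy sketch of what a benefit-counter-based policy could look like for the configuration above. The eviction rule (evict the cached row with the lowest counter) and the data structures are illustrative assumptions, not the exact policy of [HPCA'13].

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>

// Toy benefit-based cache for rows held in the fast subarrays of one bank.
// Assumed policy: count accesses per cached row; when full, evict the row
// with the smallest benefit counter.
class VillaCache {
public:
    explicit VillaCache(std::size_t capacity_rows) : capacity_(capacity_rows) {}

    // Returns true on a fast-subarray hit; on a miss, caches the row
    // (which would use LISA to move it from a slow to a fast subarray).
    bool access(std::uint64_t row_id) {
        auto it = benefit_.find(row_id);
        if (it != benefit_.end()) {
            ++it->second;            // hit: increase the row's benefit counter
            return true;
        }
        if (benefit_.size() >= capacity_) {
            // Evict the cached row with the lowest benefit counter.
            auto victim = std::min_element(
                benefit_.begin(), benefit_.end(),
                [](const auto& a, const auto& b) { return a.second < b.second; });
            benefit_.erase(victim);
        }
        benefit_[row_id] = 1;        // miss: cache the row (one LISA row move)
        return false;
    }

private:
    std::size_t capacity_;
    std::unordered_map<std::uint64_t, std::uint32_t> benefit_;  // row id -> access count
};

int main() {
    // Per the slide: 4 fast subarrays/bank x 32 rows = 128 cacheable rows.
    VillaCache cache(4 * 32);
    cache.access(10); cache.access(10); cache.access(77);
}
```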
Area Measurement. Based on "Row-Buffer Decoupling" by O et al., ISCA'14: 28nm DRAM process, 3 metal layers, 8Gb density and 8 banks per device.
Other Slides
3. Linked Precharge (LIP). Problem: the precharge time is limited by the drive strength of a single precharge unit (PU). Key idea: LISA's connectivity enables DRAM to utilize additional PUs from neighboring subarrays, precharging a subarray's bitlines faster than conventional DRAM.
Key Idea and Applications. Low-Cost Inter-Linked Subarrays (LISA): fast bulk data movement between subarrays via a wide datapath of isolation transistors, at 0.8% DRAM chip area overhead. LISA is a versatile substrate that enables new applications: 1. Fast bulk data copy: copy latency 1.3µs → 0.1µs (9x); 66% system performance and 55% energy-efficiency improvement. 2. In-DRAM caching: access latency 48ns → 21ns (2x); 5% system performance. 3. Linked precharge: precharge latency 13ns → 5ns (2x); 8% system performance.
Low-Cost Inter-Linked Subarrays (LISA). [Figure: adjacent subarrays with their row decoders and row buffers (sense amplifiers and precharge units), connected by a LISA link.]
Key Idea and Applications. Low-Cost Inter-Linked Subarrays (LISA): fast bulk data movement between subarrays via a wide datapath of isolation transistors, at 0.8% DRAM chip area overhead. LISA is a versatile substrate that enables new applications: fast bulk data copy (copy latency 1.363µs → 0.148µs, 9.2x; +66% speedup, -55% DRAM energy), in-DRAM caching (hot-data access latency 48.7ns → 21.5ns, 2.2x; 5% system performance), and fast precharge (precharge latency 13.1ns → 5ns, 2.6x; 8% system performance).
Moving Data Inside DRAM? A bank consists of subarrays (Subarray 1 ... Subarray N, each 512 rows of 8Kb) connected only by a 64-bit internal data bus. Low connectivity in DRAM is the fundamental bottleneck for bulk data movement. Goal: Provide a new substrate to enable wide connectivity between subarrays.
Low Connectivity in DRAM. Problem: simply moving data inside DRAM is inefficient; the subarrays in a bank share only a 64-bit internal data bus. Low connectivity in DRAM is the fundamental bottleneck for bulk data movement. Goal: Provide a new substrate to enable wide connectivity between subarrays.
Low Connectivity in DRAM. Problem: simply moving data inside DRAM is inefficient; banks share only a 64-bit internal data bus. Low connectivity in DRAM is the fundamental bottleneck for bulk data movement.