Improving Cache Performance Through Read-Write Disparity
This study explores how exploiting the difference between read and write requests can enhance cache performance by prioritizing read operations over writes. By dynamically partitioning the cache between clean and dirty lines and protecting the partition that receives more read hits, the proposed method delivers significant performance improvements over existing mechanisms.
- Cache management
- Read-write disparity
- Cache performance
- Dynamic partitioning
- Performance optimization
Improving Cache Performance by Exploiting Read-Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez
Summary
- Read misses are more critical than write misses: read misses can stall the processor, while writes are not on the critical path.
- Problem: cache management does not exploit read-write disparity.
- Goal: design a cache that favors reads over writes to improve performance. Lines that are only written to are less critical; prioritize lines that service read requests.
- Key observation: applications differ in their read reuse behavior in clean and dirty lines.
- Idea: Read-Write Partitioning. Dynamically partition the cache between clean and dirty lines and protect the partition that has more read hits.
- Result: improves performance over three recent mechanisms.
Outline
- Motivation
- Reuse Behavior of Dirty Lines
- Read-Write Partitioning
- Results
- Conclusion
Motivation
- Read and write misses are not equally critical: a read miss can stall the processor, while writes are buffered and written back off the critical path.
- [Timeline figure: reads Rd A and Rd C each stall the pipeline; the write Wr B is absorbed by the buffer/writeback path.]
- Cache management does not exploit this disparity between read and write requests.
Key Idea
- Favor reads over writes in the cache: differentiate lines that serve reads from lines that are only written to.
- The cache should protect lines that serve read requests; write-only lines are less critical.
- Improve performance by maximizing read hits.
- [Example figure: for the access stream Rd A, Wr B, Rd B, Wr C, Rd D, lines A and D are read-only, B is read and written, and C is write-only.]
An Example
- Consider a loop that repeats the access stream Rd A, Wr B, Rd B, Wr C, Rd D on a three-line cache.
- With LRU replacement, both Rd A and Rd D miss, giving 2 stalls per iteration.
- With a read-biased replacement policy that evicts the write-only line C instead, A, B, and D stay resident, giving 1 stall per iteration and saving cycles.
- Evicting lines that are only written to can improve performance, so dirty lines should be treated differently according to the read requests they serve.
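The two policies in this example can be sketched in a toy simulator. The code below is illustrative only (a 3-line cache, plain LRU recency, and a simplified read-biased rule that prefers evicting the least-recently-used line that has never serviced a read); it is not the paper's mechanism:

```python
def simulate(trace, capacity, read_biased):
    """Replay (op, tag) accesses; return the number of read misses (stalls)."""
    cache = []  # list of [tag, serviced_read]; index 0 is MRU
    read_misses = 0
    for op, tag in trace:
        entry = next((e for e in cache if e[0] == tag), None)
        if entry is not None:                 # hit: update recency and read flag
            cache.remove(entry)
            if op == "R":
                entry[1] = True
            cache.insert(0, entry)
            continue
        if op == "R":                         # only read misses stall the core
            read_misses += 1
        if len(cache) == capacity:            # pick a victim
            victim = None
            if read_biased:                   # prefer a line never read (write-only)
                victim = next((e for e in reversed(cache) if not e[1]), None)
            if victim is None:
                victim = cache[-1]            # fall back to plain LRU
            cache.remove(victim)
        cache.insert(0, [tag, op == "R"])
    return read_misses

# One loop iteration from the slide: Rd A, Wr B, Rd B, Wr C, Rd D
trace = [("R", "A"), ("W", "B"), ("R", "B"), ("W", "C"), ("R", "D")] * 10
print(simulate(trace, 3, read_biased=False))  # LRU: 2 read misses per steady-state iteration
print(simulate(trace, 3, read_biased=True))   # read-biased: 1 read miss per steady-state iteration
```

In steady state the read-biased policy keeps A, B, and D resident and repeatedly sacrifices the write-only line C, matching the one-stall-per-iteration behavior on the slide.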
Reuse Behavior of Dirty Lines
- Not all dirty lines are the same.
- Write-only lines do not receive read requests and can be evicted.
- Read-write lines receive read requests and should be kept in the cache.
- Evicting write-only lines provides more space for read lines and can improve performance.
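As a minimal illustration of this distinction (the function and names are mine, not the paper's), a single pass over an access trace can separate write-only lines from read-write lines:

```python
def classify_dirty_lines(trace):
    """Split dirtied lines into write-only vs. read-write, given (op, tag) pairs."""
    dirty = set()             # tags written at least once
    read_while_dirty = set()  # dirty tags that later serviced a read
    for op, tag in trace:
        if op == "W":
            dirty.add(tag)
        elif tag in dirty:
            read_while_dirty.add(tag)
    return dirty - read_while_dirty, read_while_dirty

# B is written and later read (read-write); C is only written (write-only, evictable)
write_only, read_write = classify_dirty_lines(
    [("R", "A"), ("W", "B"), ("R", "B"), ("W", "C"), ("R", "D")])
print(write_only, read_write)  # {'C'} {'B'}
```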
[Figure: breakdown of LLC cache lines into dirty write-only and dirty read-write percentages across the SPEC CPU2006 benchmarks.]
Applications have different read reuse behavior: on average, 37.4% of LLC lines are dirty write-only and 9.4% are dirty read-write.
Read-Write Partitioning
- Goal: exploit the different read reuse behavior of dirty lines to maximize the number of read hits.
- Observation: some applications have more reads to clean lines, while others have more reads to dirty lines.
- Read-Write Partitioning dynamically partitions the cache into clean and dirty lines and evicts from the partition that has less read reuse.
- This improves performance by protecting the lines with more read reuse.
[Figure: number of reads to clean vs. dirty lines over the first 500M instructions, normalized to reads to clean lines at 100M instructions, for soplex and xalancbmk.]
Applications have significantly different read reuse behavior in clean and dirty lines.
Read-Write Partitioning
- Utilize the disparity in read reuse between clean and dirty lines.
- Partition the cache into clean and dirty lines.
- Predict the partition size that maximizes the number of read hits.
- Maintain the partition through replacement; DIP [Qureshi et al. 2007] selects the victim within the partition.
- [Figure: each cache set is split into clean and dirty partitions; when the dirty partition grows beyond the predicted best size, the victim is replaced from the dirty partition.]
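The replacement step might be sketched as below. This is a simplification: it uses plain LRU within the chosen partition instead of DIP, and all types and names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Line:
    tag: str
    dirty: bool
    last_use: int  # smaller = older

def choose_victim(set_lines, target_dirty):
    """Evict from the dirty partition when it exceeds its predicted size,
    otherwise from the clean partition."""
    dirty = [l for l in set_lines if l.dirty]
    clean = [l for l in set_lines if not l.dirty]
    pool = dirty if len(dirty) > target_dirty else (clean or dirty)
    return min(pool, key=lambda l: l.last_use)  # oldest line in that partition

lines = [Line("A", False, 3), Line("B", True, 1),
         Line("C", True, 2), Line("D", False, 4)]
print(choose_victim(lines, target_dirty=1).tag)  # dirty over quota: evict "B"
print(choose_victim(lines, target_dirty=2).tag)  # dirty within quota: evict "A"
```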
Predicting Partition Size
- Predicts the partition size using sampled shadow tags, based on utility-based partitioning [Qureshi et al. 2006].
- Counts the number of read hits in clean and dirty lines at each recency position (MRU ... LRU).
- Picks the partition (x clean ways, associativity - x dirty ways) that maximizes the number of read hits.
- [Figure: sampled sets maintain separate clean and dirty shadow tags with per-position read-hit counters; the predictor selects the split with the maximum read hits.]
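The selection itself could be sketched as follows, assuming per-recency-position read-hit counters gathered from the sampled shadow tags (the counter values below are invented for illustration):

```python
def best_clean_ways(clean_hits, dirty_hits, assoc):
    """Return x, the number of clean ways (0..assoc) that maximizes read hits.
    clean_hits[i] / dirty_hits[i] count read hits at recency position i
    (0 = MRU) in the clean and dirty shadow tags of the sampled sets."""
    def hits(x):
        # With x clean ways, clean-line read hits at positions < x would still
        # hit; likewise for the remaining assoc - x dirty ways.
        return sum(clean_hits[:x]) + sum(dirty_hits[:assoc - x])
    return max(range(assoc + 1), key=hits)

# Toy 4-way example: most read hits land near the MRU of the clean partition
print(best_clean_ways([10, 8, 2, 0], [5, 1, 0, 0], assoc=4))  # 3
```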
Methodology
- CMP$im, a cycle-accurate x86 simulator [Jaleel et al. 2008]
- 4MB, 16-way set-associative LLC; 32KB I and D L1 caches; 256KB L2
- 200-cycle DRAM access time
- 550M representative instructions
- Benchmarks: 10 memory-intensive SPEC benchmarks and 35 multi-programmed workloads
Comparison Points
- DIP and RRIP: insertion policies [Qureshi et al. 2007; Jaleel et al. 2010] that avoid thrashing and cache pollution by dynamically inserting lines at different stack positions. Low overhead, but they do not differentiate between read and write accesses.
- SUP+: Single-Use Reference Predictor [Piquet et al. 2007], which avoids cache pollution by bypassing lines that do not receive re-references. High accuracy, but it does not differentiate between read and write accesses, does not bypass write-only lines, and has high storage overhead (it needs the PC in the LLC).
Comparison Point: Read Reference Predictor (RRP)
- A new predictor inspired by prior work [Tyson et al. 1995; Piquet et al. 2007] that identifies read and write-only lines by their allocating PC and bypasses write-only lines.
- Writebacks are not associated with any PC, so RRP records the allocating PC in the L1 and passes it down through the L2 and LLC.
- [Timeline figure: PC P reads line A once; A is then repeatedly written back (from PC Q) and never read again, so P is marked as a PC that allocates lines that are never read again.]
- High storage overhead.
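A much-simplified sketch of a PC-based bypass predictor in this spirit (a single untagged set standing in for the real predictor tables; the actual RRP design differs):

```python
class ReadRefPredictor:
    """Remember allocating PCs whose lines were evicted without ever being read."""

    def __init__(self):
        self.dead_pcs = set()

    def on_evict(self, alloc_pc, serviced_read):
        # Train on the evicted line's history: PCs that allocate
        # never-read (write-only) lines become bypass candidates.
        if serviced_read:
            self.dead_pcs.discard(alloc_pc)
        else:
            self.dead_pcs.add(alloc_pc)

    def should_bypass(self, alloc_pc):
        # Writebacks carry the PC that originally allocated the line in L1,
        # since the writeback itself has no PC of its own.
        return alloc_pc in self.dead_pcs

p = ReadRefPredictor()
p.on_evict(0x400, serviced_read=False)   # line from PC 0x400 was never read
print(p.should_bypass(0x400))            # True: bypass future fills from 0x400
```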
Single-Core Performance
[Figure: speedup over a baseline LRU cache for DIP, RRIP, SUP+, RRP, and RWP; RRP needs 48.4KB of storage, RWP only 2.6KB.]
- Differentiating read vs. write-only lines improves performance over recent mechanisms.
- RWP performs within 3.4% of RRP while requiring 18X less storage overhead.
4-Core Performance
[Figure: speedup over a baseline LRU cache for DIP, RRIP, SUP+, RRP, and RWP, grouped by the number of memory-intensive applications per 4-core workload (0 to 4); annotated gains of +4.5% and +8% for the more intensive mixes.]
- Differentiating read vs. write-only lines improves performance over recent mechanisms, and the benefit grows as more applications in the workload are memory intensive.
Average Memory Traffic
[Figure: memory traffic breakdown for the baseline and RWP, as a percentage of baseline traffic; writebacks grow from 15% to 17% while misses drop from 85% to 66%.]
RWP increases writeback traffic by 2.5% but reduces overall memory traffic by 16%.
Dirty Partition Sizes
[Figure: natural vs. predicted dirty partition sizes (number of cache lines) across the SPEC CPU2006 benchmarks.]
The partition size varies significantly across benchmarks and, for some benchmarks, over the course of execution.
Conclusion
- Problem: cache management does not exploit read-write disparity.
- Goal: design a cache that favors read requests over write requests to improve performance. Lines that are only written to are less critical; protect lines that serve read requests.
- Key observation: applications differ in their read reuse behavior in clean and dirty lines.
- Idea: Read-Write Partitioning. Dynamically partition the cache into clean and dirty lines and protect the partition that has more read hits.
- Results: improves performance over three recent mechanisms.
Thank you