Improving Cache Performance by Exploiting Read-Write Disparity

Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez
Summary

- Read misses are more critical than write misses: read misses can stall the processor, while writes are not on the critical path
- Problem: Cache management does not exploit read-write disparity
- Goal: Design a cache that favors reads over writes to improve performance
  - Lines that are only written to are less critical
  - Prioritize lines that service read requests
- Key observation: Applications differ in their read reuse behavior in clean and dirty lines
- Idea: Read-Write Partitioning
  - Dynamically partition the cache between clean and dirty lines
  - Protect the partition that has more read hits
- Improves performance over three recent mechanisms
Outline

- Motivation
- Reuse Behavior of Dirty Lines
- Read-Write Partitioning
- Results
- Conclusion
 
Motivation

[Timeline figure: Rd A stalls the processor; Wr B is buffered and written back off the critical path; Rd C stalls the processor again.]

- Read and write misses are not equally critical
- Read misses are more critical than write misses
  - Read misses can stall the processor
  - Writes are not on the critical path
- Cache management does not exploit the disparity between read and write requests
Key Idea

- Favor reads over writes in the cache
- Differentiate between lines that are read and lines that are only written to
  - The cache should protect lines that serve read requests
  - Lines that are only written to are less critical
- Improve performance by maximizing read hits

An Example

Access pattern, repeated each iteration: Rd A, Wr B, Rd B, Wr C, Rd D
(A and D are read-only, B is read and written, C is write-only)
LRU Replacement Policy:
Rd A (miss, STALL), Wr B (miss), Rd B (hit), Wr C (miss), Rd D (miss, STALL)
Writebacks of B and C; 2 stalls per iteration

Read-Biased Replacement Policy (replace the write-only line C):
Rd A (hit), Wr B (hit), Rd B (hit), Wr C (miss), Rd D (miss, STALL)
1 stall per iteration; cycles saved

Evicting lines that are only written to can improve performance.
Dirty lines are treated differently depending on read requests.
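The two policies in this example can be replayed with a few lines of Python. This is only a toy, fully-associative model, not the paper's mechanism: each line remembers whether it has served a read since it was filled, and the read-biased policy prefers to evict a line that has not.

```python
from collections import OrderedDict

def read_misses_per_iteration(pattern, ways, read_biased, iterations=10):
    """Replay `pattern` repeatedly in a tiny fully-associative cache and
    return the number of read misses (processor stalls) in the last
    iteration. Each entry tracks whether it has served a read."""
    cache = OrderedDict()                 # tag -> has served a read
    misses = 0
    for _ in range(iterations):
        misses = 0
        for op, tag in pattern:
            if tag in cache:
                cache.move_to_end(tag)    # update LRU recency
                if op == "R":
                    cache[tag] = True
            else:
                if op == "R":
                    misses += 1           # a read miss stalls the core
                if len(cache) == ways:
                    victim = None
                    if read_biased:
                        # prefer a line that was only written to
                        victim = next((t for t, r in cache.items() if not r), None)
                    if victim is None:
                        victim = next(iter(cache))   # fall back to LRU
                    del cache[victim]
                cache[tag] = (op == "R")
    return misses

pattern = [("R", "A"), ("W", "B"), ("R", "B"), ("W", "C"), ("R", "D")]
print(read_misses_per_iteration(pattern, ways=3, read_biased=False))  # 2
print(read_misses_per_iteration(pattern, ways=3, read_biased=True))   # 1
```

With 3 ways, plain LRU repeats the 2-stalls-per-iteration pattern from the slide, while the read-biased victim choice keeps A and B resident and stalls only on Rd D.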
 
Reuse Behavior of Dirty Lines

- Not all dirty lines are the same
  - Write-only lines: do not receive read requests and can be evicted
  - Read-write lines: receive read requests and should be kept in the cache
- Evicting write-only lines provides more space for read lines and can improve performance

[Figure: breakdown of LLC cachelines into dirty write-only vs. dirty read-write across SPEC benchmarks.]
On average, 37.4% of lines are write-only and 9.4% of lines are both read and written.
Applications have different read reuse behavior in dirty lines.
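The write-only vs. read-write breakdown can be sketched as below, on a toy trace with hypothetical line addresses: each line carries a "served a read since fill" bit, and a dirty line is classified when it leaves the cache.

```python
from collections import OrderedDict

def classify_dirty_lines(trace, ways=2):
    """trace: list of ('R'|'W', addr). Returns (write_only, read_write)
    counts over dirty lines, classified at eviction (LRU) or at the end."""
    cache = OrderedDict()                 # addr -> {'dirty': bool, 'read': bool}
    counts = {"write_only": 0, "read_write": 0}

    def retire(line):
        if line["dirty"]:                 # only dirty lines are classified
            counts["read_write" if line["read"] else "write_only"] += 1

    for op, addr in trace:
        if addr in cache:
            cache.move_to_end(addr)
            cache[addr]["read"] |= (op == "R")
            cache[addr]["dirty"] |= (op == "W")
        else:
            if len(cache) == ways:
                retire(cache.popitem(last=False)[1])   # evict LRU line
            cache[addr] = {"dirty": op == "W", "read": op == "R"}
    for line in cache.values():           # drain lines still resident
        retire(line)
    return counts["write_only"], counts["read_write"]

trace = [("W", 1), ("W", 2), ("R", 2), ("W", 3), ("R", 4), ("R", 3)]
print(classify_dirty_lines(trace))        # (1, 2): line 1 is write-only
```

In this toy trace, line 1 is written and never read (write-only), while lines 2 and 3 are both written and read back.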
 
Read-Write Partitioning

- Goal: Exploit the different read reuse behavior in dirty lines to maximize the number of read hits
- Observation:
  - Some applications have more reads to clean lines
  - Other applications have more reads to dirty lines
- Read-Write Partitioning: dynamically partitions the cache into clean and dirty lines, and evicts lines from the partition that has less read reuse
- Improves performance by protecting lines with more read reuse

[Figure: number of reads to clean vs. dirty lines over time for soplex and xalancbmk.]
Applications have significantly different read reuse behavior in clean and dirty lines.
Read-Write Partitioning: Mechanism

- Utilize the disparity in read reuse between clean and dirty lines
- Partition the cache into clean and dirty lines
- Predict the partition size that maximizes read hits
- Maintain the partition through replacement
  - DIP [Qureshi et al. 2007] selects the victim within the partition

[Figure: a cache set split into dirty and clean lines; with a predicted best dirty-partition size of 3, a new fill replaces a line from the dirty partition.]
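A minimal sketch of how replacement can maintain the partition (an assumed interface, not the paper's exact hardware): when the dirty lines in a set exceed the predicted best dirty-partition size, the victim comes from the dirty partition, otherwise from the clean one, with plain LRU standing in for DIP's victim selection.

```python
def select_victim(set_lines, target_dirty):
    """set_lines: one cache set, each way a dict with a 'dirty' flag and an
    'lru' rank (higher = less recently used). Returns the victim way index."""
    dirty = [i for i, line in enumerate(set_lines) if line["dirty"]]
    clean = [i for i, line in enumerate(set_lines) if not line["dirty"]]
    # Evict from the dirty partition while it exceeds its target size,
    # otherwise shrink the clean partition.
    pool = dirty if len(dirty) > target_dirty else clean
    if not pool:                          # one partition is empty: use all ways
        pool = list(range(len(set_lines)))
    # LRU within the chosen partition (the paper uses DIP here instead).
    return max(pool, key=lambda i: set_lines[i]["lru"])

ways = [
    {"dirty": True,  "lru": 3},
    {"dirty": False, "lru": 1},
    {"dirty": True,  "lru": 0},
    {"dirty": False, "lru": 2},
]
print(select_victim(ways, target_dirty=1))   # 0: dirty partition over target
print(select_victim(ways, target_dirty=2))   # 3: evict the LRU clean line instead
```

Evictions thus steer each set toward the predicted split without any explicit repartitioning step.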
Predicting Partition Size

- Predicts the partition size using sampled shadow tags
  - Based on utility-based partitioning [Qureshi et al. 2006]
- Counts the number of read hits in clean and dirty lines
- Picks the partition (x, associativity - x) that maximizes the number of read hits

[Figure: sampled sets with shadow tags and read-hit counters for dirty and clean lines; the split with the maximum number of read hits is chosen.]
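The size search can be sketched as below (an assumed counter layout, in the spirit of utility-based partitioning): the shadow tags record read hits per LRU stack position, separately for clean and dirty lines, and a hit at position p would be captured by any partition with more than p ways.

```python
def best_partition(clean_hits, dirty_hits):
    """clean_hits[p] / dirty_hits[p]: read hits observed at LRU stack
    position p in the shadow tags. Returns (clean_ways, dirty_ways)
    maximizing the total number of read hits."""
    assoc = len(clean_hits)               # same length as dirty_hits
    best_x, best_hits = 0, -1
    for x in range(assoc + 1):            # x clean ways, assoc - x dirty ways
        hits = sum(clean_hits[:x]) + sum(dirty_hits[:assoc - x])
        if hits > best_hits:
            best_x, best_hits = x, hits
    return best_x, assoc - best_x

# Toy 4-way example: clean lines collect most of the read hits.
print(best_partition([8, 5, 2, 0], [4, 1, 0, 0]))   # (3, 1)
```

Because only read hits are counted, a partition size that shelters write-only dirty lines scores poorly and is not selected.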
 
Methodology

- CMP$im x86 cycle-accurate simulator [Jaleel et al. 2008]
- 4MB 16-way set-associative LLC
- 32KB I+D L1, 256KB L2
- 200-cycle DRAM access time
- 550M representative instructions
- Benchmarks: 10 memory-intensive SPEC benchmarks, 35 multi-programmed applications
Comparison Points

- DIP, RRIP: insertion policies [Qureshi et al. 2007, Jaleel et al. 2010]
  - Avoid thrashing and cache pollution by dynamically inserting lines at different stack positions
  - Low overhead
  - Do not differentiate between read and write accesses
- SUP+: Single-Use Reference Predictor [Piquet et al. 2007]
  - Avoids cache pollution by bypassing lines that do not receive re-references
  - High accuracy
  - Does not differentiate between read and write accesses, so it does not bypass write-only lines
  - High storage overhead: needs the PC in the LLC
Comparison Points: Read Reference Predictor (RRP)

- A new predictor inspired by prior work [Tyson et al. 1995, Piquet et al. 2007]
- Identifies read and write-only lines by the allocating PC and bypasses write-only lines
- Writebacks are not associated with any PC, so RRP associates the allocating PC with the line in the L1 and passes the PC down through the L2 and LLC
- Marks a PC (P in the figure) as one that allocates lines that are never read again
- High storage overhead

[Figure: a timeline of writebacks to line A; the writebacks themselves carry no allocating PC, so the allocating PC recorded in the L1 is passed down with the line.]
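RRP's PC-based prediction can be sketched roughly as follows; the counter width and training rule here are assumptions for illustration, not the paper's exact design. When a line is evicted without ever serving a read, its allocating PC is trained toward "write-only", and later fills from a write-only PC can be bypassed.

```python
class ReadRefPredictor:
    """Toy PC-indexed predictor in the spirit of RRP (assumed 2-bit
    saturating counters; the real design differs)."""
    def __init__(self):
        self.counters = {}               # allocating PC -> counter value

    def predict_write_only(self, pc):
        return self.counters.get(pc, 0) >= 2   # confident: bypass the fill

    def train(self, pc, was_read):
        c = self.counters.get(pc, 0)
        # move toward "write-only" on a never-read eviction, back on a read
        self.counters[pc] = max(0, c - 1) if was_read else min(3, c + 1)

rrp = ReadRefPredictor()
pc = 0x400A10                            # hypothetical allocating PC
rrp.train(pc, was_read=False)            # line evicted without a read
rrp.train(pc, was_read=False)
print(rrp.predict_write_only(pc))        # True: bypass fills from this PC
rrp.train(pc, was_read=True)             # a later line from this PC was read
print(rrp.predict_write_only(pc))        # False
```

The storage cost the slide mentions comes from carrying the allocating PC with every line through the hierarchy, which this sketch glosses over.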
Single Core Performance

[Figure: speedup over baseline LRU for DIP, RRIP, SUP+, RRP (48.4KB storage overhead), and RWP (2.6KB).]

- Differentiating read vs. write-only lines improves performance over recent mechanisms
- RWP performs within 3.4% of RRP but requires 18X less storage overhead
4-Core Performance

[Figure: speedup over baseline LRU for DIP, RRIP, SUP+, RRP, and RWP on workload mixes with 0 to 4 memory-intensive applications; gains of +4.5% and +8% are highlighted.]

- Differentiating read vs. write-only lines improves performance over recent mechanisms
- More benefit when more applications are memory intensive
Average Memory Traffic

[Figure: breakdown of memory traffic into misses and writebacks for the baseline and RWP.]

RWP increases writeback traffic by 2.5% but reduces overall memory traffic by 16%.
Dirty Partition Sizes

[Figures: natural vs. predicted dirty-partition sizes across benchmarks, and over the course of execution.]

- Partition size varies significantly for some benchmarks
- Partition size also varies significantly during runtime for some benchmarks
 
Conclusion

- Problem: Cache management does not exploit read-write disparity
- Goal: Design a cache that favors read requests over write requests to improve performance
  - Lines that are only written to are less critical
  - Protect lines that serve read requests
- Key observation: Applications differ in their read reuse behavior in clean and dirty lines
- Idea: Read-Write Partitioning
  - Dynamically partition the cache into clean and dirty lines
  - Protect the partition that has more read hits
- Results: Improves performance over three recent mechanisms
 
 
 
 
 
Thank you
 