Efficient Cache Management using The Dirty-Block Index

The Dirty-Block Index
Vivek Seshadri, Abhishek Bhowmick, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, Todd C. Mowry
Summary
 
Problem: Dirty bit organization in caches does not match queries
Inefficiency and performance loss
The Dirty-Block Index (DBI)
Remove dirty bits from cache tag store
DRAM row-oriented organization of dirty bits
Efficiently respond to queries
Get all dirty blocks of a DRAM row; Is block B dirty?
Enables efficient implementation of many optimizations
DRAM-aware writeback, bypassing cache lookup, reducing ECC cost, …
Improves performance while reducing overall cache area
28% performance over baseline, 6% over state-of-the-art (8-core)
8% cache area reduction
Information: Organization and Query
Organization
Mismatch between Organization and Query
Get all the books written by author X
Bad organization for the query
Metadata: Information About a Cache Block
Block Address
Block-Oriented Metadata Organization
Block-Oriented Metadata Organization
Simple to Implement
Scalable
Any metadata query requires an expensive tag store lookup
Is this the best organization?
Focus of This Work
Is putting the dirty bit in the tag entry the best approach?
Queried by many operations and optimizations
Outline
Introduction
Shortcomings of Block-Oriented Organization
The Dirty-Block Index (DBI)
Optimizations Enabled by DBI
Evaluation
Conclusion
DRAM-Aware Writeback
Last-Level Cache, Memory Controller, DRAM, Channel
Virtual Write Queue [ISCA 2010], DRAM-Aware Writeback [TR-HPS-2010-2]
DRAM-Aware Writeback
Dirty Block
Proactively write back all other dirty blocks from the same DRAM row
Last-Level Cache, Memory Controller
Significantly increases the DRAM write row hit rate
Get all dirty blocks of DRAM row ‘R’
Virtual Write Queue [ISCA 2010], DRAM-Aware Writeback [TR-HPS-2010-2]
Shortcoming of Block-Oriented Organization
Get all dirty blocks of DRAM row ‘R’
Cache Tag Store
Is block 1 of Row R dirty?
Is block 2 of Row R dirty?
Is block 3 of Row R dirty?
…
Is block 128 of Row R dirty?
Requires many expensive (possibly unnecessary) tag lookups
Significantly increases tag store contention
Inefficient
Many Cache Optimizations/Operations
DRAM-aware Writeback
Bulk DMA
Bypassing Cache Lookup
Load Balancing Memory Accesses
Cache Flushing
DRAM Write Scheduling
Metadata for Dirty Blocks
Queries for the Dirty Bit Information
Get all dirty blocks that belong to a coarse-grained region
Is block ‘B’ dirty?
Block-based dirty bit organization is inefficient for both queries
Outline
Introduction
Shortcomings of Block-Oriented Organization
The Dirty-Block Index (DBI)
Optimizations Enabled by DBI
Evaluation
Conclusion
The Dirty-Block Index
Tag Entry: V, Block Address, D
Cache Tag Store
DBI
DRAM row-oriented organization of dirty bits
The Dirty-Block Index
DBI Entry: DRAM row address
DBI Semantics
A block in the cache is dirty if and only if
1. The DBI has a valid entry for the DRAM row that contains the block, and
2. The dirty bit for the block in the bit vector of the corresponding DBI entry is set
DBI Semantics by Example
DBI Entry: 100; DRAM row address; dirty bit vector (one bit per block)
Dirty Block
Even if it is present in the cache, it is not dirty.
Benefits of DBI
Get all dirty blocks of DRAM row ‘R’
A single lookup to Row R in the DBI
Compared to 128 lookups with existing organization
Is block ‘B’ dirty?
DBI is faster than the tag store
Outline
Introduction
Shortcomings of Block-Oriented Organization
The Dirty-Block Index (DBI)
Optimizations Enabled by DBI
Evaluation
Conclusion
DRAM-Aware Writeback (optimization 1)
Proactively write back all other dirty blocks from the same DRAM row
Last-Level Cache
Virtual Write Queue [ISCA 2010], DRAM-Aware Writeback [TR-HPS-2010-2]
DBI achieves the benefit of DRAM-aware writeback without increasing contention for the tag store!
Bypassing Cache Lookups (optimization 2)
Mostly-No Monitors [HPCA 2003], SkipCache [PACT 2012]
If an access is likely to miss, we can bypass the tag lookup!
Reduces access latency/energy; Reduces tag store contention
Miss Predictor, Cache Tag Store, DBI
Dirty Block
No
1. No false negatives
2. Write through
Not desirable
DBI seamlessly enables simpler and more aggressive miss predictors!
Reducing ECC Overhead (optimization 3)
ECC-Cache [IAS 2009], Memory-mapped ECC [ISCA 2009], ECC-FIFO [SC 2009]
Dirty block – Requires error correction
Clean block – Requires only error detection
Dirty, Cache, ECC, EDC
Reducing ECC Overhead (optimization 3)
Cache, EDC, DBI, ECC
DBI enables a simpler mechanism to reduce ECC cost.
8% reduction in overall cache area!
DBI – Other Optimizations
Load balancing memory accesses in hybrid memory
Better DRAM write scheduling
Fast cache flushing
Bulk DMA coherence
(Discussed in paper)
Outline
Introduction
Shortcomings of Block-Oriented Organization
The Dirty-Block Index (DBI)
Optimizations Enabled by DBI
Evaluation
Conclusion
Evaluation Methodology
2.67 GHz, single issue, OoO, 128-entry instruction window
Cache Hierarchy
32 KB private L1 cache, 256 KB private L2 cache
2MB/core Shared L3 cache
DDR3-1066 DRAM
1 channel, 1 rank, 8 banks, 8KB row buffer, FR-FCFS, open row policy
SPEC CPU2006, STREAM
Multi-core
102 2-core, 259 4-core, and 120 8-core workloads
Multiple metrics for performance and fairness
Mechanisms
Dynamic Insertion Policy (Baseline) (ISCA 2007, PACT 2008)
DRAM-Aware Writeback (DAWB) (TR-HPS-2010-2 UT Austin)
Virtual Write Queue (ISCA 2010)
Skip Cache (PACT 2012)
Dirty-Block Index
+ No Optimization
+ Aggressive Writeback
+ Cache Lookup Bypass
+ Both Optimizations (DBI+Both)
Effect on Writes and Tag Lookups
DBI achieves almost all the benefits of DAWB with significantly lower tag store contention
System Performance
[Chart labels, in slide order: 13%, 0%; 23%, 4%; 35%, 6%; 28%, 6%]
Reduced tag store contention due to DBI translates to significant performance improvement
Other Results in Paper
Detailed cache area analysis (with and without ECC)
DBI power consumption analysis
Effect of individual optimizations
Other multi-core performance/fairness metrics
Sensitivity to DBI parameters
Sensitivity to cache size/replacement policy
Conclusion
 
The Dirty-Block Index
Key Idea: DRAM-row oriented dirty-bit organization
Enables efficient implementation of several optimizations
DRAM-Aware writeback, cache lookup bypass, Reducing ECC cost
28% performance over baseline, 6% over best previous work
8% reduction in overall cache area
Wider applicability
Can be applied to other caches
Can be applied to other metadata (e.g., coherence)
Backup Slides
Cache Coherence
M: Exclusive modified
O: Shared modified
E: Exclusive unmodified
S: Shared unmodified
I: Invalid
D
Operation of a Cache with DBI
Cache Tag Store, DBI
1. Read Access: Look up tag store
2. Writeback: Update tag store. Update DBI to indicate the block is dirty.
3. Cache Eviction: Check DBI. Write back if block is dirty
4. DBI Eviction: Write back all blocks marked dirty by the entry
DBI Design Parameters
DBI Granularity (g): Number of blocks tracked by each entry
DBI Design Parameters – Example
1MB Cache, 64B Blocks
DBI: α = ¼, g = 64
Effect on Writes and Tag Lookups
System Performance

The Dirty-Block Index (DBI) is a solution to address inefficiencies in caches by removing dirty bits from cache tag stores, improving query response efficiency, and enabling various optimizations like DRAM-aware writeback. Its implementation leads to significant performance gains and cache area reduction compared to baseline and state-of-the-art approaches.

  • Cache Management
  • Efficiency
  • Optimization
  • DRAM
  • Performance

Uploaded on Sep 26, 2024 | 0 Views



Presentation Transcript


  1. The Dirty-Block Index. Vivek Seshadri, Abhishek Bhowmick, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, Todd C. Mowry.

  2. Summary. Problem: dirty bit organization in caches does not match queries, leading to inefficiency and performance loss. The Dirty-Block Index (DBI): remove dirty bits from the cache tag store; DRAM row-oriented organization of dirty bits; efficiently respond to queries (get all dirty blocks of a DRAM row; is block B dirty?). Enables efficient implementation of many optimizations: DRAM-aware writeback, bypassing cache lookup, reducing ECC cost, … Improves performance while reducing overall cache area: 28% performance over baseline, 6% over state-of-the-art (8-core); 8% cache area reduction.

  3. Information: Organization and Query. Example queries: "Get all files between 2013 and 2014." "Get all the files belonging to males with first name starting with Q." A mismatch between organization and query leads to inefficiency.

  4. Mismatch between Organization and Query. Organization: sorted by title (A, B, C, …, Z). Query: get all the books written by author X. Bad organization for the query.

  5. Metadata: Information About a Cache Block. Tag entry fields: Block Address; V, valid bit; D, dirty bit (writeback cache); Sh, sharing status (multi-cores); Repl, replacement policy (set-associative cache); ECC, error correction (reliability).

  6. Block-Oriented Metadata Organization. The same fields are kept per block: Block Address, V, D, Sh, Repl, ECC.

  7. Block-Oriented Metadata Organization. Each tag entry in the cache tag store holds Block Address, V, D, Sh, Repl, ECC. Simple to implement. Scalable. Any metadata query requires an expensive tag store lookup. Is this the best organization?

  9. Focus of This Work. Tag entry in the cache tag store: Block Address, V, D, Sh, Repl, ECC. The dirty bit is queried by many operations and optimizations. Is putting the dirty bit in the tag entry the best approach?

  10. Outline. Introduction; Shortcomings of Block-Oriented Organization; The Dirty-Block Index (DBI); Optimizations Enabled by DBI; Evaluation; Conclusion.

  11. DRAM-Aware Writeback. Virtual Write Queue [ISCA 2010], DRAM-Aware Writeback [TR-HPS-2010-2]. Components shown: last-level cache, memory controller with write buffer, channel, DRAM with row buffer. 1. Buffer writes and flush them in a burst. 2. Row buffer hits are faster and more efficient than row misses.

  12. DRAM-Aware Writeback. Virtual Write Queue [ISCA 2010], DRAM-Aware Writeback [TR-HPS-2010-2]. For a dirty block in the last-level cache, proactively write back all other dirty blocks from the same DRAM row (R). Significantly increases the DRAM write row hit rate. Required query: get all dirty blocks of DRAM row R.

  13. Shortcoming of Block-Oriented Organization. Query: get all dirty blocks of DRAM row R.

  14. Shortcoming of Block-Oriented Organization. Get all dirty blocks of DRAM row R. A DRAM row is a set of blocks co-located in DRAM: ~8KB = 128 cache blocks. Queries sent to the cache tag store: Is block 1 of Row R dirty? Is block 2 of Row R dirty? Is block 3 of Row R dirty? … Is block 128 of Row R dirty?

  15. Shortcoming of Block-Oriented Organization. Get all dirty blocks of DRAM row R: requires many expensive (possibly unnecessary) tag lookups and significantly increases tag store contention. Inefficient.
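
The cost described in slides 14 and 15 can be illustrated with a toy model. This is only a sketch under stated assumptions: `TagStore`, `dirty_blocks_of_row`, and the mapping of 128 consecutive block addresses to one DRAM row are hypothetical simplifications, not the authors' hardware design.

```python
BLOCK_SIZE = 64                           # bytes per cache block (assumed)
ROW_SIZE = 8 * 1024                       # ~8KB DRAM row, as on slide 14
BLOCKS_PER_ROW = ROW_SIZE // BLOCK_SIZE   # 128 cache blocks per row


class TagStore:
    """Toy tag store: block address -> dirty bit, with a lookup counter."""

    def __init__(self):
        self.entries = {}
        self.lookups = 0

    def lookup(self, block_addr):
        self.lookups += 1
        return self.entries.get(block_addr)  # None if the block is not cached


def dirty_blocks_of_row(tag_store, row):
    """Block-oriented query: one tag lookup per block of the row."""
    first = row * BLOCKS_PER_ROW  # assumes row-contiguous block addresses
    return [b for b in range(first, first + BLOCKS_PER_ROW)
            if tag_store.lookup(b)]


tags = TagStore()
tags.entries[5 * BLOCKS_PER_ROW + 3] = True    # one dirty block in row 5
tags.entries[5 * BLOCKS_PER_ROW + 7] = False   # one clean block in row 5
result = dirty_blocks_of_row(tags, 5)
print(result, tags.lookups)
```

Even though only one block of the row is dirty, the query performs a lookup for every one of the 128 block addresses, which is the tag store contention the slide points at.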

  16. Many Cache Optimizations/Operations. DRAM-aware Writeback; Bulk DMA; Cache Flushing; DRAM Write Scheduling; Bypassing Cache Lookup; Metadata for Dirty Blocks; Load Balancing Memory Accesses.

  17. Queries for the Dirty Bit Information. These operations (DRAM-aware Writeback, Bulk DMA, Cache Flushing, DRAM Write Scheduling, Bypassing Cache Lookup, Metadata for Dirty Blocks, Load Balancing Memory Accesses) issue two kinds of queries: "Get all dirty blocks that belong to a coarse-grained region" and "Is block B dirty?" Block-based dirty bit organization is inefficient for both queries.

  18. Outline. Introduction; Shortcomings of Block-Oriented Organization; The Dirty-Block Index (DBI); Optimizations Enabled by DBI; Evaluation; Conclusion.

  19. The Dirty-Block Index. Tag entry in the cache tag store: Block Address, V, D, Sh, Repl, ECC. DBI: DRAM row-oriented organization of dirty bits.

  20. The Dirty-Block Index. Tag entry in the cache tag store: Block Address, V, Sh, Repl, ECC (the D bit is removed). DBI entry: DBI entry valid bit (V), DRAM row address, dirty bit vector (one bit per block).

  21. DBI Semantics. A block in the cache is dirty if and only if: 1. the DBI has a valid entry for the DRAM row that contains the block, and 2. the dirty bit for the block in the bit vector of the corresponding DBI entry is set.
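
The two conditions in the semantics above can be written as a small predicate. This is a minimal sketch, assuming the DBI is modeled as a dictionary from DRAM row address to an integer bit vector (presence of a key stands in for the entry valid bit); the function name `is_dirty` and the address-to-row mapping are illustrative assumptions.

```python
def is_dirty(dbi, block_addr, granularity=128):
    """Return True iff the DBI marks block_addr as dirty.

    dbi: dict mapping DRAM row address -> int bit vector (hypothetical model).
    granularity: blocks tracked per entry; assumes rows hold consecutive blocks.
    """
    row, offset = divmod(block_addr, granularity)
    bit_vector = dbi.get(row)
    if bit_vector is None:
        # Condition 1 fails: no valid entry for the row, so the block is clean.
        return False
    # Condition 2: the block's bit in the entry's bit vector must be set.
    return bool((bit_vector >> offset) & 1)


# Tiny example with 4 blocks per row: row 100 has blocks 0 and 2 dirty.
dbi = {100: 0b0101}
print(is_dirty(dbi, 400, 4), is_dirty(dbi, 401, 4), is_dirty(dbi, 404, 4))
```

Block 404 belongs to row 101, which has no entry, so it is reported clean regardless of whether it is present in the cache.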

  22. DBI Semantics by Example. DBI entry as shown: 100 1 0 1 0 0, with fields labeled DRAM row address, DBI entry valid bit, and dirty bit vector (one bit per block). A block whose bit is set is dirty. A block whose bit is not set is not dirty, even if it is present in the cache.

  23. Benefits of DBI. Get all dirty blocks of DRAM row R: a single lookup to Row R in the DBI, compared to 128 lookups with existing organization. Is block B dirty? DBI is faster than the tag store.

  24. Outline. Introduction; Shortcomings of Block-Oriented Organization; The Dirty-Block Index (DBI); Optimizations Enabled by DBI; Evaluation; Conclusion.

  25. DRAM-Aware Writeback (optimization 1). Virtual Write Queue [ISCA 2010], DRAM-Aware Writeback [TR-HPS-2010-2]. For a dirty block in the last-level cache, proactively write back all other dirty blocks from the same DRAM row. DBI entry for row R as shown: 1 1 0 0 0 1 0 1 0. Look up the cache only for the blocks whose bits are set.
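
A rough sketch of how a row-level writeback could use a single DBI lookup follows. The dictionary model of the DBI, the helper names, and the 8-block granularity in the example are assumptions made for illustration; the slide does not specify an implementation.

```python
def dirty_offsets(bit_vector, granularity):
    """Offsets within the row whose dirty bit is set."""
    return [i for i in range(granularity) if (bit_vector >> i) & 1]


def writeback_row(dbi, row, granularity, write_to_dram):
    """Write back every dirty block of `row` using one DBI lookup.

    dbi: dict mapping row address -> int bit vector (hypothetical model).
    write_to_dram: callback invoked with each dirty block address.
    """
    # One DBI lookup; removing the entry marks all of the row's blocks clean.
    bit_vector = dbi.pop(row, 0)
    offsets = dirty_offsets(bit_vector, granularity)
    for off in offsets:
        # Only these blocks need a cache access to fetch their data.
        write_to_dram(row * granularity + off)
    return len(offsets)


dbi = {7: 0b10100010}            # row 7: blocks at offsets 1, 5, 7 are dirty
written = []
count = writeback_row(dbi, 7, 8, written.append)
print(count, written)
```

The cache is touched three times here instead of once per block of the row, which is the contention saving the slide describes.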

  26. Bypassing Cache Lookups (optimization 2). Mostly-No Monitors [HPCA 2003], SkipCache [PACT 2012]. If an access is likely to miss, we can bypass the tag lookup! Reduces access latency/energy; reduces tag store contention. Constraints listed on the slide, marked not desirable: 1. no false negatives; 2. write through. Flow with DBI: a read goes to the miss predictor. If no miss is predicted, look up the cache tag store. If a miss is predicted, check the DBI: if the block is dirty, look up the cache tag store; if not, forward to the next level.
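
The decision flow on this slide can be sketched as a routing function. The names `route_read`, `predicts_miss`, and `dbi_is_dirty` are hypothetical, and the miss predictor is treated as an opaque callback because the slide does not describe its internals.

```python
def route_read(block_addr, predicts_miss, dbi_is_dirty):
    """Decide where a read goes, following the slide's flow.

    predicts_miss: callable(block_addr) -> bool, an arbitrary miss predictor.
    dbi_is_dirty:  callable(block_addr) -> bool, the "Is block B dirty?" query.
    """
    if not predicts_miss(block_addr):
        return "tag_store"    # predicted hit: normal cache lookup
    if dbi_is_dirty(block_addr):
        return "tag_store"    # dirty data lives only in the cache
    return "next_level"       # clean or absent: safe to bypass the lookup


dirty = {42}
always_miss = lambda addr: True   # an aggressive predictor, for illustration
print(route_read(42, always_miss, dirty.__contains__))
print(route_read(43, always_miss, dirty.__contains__))
```

Because the DBI check catches every dirty block, a wrong miss prediction costs at most an extra access to the next level and never returns stale data, which is why the predictor can be simple and aggressive.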

  27. Reducing ECC Overhead (optimization 3). ECC-Cache [IAS 2009], Memory-mapped ECC [ISCA 2009], ECC-FIFO [SC 2009]. Dirty block: requires error correction. Clean block: requires only error detection. Existing approaches keep EDC in the cache and ECC for dirty blocks in some other structure, with a complex mechanism to identify the location of the ECC.

  28. Reducing ECC Overhead (optimization 3). ECC-Cache [IAS 2009], Memory-mapped ECC [ISCA 2009], ECC-FIFO [SC 2009]. Dirty block: requires error correction. Clean block: requires only error detection. EDC is shown with the cache and ECC with the DBI. The DBI tracks far fewer blocks than the cache!

  29. DBI: Other Optimizations. Load balancing memory accesses in hybrid memory; better DRAM write scheduling; fast cache flushing; bulk DMA coherence. (Discussed in paper.)

  30. Outline. Introduction; Shortcomings of Block-Oriented Organization; The Dirty-Block Index (DBI); Optimizations Enabled by DBI; Evaluation; Conclusion.

  31. Evaluation Methodology. Core: 2.67 GHz, single issue, OoO, 128-entry instruction window. Cache hierarchy: 32 KB private L1 cache, 256 KB private L2 cache, 2MB/core shared L3 cache. DRAM: DDR3-1066, 1 channel, 1 rank, 8 banks, 8KB row buffer, FR-FCFS, open row policy. Workloads: SPEC CPU2006, STREAM. Multi-core: 102 2-core, 259 4-core, and 120 8-core workloads. Multiple metrics for performance and fairness.

  32. Mechanisms. Dynamic Insertion Policy (Baseline) (ISCA 2007, PACT 2008); DRAM-Aware Writeback (DAWB) (TR-HPS-2010-2 UT Austin); Virtual Write Queue (ISCA 2010); Skip Cache (PACT 2012); Dirty-Block Index: + No Optimization, + Aggressive Writeback, + Cache Lookup Bypass, + Both Optimizations (DBI+Both). Slide annotation: difficult to combine.

  33. Effect on Writes and Tag Lookups. [Bar chart: Memory Writes, Write Row Hits, and Tag Lookups for Baseline, DAWB, and DBI+Both, normalized to Baseline.]

  34. System Performance. [Bar chart: system performance of Baseline, DAWB, and DBI+Both for 1-core, 2-core, 4-core, and 8-core systems. Improvement labels on the chart: 1-core 13% and 0%; 2-core 23% and 4%; 4-core 35% and 6%; 8-core 28% and 6%.]

  35. Other Results in Paper. Detailed cache area analysis (with and without ECC); DBI power consumption analysis; effect of individual optimizations; other multi-core performance/fairness metrics; sensitivity to DBI parameters; sensitivity to cache size/replacement policy.

  36. Conclusion. The Dirty-Block Index. Key idea: DRAM-row oriented dirty-bit organization. Enables efficient implementation of several optimizations: DRAM-aware writeback, cache lookup bypass, reducing ECC cost. 28% performance over baseline, 6% over best previous work. 8% reduction in overall cache area. Wider applicability: can be applied to other caches; can be applied to other metadata (e.g., coherence).

  37. The Dirty-Block Index. Vivek Seshadri, Abhishek Bhowmick, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, Todd C. Mowry.

  38. Backup Slides.

  39. Cache Coherence. States: M (exclusive modified), O (shared modified), E (exclusive unmodified), S (shared unmodified), I (invalid). The slide also shows the D bit.

  40. Operation of a Cache with DBI. 1. Read access: look up tag store. 2. Writeback: update tag store; update DBI to indicate the block is dirty. 3. Cache eviction: check DBI; write back if block is dirty. 4. DBI eviction: write back all blocks marked dirty by the entry.
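
The four operations listed on this slide can be walked through with a toy model. This is a sketch only: the class `CacheWithDBI`, the LRU replacement of DBI entries, the unbounded tag store, and the small sizes are all assumptions chosen to keep the example short, not details taken from the slides.

```python
from collections import OrderedDict


class CacheWithDBI:
    """Toy cache whose dirty state lives only in a small DBI."""

    def __init__(self, dbi_entries=2, granularity=4):
        self.g = granularity
        self.capacity = dbi_entries
        self.tags = set()           # block addresses present in the cache
        self.dbi = OrderedDict()    # row -> set of dirty offsets (LRU order, assumed)
        self.dram_writes = []       # log of blocks written back to memory

    def read(self, block):
        # 1. Read access: only the tag store is consulted.
        return block in self.tags

    def writeback(self, block):
        # 2. Writeback: update the tag store, then mark the block dirty in the DBI.
        self.tags.add(block)
        row, off = divmod(block, self.g)
        if row not in self.dbi and len(self.dbi) >= self.capacity:
            self._evict_dbi_entry()
        self.dbi.setdefault(row, set()).add(off)
        self.dbi.move_to_end(row)

    def evict_block(self, block):
        # 3. Cache eviction: check the DBI and write back only if dirty.
        self.tags.discard(block)
        row, off = divmod(block, self.g)
        dirty = self.dbi.get(row, set())
        if off in dirty:
            self.dram_writes.append(block)
            dirty.discard(off)
            if not dirty:
                del self.dbi[row]

    def _evict_dbi_entry(self):
        # 4. DBI eviction: write back every block the entry marks dirty.
        # The blocks stay in the cache but are now clean.
        row, offsets = self.dbi.popitem(last=False)
        for off in sorted(offsets):
            self.dram_writes.append(row * self.g + off)


c = CacheWithDBI(dbi_entries=2, granularity=4)
for block in (0, 1, 4, 8):      # rows 0, 0, 1, 2: the last one evicts row 0's entry
    c.writeback(block)
c.evict_block(0)                # already clean after the DBI eviction
c.evict_block(4)                # still dirty, so it is written back
print(c.dram_writes)
```

The example shows the one behavior that differs from a conventional cache: evicting a DBI entry forces writebacks for blocks that remain cached.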

  41. DBI Design Parameters. DBI granularity (g): number of blocks tracked by each entry. DBI size (α): total number of blocks tracked by the DBI, represented as a fraction of the number of blocks in the cache. Example entry for row R as shown: 1 1 0 0 0 1 0 1 0.

  42. DBI Design Parameters – Example. 1MB cache with 64B blocks; α = ¼, g = 64. The cache tracks 16384 blocks, the DBI tracks 4096 blocks, each entry tracks 64 blocks, and the DBI has 64 entries.
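
The arithmetic behind this example can be checked directly. The function name `dbi_parameters` is a hypothetical helper; the inputs are the values stated on the slide (1MB cache, 64B blocks, α = ¼, g = 64).

```python
from fractions import Fraction


def dbi_parameters(cache_bytes, block_bytes, alpha, granularity):
    """Derive block and entry counts from the DBI size and granularity."""
    cache_blocks = cache_bytes // block_bytes      # blocks tracked by the cache
    dbi_blocks = int(cache_blocks * alpha)         # blocks tracked by the DBI
    dbi_entries = dbi_blocks // granularity        # entries, each tracking g blocks
    return cache_blocks, dbi_blocks, dbi_entries


params = dbi_parameters(1 << 20, 64, Fraction(1, 4), 64)
print(params)  # (16384, 4096, 64)
```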

  43. Effect on Writes and Tag Lookups. [Bar chart: Memory Writes, Write Row Hits, and Tag Lookups for Baseline, DAWB, DBI, +AWB, +CLB, and +Both, normalized to Baseline.]

  44. System Performance. [Bar chart: system performance of Baseline, DAWB, DBI, +AWB, +CLB, and +Both for 1-core, 2-core, 4-core, and 8-core systems.]
