Amoeba Cache: Adaptive Blocks for Memory Hierarchy Optimization


The Amoeba Cache uses adaptive, variable-granularity blocks to eliminate waste in the memory hierarchy by keeping unused words out of the cache. The slides examine the factors behind poor cache utilization (application-specific behaviour and cache geometry), describe the design (insert, lookup, partial miss, and hardware overheads), and report results on miss rate, bandwidth, energy, and execution time.


Uploaded on Sep 13, 2024



Presentation Transcript


  1. Amoeba-Cache: Adaptive Blocks for Eliminating Waste in the Memory Hierarchy. Snehasish Kumar, Arrvindh Shriraman, Eric Matthews, Lesley Shannon, Hongzhou Zhao, Sandhya Dwarkadas

  2. On-chip Storage Amoeba Cache : Adaptive blocks for Eliminating Waste in the Memory Hierarchy 2

  3. Fixed granularity cache. [Figure: Tag Array and Data Array]

  4. Cache data utilization. Utilization = fraction of words touched in a cache block at the time of eviction. [Figure: Tag Array and Data Array, showing tags, touched data, and untouched data]
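
The definition on this slide can be sketched in C as follows. This is only an illustration: the 8-word block (a 64B block of 8-byte words), the per-block `touched` bitmap, and the function names are assumptions, not something the slides specify.

```c
#include <assert.h>
#include <stdint.h>

#define WORDS_PER_BLOCK 8  /* assumed: 64B block of 8-byte words */

/* Count the set bits in an 8-bit "touched words" bitmap. */
static int popcount8(uint8_t bits) {
    int n = 0;
    while (bits) {
        n += bits & 1u;
        bits >>= 1;
    }
    return n;
}

/* Utilization of one block, measured when it is evicted. */
static double block_utilization(uint8_t touched) {
    return (double)popcount8(touched) / WORDS_PER_BLOCK;
}

/* Average utilization over n evicted blocks. */
static double average_utilization(const uint8_t *touched, int n) {
    assert(n > 0);
    int total = 0;
    for (int i = 0; i < n; i++)
        total += popcount8(touched[i]);
    return (double)total / (n * WORDS_PER_BLOCK);
}
```

A block evicted with bitmap `0x0F` has touched 4 of its 8 words, so `block_utilization(0x0F)` is 0.5.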

  5. Cache utilization (64K L1, 4 ways, 64B/block). [Bar chart: utilization, 0% to 100%, for apache, eclipse, firefox, canneal, x264, tpcc, lbm, mcf, jbb, h2]

  6. Block Distribution (64K, 64B/block). [Pie charts: share of blocks by # words touched (1-2, 3-4, 5-6, 7-8) for Apache, Firefox, Canneal, and Eclipse]

  7. Block Distribution for Canneal: 1M vs 64K (64B/block). [Pie charts: share of blocks by # words touched (1-2, 3-4, 5-6, 7-8)]

  8. Factors affecting cache utilization. Application-specific behaviour: inefficient data structure access patterns. Interaction with cache geometry: way conflicts reduce block lifetime and cause poor utilization.

  9. Application Specific Behaviour. struct TIE { long long X, Y, Z; long long V, H; long long data[3]; } Imperial[1024]; Access in a loop: for (int i=0; i<1024; i++) { Imperial[i].X = ...; Imperial[i].Y = ...; Imperial[i].Z = ...; Imperial[i].V = ...; } [Figure: Data Array holding X Y Z V H data[3]]
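
To make the waste in this example concrete, the sketch below counts how many words of each `struct TIE` the loop touches. It assumes 8-byte `long long`, a 64B block, and that each array element starts on a block boundary; the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct TIE {
    long long X, Y, Z;
    long long V, H;
    long long data[3];
};

enum { BLOCK_BYTES = 64, WORD_BYTES = 8 };  /* assumed cache parameters */

/* Words of one struct TIE written by the loop: X, Y, Z and V. */
static int words_touched_by_loop(void) {
    size_t first = offsetof(struct TIE, X) / WORD_BYTES;
    size_t last = offsetof(struct TIE, V) / WORD_BYTES;
    return (int)(last - first + 1);
}

/* Fraction of each 64B block that the loop actually uses. */
static double loop_utilization(void) {
    /* Holds when long long is 8 bytes: one struct fills one block. */
    assert(sizeof(struct TIE) == (size_t)BLOCK_BYTES);
    return (double)words_touched_by_loop() / (BLOCK_BYTES / WORD_BYTES);
}
```

Under these assumptions the loop touches 4 of the 8 words in every block, so half of each fetched block (H and data[3]) is never used.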

  10. Cache Geometry. Problem: lots of data map to the same set. [Figure: Data Array with 4 ways; blocks numbered 1 to 5]
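
The conflict problem can be illustrated with the conventional set-index calculation, using the 64K, 4-way, 64B/block geometry quoted on the utilization slide. The modulo indexing and the function names are assumptions for illustration; the slides do not state the indexing function.

```c
#include <assert.h>
#include <stdint.h>

enum {
    CACHE_BYTES = 64 * 1024,
    WAYS = 4,
    BLOCK_BYTES = 64,
    NUM_SETS = CACHE_BYTES / (WAYS * BLOCK_BYTES)  /* 256 sets */
};

/* Conventional indexing: block number modulo the number of sets. */
static unsigned set_index(uint64_t addr) {
    return (unsigned)((addr / BLOCK_BYTES) % NUM_SETS);
}

/* Count how many of the n addresses map to the same set as addrs[0]. */
static int count_same_set(const uint64_t *addrs, int n) {
    assert(n > 0);
    int same = 0;
    for (int i = 0; i < n; i++)
        if (set_index(addrs[i]) == set_index(addrs[0]))
            same++;
    return same;
}
```

Five blocks spaced 16 KB apart (256 sets x 64B) all land in one set, which is one more than the 4 ways can hold, so one of them is evicted early regardless of how many of its words were used.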

  11. Implications: (1) shrinks effective cache space; (2) increases miss rate; (3) wastes on-chip bandwidth; (4) increases on-chip cache energy consumption.

  12. Target Metrics for the Amoeba Cache: bandwidth, space utilisation, miss rate.

  13. Variable Granularity Blocks. How to support a variable # of blocks per set? How to support variable granularity for each block? [Figure: Tag Array and Data Array]

  14. Our Approach: Amoeba Cache. Unified SRAM Array.

  15. Amoeba Cache: Insert, Lookup, Partial Miss, Overheads.

  16. SRAM Array Bitmaps. Two bitmaps, Valid? and Tag?, hold one bit per word of the SRAM array (all zeros when empty). A block is a tag (1 word: Region Tag, Start, End) followed by data (1+ words).

  17. Tag: Regions. A memory region spans RMAX bytes. [Figure: 64-bit address split into Region Tag, Set Index, and Byte; the tag holds the Region Tag plus Start / End fields (labelled 3 and 3)]
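
One possible way to split an address into these fields is sketched below. Every constant here is an assumption chosen for illustration (RMAX of 64 bytes, 8-byte words, 256 sets, and a set index taken from the bits just above the region offset); the slides do not give these values, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed parameters, not taken from the slides. */
enum { WORD_BYTES = 8, RMAX_BYTES = 64, NUM_SETS = 256 };

struct addr_fields {
    uint64_t region_tag;  /* identifies the RMAX-byte region */
    unsigned set_index;   /* set that holds every block of the region */
    unsigned word;        /* word offset inside the region */
    unsigned byte;        /* byte offset inside the word */
};

static struct addr_fields split_address(uint64_t addr) {
    struct addr_fields f;
    uint64_t region = addr / RMAX_BYTES;  /* region number */
    f.byte = (unsigned)(addr % WORD_BYTES);
    f.word = (unsigned)((addr % RMAX_BYTES) / WORD_BYTES);
    f.set_index = (unsigned)(region % NUM_SETS);
    f.region_tag = region / NUM_SETS;
    assert(f.word < (unsigned)(RMAX_BYTES / WORD_BYTES));
    return f;
}
```

With 8 words per region, a word offset fits in 3 bits, which would be consistent with the two 3-bit-looking Start / End labels on the slide, though that reading is a guess.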

  18. Example. struct TIE { long long X, Y, Z; long long V, H; long long data[3]; } Imperial; The access Imperial.X = ...; misses. Invoke the Spatial Granularity Predictor (PC/Region based), then fetch the block Tag X Y Z V.

  19. Amoeba Cache Insert (8 words/set), step 1. On a miss, insert 4+1 words (Tag X Y Z V) into the SRAM array set. With Tag? = 00000000 and Valid? = 00000000, substring() searches for 00000 (five free words) and returns Pos: 0.

  20. Amoeba Cache Insert (8 words/set), steps 2 and 3. Update the bitmaps (Valid? goes from 00000000 to 11111000, Tag? from 00000000 to 10000000) and refill the SRAM array set with Tag X Y Z V.
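
The insert steps can be modelled in software as a search over the Valid? bitmap followed by two bitmap updates. This is a minimal sketch, not the hardware implementation: the most-significant-bit-is-word-0 convention matches the bit strings on the slides, while the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { WORDS_PER_SET = 8 };

struct set_bitmaps {
    uint8_t tag;    /* Tag? bitmap, most significant bit = word 0 */
    uint8_t valid;  /* Valid? bitmap, same bit order */
};

/* Mask covering len words starting at word pos. */
static uint8_t run_mask(int pos, int len) {
    assert(pos >= 0 && len > 0 && pos + len <= WORDS_PER_SET);
    return (uint8_t)(((1u << len) - 1u) << (WORDS_PER_SET - pos - len));
}

/* First position with len consecutive free words, or -1 if none. */
static int find_free_run(uint8_t valid, int len) {
    for (int pos = 0; pos + len <= WORDS_PER_SET; pos++)
        if ((valid & run_mask(pos, len)) == 0)
            return pos;
    return -1;
}

/* Reserve one tag word plus data_words data words; returns the tag position. */
static int insert_block(struct set_bitmaps *s, int data_words) {
    assert(data_words >= 1);
    int len = data_words + 1;
    int pos = find_free_run(s->valid, len);
    if (pos < 0)
        return -1;  /* no room: the caller would have to evict first */
    s->valid |= run_mask(pos, len);
    s->tag |= run_mask(pos, 1);
    return pos;
}
```

Starting from an empty set, `insert_block(&s, 4)` returns position 0 and leaves Valid? = 11111000 and Tag? = 10000000, the values shown on the slide.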

  21. Example. struct TIE { long long X, Y, Z; long long V, H; long long data[3]; } Imperial; With the block Tag X Y Z V cached, the access Imperial.Y = ...; is a lookup that is served with data from the cache.

  22. Amoeba Cache Lookup (8 words/set). (1) Read the SRAM array set (Tag X Y Z V) into the output buffer; the Tag? bitmap (10000000) drives 2x1 selectors that pick out the tag words. (2) Compare each tag with the address fields (Region Tag, Set Index, Word W): Hit? = (Region == Region Tag) and (Start <= W) and (End > W). (3) The word selector returns the requested word. The slide also marks the critical path.
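
The hit condition can be written out as a small software model. This is a sketch under assumptions: hardware would compare all tags in parallel rather than in a loop, End is treated as exclusive because the slide shows "End > W", and the struct layout and names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { WORDS_PER_SET = 8, WORDS_PER_REGION = 8 };  /* assumed sizes */

struct amoeba_tag {
    uint64_t region;  /* Region Tag */
    unsigned start;   /* first word of the region held by the block */
    unsigned end;     /* one past the last word held by the block */
};

struct amoeba_set {
    uint8_t tag_bits;                       /* Tag? bitmap, MSB = word 0 */
    struct amoeba_tag tags[WORDS_PER_SET];  /* used only where Tag? is set */
};

/* Hit? = Region == Region Tag, Start <= W, End > W. */
static bool tag_hit(const struct amoeba_tag *t, uint64_t region, unsigned w) {
    return t->region == region && t->start <= w && t->end > w;
}

/* Position of the matching tag word in the set, or -1 on a miss. */
static int lookup(const struct amoeba_set *s, uint64_t region, unsigned w) {
    assert(w < (unsigned)WORDS_PER_REGION);
    for (int pos = 0; pos < WORDS_PER_SET; pos++) {
        bool is_tag = (s->tag_bits >> (WORDS_PER_SET - 1 - pos)) & 1u;
        if (is_tag && tag_hit(&s->tags[pos], region, w))
            return pos;
    }
    return -1;
}
```

For a set holding one block that covers words 0 to 3 of a region, a request for word 1 hits at tag position 0, while a request for word 4 of the same region misses.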

  23. Partial Miss (step 1 of 2): Identify Sub-Blocks. (1) Fetch the new block (Tag X Y Z V) and compare it with the cached sub-blocks (Tag X Y and Tag V H). (2) Evict the overlapping sub-blocks into the MSHR.

  24. Partial Miss (step 2 of 2): Insert New Block. (3) Allocate 6 words (Tag X Y Z V H). (4) Patch the missing words, shown as ?, in the MSHR (X Y ? Z V H). (5) Handle the result as a miss and insert the new tag. A partial miss occurs in about 5 of 1000 accesses.
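
One reading of the two partial-miss steps is a range merge: find the cached sub-blocks that overlap the requested range, evict them, and allocate a single block covering all of them. The sketch below follows that reading only; the half-open word ranges, the assumption that cached blocks of a region never overlap each other, and all names are illustrative rather than taken from the slides.

```c
#include <assert.h>

struct word_range {
    unsigned start;  /* first word, inclusive */
    unsigned end;    /* last word, exclusive */
};

static int overlaps(struct word_range a, struct word_range b) {
    return a.start < b.end && b.start < a.end;
}

/* Marks each overlapping cached block in evict[] and returns the span
   that covers the request plus every evicted block. */
static struct word_range merge_overlaps(struct word_range req,
                                        const struct word_range *cached,
                                        int n, int *evict) {
    assert(req.start < req.end);
    struct word_range merged = req;
    for (int i = 0; i < n; i++) {
        evict[i] = overlaps(req, cached[i]);
        if (!evict[i])
            continue;
        if (cached[i].start < merged.start)
            merged.start = cached[i].start;
        if (cached[i].end > merged.end)
            merged.end = cached[i].end;
    }
    return merged;
}

/* Data words plus one word for the tag. */
static unsigned words_to_allocate(struct word_range r) {
    return (r.end - r.start) + 1u;
}
```

Numbering the fields X=0, Y=1, Z=2, V=3, H=4, cached blocks [0,2) and [3,5) both overlap a request for [0,4); the merged block is [0,5), which needs 5 data words plus a tag, matching the 6 words allocated on the slide.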

  25. Hardware Overheads. Metadata (Valid? and Tag? bitmaps for the SRAM array): 1 KB. Critical path: the Amoeba critical path adds +4% extra latency.

  26. Evaluation: parameters for latency and energy; workloads.

  27. Latency Parameters (cycles). 64K L1: 3; 1M LLC: 20; 300 (level unlabelled). Relative latency: Fixed Granularity 1, Amoeba Cache 1.04 (+4%).

  28. On-Chip Energy Parameters (pJ). 64K L1: Fixed Granularity 101, Amoeba Cache 105. 1M LLC: Fixed Granularity 230, Amoeba Cache 238. 7 / word.

  29. Workloads: 22 diverse workloads from PARSEC; SPEC-CPU 2000 & 2006; DaCapo (Java benchmarks); Apache, Firefox and PostgreSQL.

  30. Results

  31. % Improvement in L1 Miss-Rate. Reduces L1 and L2 miss rate by 18%. [Bar chart: y-axis 0% to 40%]

  32. % Improvement in L1 Miss-Bandwidth. Reduces on-chip bandwidth by 46% and off-chip bandwidth by 38%. [Bar chart: y-axis 0% to 75%]

  33. % Improvement in memory energy. Reduces energy by 11%. [Bar chart: y-axis 0% to 40%]

  34. % Improvement in execution time. Improves performance by 10%. [Bar chart: y-axis 0% to 20%, with one value labelled 21%]

  35. Results Summary. Amoeba-Cache: reduces cache pollution for applications with low cache utilization; improves performance for moderate cache utilization; maintains performance for high cache utilization workloads; saves energy for streaming applications by keeping out unused words.

  36. Additional Results. Lookup as an extra cache pipeline stage vs. throttling the CPU: for the extra pipeline stage, 8 of 22 applications show improvement. Spatial Granularity Predictor: indexing (address region is better for 18 of 22); training (evictions and first touch); table size (256 PC and 1024 Region).

  37. Additional Results. Multicore shared cache: reduces miss rate (avg 18%) and LLC miss bandwidth (16%-39%). Comparison against other designs: Fixed Granularity 2X, Sector Cache variants, Multi-$.

  38. Amoeba Cache. What? Enable variable granularity data caching. Why? Eliminate waste. How? Unify tag and data into a single SRAM array (afforded by recent technology trends). Where? Definitely at the L2, possibly at the L1.

  39. Frequently Asked Questions: (1) Multiple threads? (2) Compare against other designs. (3) Spatial Pattern Predictor. (4) Replacement Policy.

  40. Multicore Shared Cache
      Mix | Miss T1 | Miss T2 | Miss T3 | Miss T4 | BW (All)
      jbb x2, tpc-c x2 | 12.38% | 12.38% | 22.29% | 22.37% | 39.07%
      Firefox x2, x264 x2 | 3.82% | 3.61% | 2.44% | 0.43% | 15.71%
      cactus, fluid., omnet., sopl. | 1.01% | 1.86% | 22.38% | 0.59% | 18.62%
      canneal, astar, ferret, milc | 4.85% | 2.75% | 19.39% | 4.07% | 17.77%

  41. Comparison. Designs: Multi-$, Sector Variants, Amoeba Cache. Criteria: impact on miss rate; impact on bandwidth; low tag overhead; trade-off between data and tag space; dynamically resized blocks. Cell values as extracted: ~ No No No Yes No No ~ Yes Yes

  42. Comparison, Moderate Group, 64K. [Scatter plot: Bandwidth Ratio (0.4 to 1.0) against Miss Rate Ratio (1.0 to 1.6) for Fixed-2X, Sector (x: 2.9), Amoeba, Multi$-25, Multi$-50, and Sector-Pre]

  43. Spatial Pattern Predictor. Predictor History Table: Index (PC / Region) and Pattern, with example patterns 01011111 and 00011101. Inputs: PC, read address, critical word. What to do when there is no entry? Policy-Miss vs Policy-Bandwidth.

  44. Predictor Training. Add / update the entry on evict. [Figure: Data Array feeding the history table; Index (PC / Region) and Pattern, with example patterns 01011111 and 00011101]
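
A history-table predictor along these lines could be sketched as follows. The slides do not say how a pattern is turned into a block, so the choices here are assumptions: the table is direct-mapped with 1024 entries, the predicted block is the contiguous run of set bits around the critical word, and a missing entry falls back to fetching only the critical word. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { TABLE_ENTRIES = 1024, WORDS_PER_REGION = 8 };  /* assumed sizes */

struct predictor {
    uint8_t pattern[TABLE_ENTRIES];  /* touched-word bitmap, MSB = word 0 */
    uint8_t known[TABLE_ENTRIES];    /* 1 once the entry has been trained */
};

struct word_range {
    unsigned start;  /* first word, inclusive */
    unsigned end;    /* last word, exclusive */
};

/* key is a PC or a region number; direct-mapped here for simplicity. */
static unsigned table_index(uint64_t key) {
    return (unsigned)(key % TABLE_ENTRIES);
}

static int bit_at(uint8_t pattern, unsigned w) {
    return (pattern >> (WORDS_PER_REGION - 1 - w)) & 1;
}

/* Training: add or update the entry when a block is evicted. */
static void train_on_evict(struct predictor *p, uint64_t key, uint8_t touched) {
    unsigned i = table_index(key);
    p->pattern[i] = touched;
    p->known[i] = 1;
}

/* Prediction: grow a range around the critical word while the pattern
   says the neighbouring words were touched last time. */
static struct word_range predict(const struct predictor *p, uint64_t key,
                                 unsigned critical) {
    assert(critical < (unsigned)WORDS_PER_REGION);
    struct word_range r = { critical, critical + 1 };
    unsigned i = table_index(key);
    if (!p->known[i])
        return r;  /* assumed fallback: fetch only the critical word */
    uint8_t pat = p->pattern[i];
    while (r.start > 0 && bit_at(pat, r.start - 1))
        r.start--;
    while (r.end < (unsigned)WORDS_PER_REGION && bit_at(pat, r.end))
        r.end++;
    return r;
}
```

With the slide's pattern 01011111 stored for a key and word 4 as the critical word, this sketch predicts words 3 to 7; fetching the whole region when no entry exists would be the opposite fallback, trading bandwidth for fewer misses.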

  45. Predictor L1 Miss Rate (1 of 2). [Bar chart: MPKI (0 to 10) for h2, canneal, x264, tpc-c, eclipse, firefox; series: Aligned, Finite, Infinite, Finite+FT, History]

  46. Predictor L1 Miss Rate (2 of 2). [Bar chart: MPKI (0 to 140) for mcf, apache, lbm, jbb; series: Aligned, Finite, Infinite, Finite+FT, History]

  47. Predictor L1 Miss Bandwidth (1 of 2). [Bar chart: Bandwidth Rate (0 to 1800) for h2, canneal, firefox, x264, tpc-c, eclipse; series: Aligned, Finite, Infinite, Finite+FT, History]

  48. Predictor L1 Miss Bandwidth (2 of 2). [Bar chart: Bandwidth Rate (0 to 10000) for mcf, apache, lbm, jbb; series: Aligned, Finite, Infinite, Finite+FT, History]

  49. Predictor Summary. For the majority of applications: Region Predictor with a 1024-entry table (8 ways x 128 sets). The PC Predictor is good for 5 applications: apache, art, mcf, lbm and omnetpp.

  50. Pseudo LRU Replacement. Logically partition the set into N ways (Way 0, Way 1, ...). Pick a block at random from the way. Unset the T? (Tag) and V? (Valid) bits.
