Amoeba Cache: Adaptive Blocks for Memory Hierarchy Optimization
The Amoeba Cache stores variable-granularity blocks in a unified tag and data SRAM array, so that words an application never touches do not occupy cache space or consume bandwidth. The slides examine the causes of poor cache utilization (application-specific behaviour and cache geometry), walk through the design (insert, lookup, partial miss, spatial predictor), and report miss-rate, bandwidth, energy, and performance results across 22 workloads.
Presentation Transcript
Amoeba-Cache: Adaptive Blocks for Eliminating Waste in the Memory Hierarchy. Snehasish Kumar, Arrvindh Shriraman, Eric Matthews, Lesley Shannon, Hongzhou Zhao, Sandhya Dwarkadas
On-chip Storage Amoeba Cache : Adaptive blocks for Eliminating Waste in the Memory Hierarchy 2
Fixed granularity cache: Tag Array, Data Array
Cache data utilization: Utilization = fraction of words touched in a cache block at the time of eviction. [figure: Tag Array (tags) and Data Array (data), with untouched data marked]
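The utilization metric defined on this slide can be sketched in a few lines of C. This is only an illustration under stated assumptions: 64-byte blocks of 8-byte words, and a hypothetical per-block "touched" bitmap recorded at eviction time. The names `count_touched` and `eviction_utilization` are invented here and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed geometry: 64-byte blocks of 8-byte words, so 8 words per block. */
enum { WORDS_PER_BLOCK = 8 };

/* Count the set bits in a per-block "touched" bitmap (one bit per word). */
static unsigned count_touched(uint8_t touched)
{
    unsigned n = 0;
    while (touched) {
        touched &= (uint8_t)(touched - 1); /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Utilization over a trace of evictions: touched words / total words held. */
double eviction_utilization(const uint8_t *touched, size_t n_evictions)
{
    assert(touched != NULL && n_evictions > 0);
    size_t used = 0;
    for (size_t i = 0; i < n_evictions; i++)
        used += count_touched(touched[i]);
    return (double)used / (double)(n_evictions * WORDS_PER_BLOCK);
}
```

For instance, two evicted blocks with 4 and 1 touched words give 5 / 16, or about 31% utilization.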
Cache utilization (64K L1, 4 ways, 64B/block) [figure: bar chart, y-axis 0% to 100%; workloads: apache, eclipse, firefox, canneal, x264, tpc-c, lbm, mcf, jbb, h2]
Block Distribution (64K, 64B/block) [figure: pie charts of # words touched per block (1-2, 3-4, 5-6, 7-8) for Apache, Firefox, Canneal, and Eclipse]
Block Distribution, Canneal at 1M 64B/block vs. 64K 64B/block [figure: pie charts of # words touched per block (1-2, 3-4, 5-6, 7-8)]
Factors affecting cache utilization: (1) Application-specific behaviour: inefficient data structure access patterns. (2) Interaction with cache geometry: way conflicts reduce block lifetime and cause poor utilization.
Application Specific Behaviour:
    struct TIE {
        long long X, Y, Z;
        long long V, H;
        long long data[3];
    } Imperial[1024];
Access in a loop:
    for (int i=0; i<1024; i++) {
        Imperial[i].X = ...;
        Imperial[i].Y = ...;
        Imperial[i].Z = ...;
        Imperial[i].V = ...;
    }
Data Array: X Y Z V H Data[3]
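A small C sketch can quantify the waste in this example. It assumes `long long` is 8 bytes (true on common platforms but not guaranteed by the C standard), so that one `struct TIE` fills exactly one 64-byte block when the array is block-aligned. The helper names `tie_bytes_touched` and `tie_utilization` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Struct from the slide: eight long long fields in total. */
struct TIE {
    long long X, Y, Z;
    long long V, H;
    long long data[3];
};

/* Bytes of one struct TIE that the loop writes: X, Y, Z and V only. */
size_t tie_bytes_touched(void)
{
    return 4 * sizeof(long long);
}

/* Fraction of each struct (and of each block, if one struct fills a block)
 * that the loop actually uses. */
double tie_utilization(void)
{
    /* All members share one type, so no padding is expected. */
    assert(sizeof(struct TIE) == 8 * sizeof(long long));
    return (double)tie_bytes_touched() / (double)sizeof(struct TIE);
}
```

Under these assumptions the loop uses half of every block it brings in: H and data[3] are fetched but never touched.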
Cache Geometry: Problem: lots of data map to the same set. [figure: Data Array with 4 ways and blocks numbered 1 to 5]
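The set-conflict problem can be illustrated with a short C sketch. It assumes the 64K, 4-way, 64B/block geometry quoted on the utilization slides and a plain modulo set index, which is a common textbook mapping and not necessarily the one used in the evaluated hardware. `set_index` and `blocks_in_same_set` are illustrative names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry: 64K cache, 4 ways, 64-byte blocks. */
enum {
    CACHE_BYTES = 64 * 1024,
    WAYS        = 4,
    BLOCK_BYTES = 64,
    NUM_SETS    = CACHE_BYTES / (WAYS * BLOCK_BYTES) /* 256 sets */
};

/* Plain modulo indexing: block number modulo the number of sets. */
unsigned set_index(uint64_t addr)
{
    return (unsigned)((addr / BLOCK_BYTES) % NUM_SETS);
}

/* Count how many of the n blocks at base, base+stride, ... share base's set. */
unsigned blocks_in_same_set(uint64_t base, uint64_t stride, unsigned n)
{
    assert(n > 0);
    unsigned same = 0;
    for (unsigned i = 0; i < n; i++)
        if (set_index(base + (uint64_t)i * stride) == set_index(base))
            same++;
    return same;
}
```

With a 16 KB stride, five blocks all land in one set, which is more than the 4 ways can hold, so one block is evicted early regardless of how much of it has been used.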
Implications: 1. Shrinks effective cache space. 2. Increases miss rate. 3. Wastes on-chip bandwidth. 4. Increases on-chip cache energy consumption.
Target Metrics for Amoeba Cache: bandwidth, space utilisation, miss rate
Variable Granularity Blocks (Tag Array, Data Array): How to support a variable number of blocks per set? How to support variable granularity for each block?
Our Approach: Amoeba Cache, a unified SRAM array
Amoeba Cache: Insert, Lookup, Partial Miss, Overheads
SRAM Array: two bitmaps, Valid? and Tag?, describe the words of the SRAM array (all bits 0 when empty). Each block is a Tag (Region Tag, Start, End; 1 word) followed by Data (1+ words).
Tag - Regions: memory is divided into regions of RMAX bytes. The 64-bit address splits into Region Tag, Set Index, and Byte; a tag holds the Region Tag plus Start / End fields (3 bits each).
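As a rough illustration of the address split, the sketch below decomposes an address into region tag, set index, and word offset. The slide does not give concrete values, so RMAX = 64 bytes, 8-byte words (which matches the 3-bit Start / End fields), and 256 sets are all assumptions made here, as are the names `amoeba_addr` and `split_address`.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed parameters (not stated on the slide). */
enum {
    WORD_BYTES       = 8,
    RMAX_BYTES       = 64,
    NUM_SETS         = 256,
    WORDS_PER_REGION = RMAX_BYTES / WORD_BYTES /* 8, so 3-bit Start/End */
};

struct amoeba_addr {
    uint64_t region_tag; /* identifies the RMAX-byte region within a set */
    unsigned set_index;  /* which set the region maps to */
    unsigned word;       /* word offset inside the region, 0..7 */
};

struct amoeba_addr split_address(uint64_t addr)
{
    struct amoeba_addr a;
    uint64_t region = addr / RMAX_BYTES;          /* region number */
    a.word       = (unsigned)((addr % RMAX_BYTES) / WORD_BYTES);
    a.set_index  = (unsigned)(region % NUM_SETS);
    a.region_tag = region / NUM_SETS;
    assert(a.word < WORDS_PER_REGION);
    return a;
}
```

The word offset is what a lookup later compares against a block's Start and End fields.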
Example: Imperial.X = ...; with struct TIE { long long X, Y, Z; long long V, H; long long data[3]; } Imperial; The access misses, the (PC/Region based) Spatial Granularity Predictor is invoked, and the cache fetches Tag X Y Z V.
Amoeba Cache Insert (8 words/set): miss on Tag X Y Z V, so insert 4+1 words. Initially Tag? = 00000000 and Valid? = 00000000. (1) substring() searches for 00000 (five free words) in the SRAM array / set and returns Pos: 0.
Amoeba Cache Insert (8 words/set), continued: (2) Valid? changes from 00000000 to 11111000. (3) Tag? changes from 00000000 to 10000000. The refill writes Tag X Y Z V into the SRAM array / set.
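The insert steps can be approximated in software as a first-fit search over the Valid? bitmap. This is a sketch under assumptions: an 8-word set, word 0 stored in the most significant bit so that the values print as on the slide, and a sequential loop standing in for whatever the hardware substring() logic does. `set_bitmaps`, `run_mask`, and `amoeba_insert` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

enum { WORDS_PER_SET = 8 };

/* One bit per word; word 0 is the most significant bit (assumed ordering). */
struct set_bitmaps {
    uint8_t valid; /* Valid?: word holds a tag or data */
    uint8_t tag;   /* Tag?:   word holds a tag */
};

/* Mask with len consecutive bits starting at word pos. */
static uint8_t run_mask(int pos, int len)
{
    return (uint8_t)(((0xFFu << (8 - len)) & 0xFFu) >> pos);
}

/* Insert a block of data_words words plus one tag word.
 * Returns the tag position, or -1 if no free run exists (caller must evict). */
int amoeba_insert(struct set_bitmaps *s, int data_words)
{
    int len = data_words + 1; /* one extra word for the tag */
    assert(data_words >= 1 && len <= WORDS_PER_SET);
    for (int pos = 0; pos + len <= WORDS_PER_SET; pos++) {
        uint8_t m = run_mask(pos, len);
        if ((s->valid & m) == 0) {              /* run is entirely free */
            s->valid |= m;                      /* step 2: mark words valid */
            s->tag |= (uint8_t)(0x80u >> pos);  /* step 3: mark the tag word */
            return pos;
        }
    }
    return -1;
}
```

Inserting 4+1 words into an empty set returns position 0 and leaves Valid? = 11111000 and Tag? = 10000000, matching the values shown on the slide.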
Example: Imperial.Y = ...; with struct TIE { long long X, Y, Z; long long V, H; long long data[3]; } Imperial; The lookup finds block Tag X Y Z V and the data is served from the cache.
Amoeba Cache Lookup (8 words/set): (1) Read the SRAM array / set (Tag X Y Z V) and its Tag? bitmap (10000000) into the output buffer. (2) Critical path: 2x1 multiplexers select the tag words, and each tag is compared with the address fields (Region Tag, Set Index, Word W). Hit? requires Region == Region Tag and W within the block's range (checked against Start, and End > W). (3) The word selector outputs the requested word.
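A sequential software model of the lookup may help make the hit condition concrete. Everything specific here is an assumption for illustration: the tag word encoding (region tag shifted left by 6, then 3-bit Start and End), an inclusive End, and word 0 in the most significant bit of the Tag? bitmap. The hardware compares all tags in parallel; the loop below only models the outcome.

```c
#include <assert.h>
#include <stdint.h>

enum { WORDS_PER_SET = 8 };

struct amoeba_set {
    uint8_t  tag_bits;             /* Tag? bitmap, word 0 in the MSB */
    uint64_t words[WORDS_PER_SET]; /* tags and data share one array */
};

/* Hypothetical tag encoding: region tag, then 3-bit Start and End. */
uint64_t pack_tag(uint64_t region, unsigned start, unsigned end)
{
    return (region << 6) | ((uint64_t)start << 3) | (uint64_t)end;
}

/* Returns the index of the data word for (region, w), or -1 on a miss. */
int amoeba_lookup(const struct amoeba_set *s, uint64_t region, unsigned w)
{
    assert(w < 8);
    for (int i = 0; i < WORDS_PER_SET; i++) {
        if (!(s->tag_bits & (0x80u >> i)))
            continue;                         /* not a tag word */
        uint64_t t = s->words[i];
        unsigned start = (unsigned)(t >> 3) & 7u;
        unsigned end   = (unsigned)t & 7u;
        if ((t >> 6) == region && start <= w && w <= end)
            return i + 1 + (int)(w - start);  /* data follows its tag */
    }
    return -1;
}
```

For a block holding words 0 to 3 of a region (X Y Z V in the slide's example), a request for word 1 resolves to the second data word after the tag, while a request for word 4 misses even though the region tag matches.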
Partial Miss, step 1 of 2 (identify sub-blocks): (1) Fetch the new block and form the new tags. (2) Evict the overlapping sub-blocks to the MSHR. [figure: Tag X Y Z V; Tag X Y; Tag V H]
Partial Miss, step 2 of 2 (insert new block): (3) Allocate 6 words. (4) Patch the missing words (shown as ?) in the MSHR. (5) Miss. A partial miss occurs in about 5 of 1000 accesses. [figure: MSHR holding X Y ? Z V H]
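The bookkeeping behind a partial miss can be sketched as bitmap arithmetic: given the range the new block should cover and the ranges of the overlapping resident sub-blocks, work out which words still have to be patched in. The ranges, the inclusive End, the bit ordering (bit i stands for word i), and the names `word_range`, `missing_words`, and `words_to_allocate` are all assumptions for illustration; the slides do not specify this logic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive word range inside one region (assumed representation). */
struct word_range { unsigned start, end; };

/* Bitmap with bits start..end set; bit i stands for word i. */
static uint8_t range_mask(struct word_range r)
{
    assert(r.start <= r.end && r.end < 8);
    unsigned len = r.end - r.start + 1;
    return (uint8_t)(((1u << len) - 1u) << r.start);
}

/* Words of `want` that no resident sub-block already holds. */
uint8_t missing_words(struct word_range want,
                      const struct word_range *resident, size_t n)
{
    uint8_t have = 0;
    for (size_t i = 0; i < n; i++)
        have |= range_mask(resident[i]);
    return (uint8_t)(range_mask(want) & ~have);
}

/* Space needed for the merged block: its data words plus one tag word. */
unsigned words_to_allocate(struct word_range want)
{
    return (want.end - want.start + 1) + 1;
}
```

As a hypothetical case, with resident sub-blocks covering words 0-1 and 3-4 and a new block covering words 0-4, only word 2 has to be patched, and the merged block needs 5 data words plus 1 tag word.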
Hardware Overheads: metadata (the Valid? and Tag? bitmaps alongside the SRAM array): 1 KB extra. Amoeba critical path: latency +4%.
Evaluation: parameters for latency and energy; workloads
Latency Parameters (cycles): Fixed Granularity 1 vs. Amoeba Cache 1.04 (latency +4%). [figure: CPU, 64K L1 (3), 1M LLC (20), 300]
On-Chip Energy Parameters (pJ): 64K L1: Fixed Granularity 101, Amoeba Cache 105. 1M LLC: 230, 238. 7 / word.
Workloads: 22 diverse workloads from PARSEC; SPEC-CPU 2000 & 2006; DaCapo (Java benchmarks); Apache, Firefox and PostgreSQL
Results
% Improvement in L1 Miss-Rate: reduces L1 and L2 miss rate by 18%. [figure: bar chart, y-axis 0% to 40%]
% Improvement in L1 Miss-Bandwidth: reduces on-chip bandwidth by 46% and off-chip bandwidth by 38%. [figure: bar chart, y-axis 0% to 75%]
% Improvement in memory energy: reduces energy by 11%. [figure: bar chart, y-axis 0% to 40%]
% Improvement in execution time: improves performance by 10%. [figure: bar chart, y-axis 0% to 20%, with one bar labelled 21%]
Results Summary: Amoeba-Cache (1) reduces cache pollution for applications with low cache utilization; (2) improves performance for workloads with moderate cache utilization; (3) maintains performance for workloads with high cache utilization; (4) saves energy for streaming applications by keeping out unused words.
Additional Results: Lookup as an extra cache pipeline stage vs. throttling the CPU: with the extra pipeline stage, 8 of 22 applications show improvement. Spatial Granularity Predictor: indexing (address region is better for 18 of 22), training (evictions and first touch), table size (256 PC and 1024 Region).
Additional Results: Multicore shared cache: reduces miss rate (avg 18%) and LLC miss bandwidth (16%-39%). Comparison against other designs: Fixed Granularity 2X, Sector Cache variants, Multi-$.
Amoeba Cache: What? Enable variable granularity data caching. Why? Eliminate waste. How? Unify tag and data into a single SRAM array (afforded by recent technology trends). Where? Definitely at the L2, possibly at the L1.
Frequently Asked Questions: 1. Multiple threads? 2. Compare against other designs 3. Spatial Pattern Predictor 4. Replacement Policy
Multicore Shared Cache
Mix | Miss T1 | Miss T2 | Miss T3 | Miss T4 | Miss BW (All)
jbb x2, tpc-c x2 | 12.38% | 12.38% | 22.29% | 22.37% | 39.07%
Firefox x2, x264 x2 | 3.82% | 3.61% | 2.44% | 0.43% | 15.71%
cactus, fluid., omnet., sopl. | 1.01% | 1.86% | 22.38% | 0.59% | 18.62%
canneal, astar, ferret, milc | 4.85% | 2.75% | 19.39% | 4.07% | 17.77%
Comparison. Designs: Multi-$, Sector Variants, Amoeba Cache. Criteria: Impact on Miss-Rate; Impact on Bandwidth; Low tag overhead; Tradeoff data and tag space; Dynamically resize blocks. Cell values as extracted: ~ No No No Yes No No ~ Yes Yes
Comparison (Moderate Group, 64K) [figure: scatter plot, x-axis Miss Rate Ratio (1.0 to 1.6), y-axis Bandwidth Ratio (0.4 to 1.0); points: Fixed-2X, Sector (x: 2.9), Amoeba, Multi$-25, Sector-Pre, Multi$-50]
Spatial Pattern Predictor: the Predictor History Table maps an Index (PC / Region) to a Pattern (e.g. 01011111, 00011101). (1) On a miss, the PC or read address indexes the table and the pattern (00011101 in the example) is combined with the critical word. (2) What to do when there is no entry? Policy-Miss vs Policy-Bandwidth.
Predictor Training: add / update the entry when a block is evicted from the Data Array. Table columns: Index (PC / Region), Pattern (e.g. 01011111, 00011101).
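The predictor's train-on-evict and predict-on-miss behaviour can be sketched as a small table of touched-word patterns. This is a simplified model under assumptions: a direct-mapped, untagged 1024-entry table (the slides mention an 8-way organisation elsewhere), bit i of a pattern standing for word i, a prediction formed by growing a contiguous range around the critical word, and a fallback of fetching only the critical word when there is no entry. None of these details are confirmed by the slides, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { TABLE_ENTRIES = 1024, WORDS_PER_REGION = 8 };

struct word_range { unsigned start, end; }; /* inclusive */

struct predictor {
    uint8_t pattern[TABLE_ENTRIES]; /* touched-word bitmap seen at eviction */
    uint8_t valid[TABLE_ENTRIES];   /* 1 if the entry has been trained */
};

/* key is a PC or a region address, depending on the indexing scheme. */
static unsigned table_index(uint64_t key)
{
    return (unsigned)(key % TABLE_ENTRIES);
}

/* Training: called when a block is evicted, with its touched-word bitmap. */
void predictor_train(struct predictor *p, uint64_t key, uint8_t touched)
{
    unsigned i = table_index(key);
    p->pattern[i] = touched;
    p->valid[i] = 1;
}

/* Prediction: contiguous run of predicted words around the critical word. */
struct word_range predictor_predict(const struct predictor *p,
                                    uint64_t key, unsigned critical)
{
    assert(critical < (unsigned)WORDS_PER_REGION);
    struct word_range r = { critical, critical };
    unsigned i = table_index(key);
    if (!p->valid[i])
        return r; /* assumed fallback: fetch only the critical word */
    uint8_t pat = p->pattern[i];
    while (r.start > 0 && (pat & (1u << (r.start - 1))))
        r.start--;
    while (r.end + 1 < (unsigned)WORDS_PER_REGION && (pat & (1u << (r.end + 1))))
        r.end++;
    return r;
}
```

The predicted range would then become the Start / End of the block to fetch. A real design would also have to choose between the miss-oriented and bandwidth-oriented policies the slides mention for untrained entries.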
Predictor L1 Miss Rate (1 of 2) [figure: bar chart, y-axis MPKI (0 to 10); legend: Aligned, Finite, Infinite, Finite+FT History; workloads: h2, canneal, x264, tpc-c, eclipse, firefox]
Predictor L1 Miss Rate (2 of 2) [figure: bar chart, y-axis MPKI (0 to 140); legend: Aligned, Finite, Infinite, Finite+FT History; workloads: mcf, apache, lbm, jbb]
Predictor L1 Miss Bandwidth (1 of 2) [figure: bar chart, y-axis Bandwidth Rate (0 to 1800); legend: Aligned, Finite, Infinite, Finite+FT History; workloads: h2, canneal, firefox, x264, tpc-c, eclipse]
Predictor L1 Miss Bandwidth (2 of 2) [figure: bar chart, y-axis Bandwidth Rate (0 to 10000); legend: Aligned, Finite, Infinite, Finite+FT History; workloads: mcf, apache, lbm, jbb]
Predictor Summary: for the majority of applications, a Region Predictor with a 1024-entry table (8 ways x 128 sets) works best. The PC Predictor is good for 5 applications: apache, art, mcf, lbm and omnetpp.
Pseudo LRU Replacement: logically partition the set into N ways (Way 0, Way 1, ...); pick a block at random from the chosen way; unset its T? (Tag) and V? (Valid) bits.