Enhancing TLB Prefetching for Address Translation Performance


This study explores methods to improve TLB prefetching efficiency by leveraging page table locality, presenting two novel approaches: Sampling-based Free TLB Prefetching (SBFP) and Agile TLB Prefetcher (ATP). These techniques optimize TLB prefetching without disrupting the virtual memory subsystem, aiming to reduce the address translation overheads caused by data accesses. The study emphasizes the need for TLB prefetchers that adapt to diverse workload characteristics and that avoid the useless prefetches triggered by irregular TLB miss behavior.


Uploaded on Apr 02, 2024 | 1 Views



Presentation Transcript


  1. Exploiting Page Table Locality for Agile TLB Prefetching
     Georgios Vavouliotis (1,2), Lluc Alvarez (1,2), Vasileios Karakostas (3), Konstantinos Nikas (3), Nectarios Koziris (3), Daniel A. Jiménez (4), Marc Casas (1,2)
     (1) Barcelona Supercomputing Center, (2) Universitat Politècnica de Catalunya, (3) National Technical University of Athens, (4) Texas A&M University
     Contact: georgios.vavouliotis@bsc.es

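  2. Address Translation Performance Bottleneck
     [Diagram: the core issues a load with a virtual address (VA) to the TLB. On a TLB hit, the VA is translated to a physical address (PA) that goes to the cache hierarchy and main memory. On a TLB miss, a demand page walk traverses the page table (PML4E, PDPE, PDE, PTE) through the cache hierarchy and main memory.]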
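The demand page walk on this slide visits one entry per page table level (PML4E, PDPE, PDE, PTE). As a minimal sketch of how a walker picks those entries, the snippet below splits a virtual address into its four 9-bit table indices and a 12-bit page offset. It assumes standard x86-64 four-level paging with 4 KB pages and 48-bit virtual addresses, which the slide does not state; the function name `split_virtual_address` is hypothetical.

```python
# Assumptions (not stated on the slide): x86-64 four-level paging,
# 4 KB pages, 48-bit virtual addresses.
INDEX_BITS = 9          # each page table level is indexed by 9 bits
PAGE_OFFSET_BITS = 12   # 4 KB page offset
LEVELS = ("PML4E", "PDPE", "PDE", "PTE")  # walk order, root to leaf


def split_virtual_address(va):
    """Return ({level: table index}, page offset) for a virtual address."""
    offset = va & ((1 << PAGE_OFFSET_BITS) - 1)
    indices = {}
    for depth, level in enumerate(LEVELS):
        # The root level uses the highest bits (47..39), the leaf bits 20..12.
        shift = PAGE_OFFSET_BITS + INDEX_BITS * (len(LEVELS) - 1 - depth)
        indices[level] = (va >> shift) & ((1 << INDEX_BITS) - 1)
    return indices, offset


if __name__ == "__main__":
    example_va = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0x5
    print(split_virtual_address(example_va))
```

Each index selects one 8-byte entry in the corresponding table, so a full walk without page structure cache hits costs four memory references.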

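  3. Address Translation Performance Bottleneck
     [Diagram: the same system with the TLB hierarchy expanded into the L1 iTLB, L1 dTLB, and L2 TLB, plus the page structure caches (PSC) that serve the upper page table levels (PML4E, PDPE, PDE) ahead of the PTE access.]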

  4. Executive Summary
     Problem: address translation overheads due to data accesses.
     Our approach: TLB prefetching
     - Operates at the microarchitectural level
     - Relies on the memory access patterns of the application
     - Does not disrupt the virtual memory subsystem
     Contributions:
     - Sampling-based Free TLB Prefetching (SBFP): exploits page table locality to enhance TLB prefetching
     - Agile TLB Prefetcher (ATP): a novel composite TLB prefetching scheme

  5. Sampling-based Free TLB Prefetching (SBFP)
     - On x86-64 architectures the PTE size is 8 bytes and the cache line size is 64 bytes, so the cache line fetched at the end of a page walk holds 8 PTEs.
     - Example: the cache line holding the PTE of virtual page 0xA3 also holds the PTEs of 0xA0, 0xA1, 0xA2, 0xA4, 0xA5, 0xA6, and 0xA7, so 7 PTEs can be prefetched for free.
     - Each free PTE is identified by its free distance from the requested PTE. Free distances range from -7 to +7; for 0xA3 they are -3, -2, -1, +1, +2, +3, and +4.
     - The Free Distance Table (FDT) holds one saturating counter per free distance (C-7 ... C-1, C+1 ... C+7) and compares each counter against a threshold Th.
     - Free PTEs whose counter is above the threshold (>Th) go to the Prefetch Queue; those below it (<Th) go to the Sampler.
     [Diagram: Prefetch Queue entries hold a virtual page, a physical page, and a free distance; Sampler entries hold a virtual page and a free distance.]
     SBFP can be combined with any TLB prefetching scheme to exploit page table locality for both demand and prefetch page walks.
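The free-prefetching idea on this slide can be sketched in a few lines. The snippet below computes which PTEs share a cache line with a requested PTE, then routes each free PTE to the Prefetch Queue or the Sampler by comparing a per-distance saturating counter against a threshold. The 8-byte PTE and 64-byte line sizes come from the slide. The counter maximum, the threshold value, the rule that a hit increments the counter, and all class and function names are assumptions made for illustration, not the authors' design.

```python
PTE_SIZE = 8            # bytes per x86-64 PTE (from the slide)
CACHE_LINE_SIZE = 64    # bytes per cache line (from the slide)
PTES_PER_LINE = CACHE_LINE_SIZE // PTE_SIZE  # 8 PTEs per line


def free_neighbors(vpn):
    """Return (neighbor_vpn, free_distance) for PTEs in the same cache line."""
    base = vpn - (vpn % PTES_PER_LINE)  # first virtual page mapped by the line
    return [(n, n - vpn) for n in range(base, base + PTES_PER_LINE) if n != vpn]


class FreeDistanceTable:
    """One saturating counter per free distance in -7..+7 (0 excluded)."""

    def __init__(self, threshold=2, counter_max=7):  # assumed values
        self.threshold = threshold
        self.counter_max = counter_max
        self.counters = {d: 0 for d in range(-7, 8) if d != 0}

    def record_hit(self, distance):
        # Assumed update rule: a useful free PTE bumps its distance counter.
        current = self.counters[distance]
        self.counters[distance] = min(current + 1, self.counter_max)

    def route(self, vpn):
        """Split the free PTEs of vpn into (prefetch_queue, sampler) lists."""
        to_prefetch_queue, to_sampler = [], []
        for neighbor, distance in free_neighbors(vpn):
            if self.counters[distance] > self.threshold:
                to_prefetch_queue.append(neighbor)
            else:
                to_sampler.append(neighbor)
        return to_prefetch_queue, to_sampler


if __name__ == "__main__":
    fdt = FreeDistanceTable(threshold=1)
    fdt.record_hit(1)
    fdt.record_hit(1)
    print([(hex(n), d) for n, d in free_neighbors(0xA3)])
    print(fdt.route(0xA3))
```

Keeping low-confidence free PTEs in the Sampler rather than discarding them lets a distance earn its way into the Prefetch Queue later without polluting the queue in the meantime.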

  6. Agile TLB Prefetcher (ATP)
     Analysis findings:
     - No state-of-the-art TLB prefetcher performs best across all workloads
     - Different workloads correlate well with different features
     - When the TLB miss behaviour is irregular, prior TLB prefetchers issue useless prefetches
     ATP overview:
     - Combines three low-cost TLB prefetchers: STP, H2P, and MASP
     - Uses adaptive selection and throttling mechanisms
     [Diagram: ATP sits between the TLB and the TLB Prefetch Queue. Each of STP, H2P, and MASP has a Fake Prefetch Queue (FPQ). Counters C0, C1, and C2 drive the choice between disabling prefetching and selecting STP, H2P, or MASP.]
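A simplified sketch of adaptive selection and throttling with fake prefetch queues follows. Every constituent prefetcher records the pages it would have prefetched in its own fake queue; a later TLB miss that hits a fake queue rewards that prefetcher, and prefetching is disabled when no prefetcher has earned any credit. This is not the C0/C1/C2 arrangement of the slide: it uses one saturating score per prefetcher, and the toy predictors, queue capacity, score range, and all names are assumptions for illustration only.

```python
from collections import deque


class FakePrefetchQueue:
    """Remembers pages a prefetcher would have fetched, without fetching them."""

    def __init__(self, capacity=16):  # capacity is an assumed value
        self.entries = deque(maxlen=capacity)

    def insert(self, vpn):
        if vpn not in self.entries:
            self.entries.append(vpn)

    def hit(self, vpn):
        return vpn in self.entries


class CompositeSelector:
    """Pick the constituent prefetcher whose fake queue has been most useful."""

    def __init__(self, predictors, score_max=7):
        self.predictors = predictors  # name -> function(vpn) -> list of vpns
        self.fpqs = {name: FakePrefetchQueue() for name in predictors}
        self.scores = {name: 0 for name in predictors}
        self.score_max = score_max

    def on_tlb_miss(self, vpn):
        # 1. Train: reward prefetchers that would have covered this miss.
        for name, fpq in self.fpqs.items():
            if fpq.hit(vpn):
                self.scores[name] = min(self.scores[name] + 1, self.score_max)
            else:
                self.scores[name] = max(self.scores[name] - 1, 0)
        # 2. Every prefetcher records its fake prefetches for future training.
        candidates = {}
        for name, predict in self.predictors.items():
            candidates[name] = predict(vpn)
            for page in candidates[name]:
                self.fpqs[name].insert(page)
        # 3. Select the best scorer; throttle when nobody has any credit.
        best = max(self.scores, key=self.scores.get)
        if self.scores[best] == 0:
            return None, []
        return best, candidates[best]


if __name__ == "__main__":
    # Toy stand-ins for the real constituent prefetchers.
    selector = CompositeSelector({
        "next": lambda vpn: [vpn + 1],
        "stride4": lambda vpn: [vpn + 4],
    })
    for miss in (10, 11, 12, 100, 500):
        print(miss, selector.on_tlb_miss(miss))
```

In the usage example the sequential misses promote the `next` predictor, while the irregular misses at 100 and 500 drain its score until prefetching is turned off.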

  7. Simulation Infrastructure & Workloads
     Simulator: ChampSim (https://github.com/ChampSim/ChampSim), a trace-driven multi-core out-of-order simulator, with an x86 page table walker.
     Component: Parameters
     L1 I-TLB: 64-entry, 8-way
     L1 D-TLB: 64-entry, 8-way
     L2 TLB: 1536-entry, 12-way
     Page Structure Caches: 3-level split PSC; PML4: 2-entry, PDP: 4-entry, PD: 32-entry
     Prefetch Queue: 64-entry, fully associative
     Sampler: 64-entry, fully associative
     L1 iCache: 32KB, 8-way
     L1 dCache: 32KB, 8-way
     L2 Cache: 256KB, 8-way
     LLC: 2MB/core, 16-way
     DRAM: 4GB, tRP=tRCD=tCAS=11
     Workloads:
     - SPEC CPU 2006 benchmark suite
     - SPEC CPU 2017 benchmark suite
     - Industrial workloads provided by Qualcomm for Championship Value Prediction (CVP-1)
     - Big Data workloads: GAP benchmark suite, XSBench
     Methodology: SimPoint; selection criterion: L2-TLB MPKI > 1
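A few quantities follow directly from the parameters on this slide. The sketch below derives the number of sets in each TLB, the memory covered by the L2 TLB, and the MPKI metric used as the workload selection criterion. The 4 KB page size is an assumption (the slide does not state it), the miss and instruction counts in the usage example are made up, and the helper names are hypothetical.

```python
def num_sets(entries, ways):
    """Sets in a set-associative structure: entries divided by ways."""
    if entries % ways != 0:
        raise ValueError("entries must be a multiple of ways")
    return entries // ways


def reach_bytes(entries, page_size=4096):
    """Memory covered when every entry maps one page (4 KB pages assumed)."""
    return entries * page_size


def mpki(misses, instructions):
    """Misses per kilo-instruction."""
    return misses * 1000 / instructions


if __name__ == "__main__":
    print("L1 TLB sets:", num_sets(64, 8))
    print("L2 TLB sets:", num_sets(1536, 12))
    print("L2 TLB reach (MB):", reach_bytes(1536) / (1024 * 1024))
    # Made-up counts, only to show the selection criterion in use.
    print("qualifies:", mpki(misses=2_500, instructions=1_000_000) > 1)
```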

  8. Performance Comparison
     [Figure: speedup (%) of the best-performing prior TLB prefetcher (per benchmark suite) versus ATP+SBFP for the Qualcomm, SPEC, and Big Data suites. Bar labels: 16.2%, 11.8%, 11.1%, 8.7%, 4.2%, 3.4%.]
     State-of-the-art TLB prefetchers [1]: Sequential Prefetcher (SP), Arbitrary Stride Prefetcher (ASP), Distance Prefetcher (DP)
     [1] G. B. Kandiraju and A. Sivasubramaniam, "Going the Distance for TLB Prefetching: An Application-driven Study", ISCA'02

  9. Conclusions
     This work reduces address translation overheads via TLB prefetching, a microarchitectural technique that relies on memory access patterns and is non-disruptive.
     Our proposal:
     - Sampling-based Free TLB Prefetching (SBFP): exploits page table locality to enhance the performance of prior and novel TLB prefetchers
     - Agile TLB Prefetcher (ATP): a composite TLB prefetcher that combines three low-cost TLB prefetchers and disables prefetching when the TLB miss stream is irregular
     Combining ATP with SBFP improves geomean performance by more than 10% across different benchmark suites and eliminates most of the page walk references to the memory hierarchy.

  10. Thank you! Contact: georgios.vavouliotis@bsc.es
