ECE 454 Computer Systems Programming
This presentation covers virtual memory concepts, the memory layout of Linux processes, the advantages of virtual memory over physical addressing, and the challenges of memory management, including fitting data into memory and allocating memory across processes. It closes with hardware and software prefetching.
ECE 454 Computer Systems Programming
Memory Performance (Part III: Virtual Memory and Prefetching)
Ding Yuan, ECE Dept., University of Toronto
http://www.eecg.toronto.edu/~yuan
Contents
Virtual memory (review, hopefully)
Prefetching
IA32 Linux Memory Layout
Stack: runtime stack (8 MB limit)
Data: statically allocated data, e.g., arrays and strings declared in code
Heap: dynamically allocated storage, from calls to malloc(), calloc(), or C++ new
Text: executable machine instructions (read-only)
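A minimal sketch (not from the slides) that makes this layout visible: print the address of one object from each segment on a Linux system. Exact addresses vary across runs because of ASLR, but the relative ordering of the segments should match the slide.

#include <stdio.h>
#include <stdlib.h>

int global_data = 42;                 /* .data: statically allocated */

int main(void) {
    int local = 0;                    /* lives on the runtime stack */
    int *heap = malloc(sizeof *heap); /* lives on the heap */

    printf("text  ~ %p\n", (void *)main);         /* code address (cast is a common extension) */
    printf("data  ~ %p\n", (void *)&global_data);
    printf("heap  ~ %p\n", (void *)heap);
    printf("stack ~ %p\n", (void *)&local);

    free(heap);
    return 0;
}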
Virtual Memory
Programs refer to virtual memory addresses
  Conceptually a very large array of bytes (2^32 bytes = 4 GB for IA32; 2^64 bytes = 16 exabytes for 64-bit)
  Each byte has its own address
  The system provides an address space private to each process
Allocation: compiler and run-time system
  Decide where different program objects are stored
  All allocation is within a single virtual address space
But why virtual memory? Why not physical memory?
A System Using Physical Addressing
[Diagram: the CPU puts a physical address (PA) directly on the bus; main memory (addresses 0 to M-1) returns the data word]
Used in simple embedded microcontrollers
Problems for larger, multi-process systems
Problem 1: How Does Everything Fit?
64-bit addresses cover 16 exabytes; physical main memory is a few gigabytes
And there are many processes
Problem 2: Memory Management
Each process (1 through n) needs its own stack, heap, .text, and .data in physical main memory
What goes where?
Problem 3: Portability
The same program (its stack, heap, .text, and .data) must run on machines with different physical memories
What goes where on each machine?
Problem 4: How To Protect?
Processes i and j must be kept from corrupting each other's physical memory
Problem 5: How To Share?
Processes i and j sometimes need to share physical memory deliberately
Solution: Add a Level of Indirection
Each process (1 through n) sees its own virtual memory, mapped onto physical memory
Each process gets its own private memory space
Solves the previous problems
A System Using Virtual Addressing
[Diagram: the CPU issues a virtual address (VA); the MMU on the CPU chip translates it to a physical address (PA) for main memory (addresses 0 to M-1), which returns the data word]
MMU = Memory Management Unit
The MMU keeps the mapping of VAs to PAs in a page table
VM Turns Main Memory into a Cache
Driven by an enormous miss penalty: disk is about 10,000x slower than DRAM
DRAM-cache design:
  Large page (block) size: typically 4 KB
  Fully associative: any virtual page can be placed in any physical page; requires a large mapping function, different from CPU caches
  Highly sophisticated, expensive replacement algorithms: too complicated and open-ended to be implemented in hardware
  Write-back rather than write-through
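As a quick sanity check, the page size can be queried at run time. A sketch assuming a POSIX system (Linux on x86 reports 4096 unless huge pages are in play):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    long page = sysconf(_SC_PAGESIZE);  /* page (block) size of the DRAM "cache" */
    printf("page size: %ld bytes\n", page);
    return 0;
}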
MMU Needs a Big Table of Translations
[Diagram: as before, but the MMU now holds a page table mapping VAs to PAs]
The MMU keeps the mapping of VAs to PAs in a page table
Reduce Hardware: Page Table in Memory
1) Processor sends the virtual address (VA) to the MMU
2-3) MMU requests the page table entry (PTE) from the page table in memory
4) MMU sends the physical address (PA) to cache/memory
5) Cache/memory sends the data word to the processor
If the page is not mapped in physical memory, this is called a page fault
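The translation step can be mirrored in a simplified single-level model. This sketch is my illustration, not the hardware's actual multi-level walk: with 4 KB pages, the low 12 bits of the VA are the page offset and the remaining bits select the PTE; a hypothetical pte[] array stands in for the in-memory page table.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                          /* 4 KB = 2^12 bytes */
#define PAGE_SIZE  (1u << PAGE_SHIFT)

/* Hypothetical page table: pte[vpn] holds the physical page number. */
uint64_t translate(uint64_t va, const uint64_t *pte) {
    uint64_t vpn    = va >> PAGE_SHIFT;        /* virtual page number */
    uint64_t offset = va & (PAGE_SIZE - 1);    /* byte within the page */
    uint64_t ppn    = pte[vpn];                /* page-table lookup (steps 2-3) */
    return (ppn << PAGE_SHIFT) | offset;       /* physical address (step 4) */
}

int main(void) {
    uint64_t pte[16] = {0};
    pte[2] = 7;                                /* map virtual page 2 -> physical page 7 */
    printf("PA = 0x%llx\n",                    /* prints 0x7abc */
           (unsigned long long)translate(0x2abc, pte));
    return 0;
}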
Page Fault Exception
1) Processor sends the virtual address to the MMU
2-3) MMU fetches the PTE from the page table in memory
4) PTE is absent, so the MMU triggers a page fault exception
5) Handler identifies a victim page (and, if dirty, pages it out to disk)
6) Handler pages in the new page and updates the PTE in memory
7) Handler returns to the original process, restarting the faulting instruction
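Demand paging can be observed from user space. A sketch assuming Linux: mmap reserves virtual pages with no physical backing, and the first touch of each page triggers a minor fault, a lightweight version of the fault path above (no disk is involved for anonymous pages).

#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

int main(void) {
    size_t len = 64 * 4096;                    /* 64 virtual pages, not yet backed */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;

    for (size_t i = 0; i < len; i += 4096)
        p[i] = 1;                              /* first touch faults in each page */

    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("minor faults so far: %ld\n", ru.ru_minflt);
    munmap(p, len);
    return 0;
}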
Speeding up Translation with a TLB
Page table entries (PTEs) are cached in L1 like any other memory word
  But PTEs may be evicted by other data references
  A PTE hit still requires a 1-cycle delay
Solution: Translation Lookaside Buffer (TLB)
  Small hardware cache in the MMU
  Caches PTEs for a small number of pages (e.g., 256 entries)
TLB Hit
[Diagram: CPU sends the VA to the MMU; the MMU finds the PTE in the TLB, translates to the PA, and cache/memory returns the data]
A TLB hit saves you from accessing memory for the page table
TLB Miss
[Diagram: CPU sends the VA to the MMU; the TLB misses, so the MMU fetches the PTE from the page table in memory, then issues the PA and cache/memory returns the data]
A TLB miss incurs an additional memory access (for the PTE)
How to Program for Virtual Memory
At any point in time, programs tend to access a set of active virtual pages called the working set
Programs with better temporal locality have smaller working sets
If the working set size exceeds main memory size:
  Thrashing: performance meltdown where pages are swapped (copied) in and out continuously
If the number of working-set pages exceeds the number of TLB entries:
  The program will suffer TLB misses
  Not as bad as page thrashing, but still worth avoiding (see the sketch below)
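To see why working-set pages matter independently of bytes touched, consider this sketch (sizes are hypothetical): each pass reads only one byte per page, so even though just 1 KB of data is used per pass, the 1024 distinct pages overflow a 256-entry TLB and nearly every access needs a page-table lookup.

#include <stdlib.h>

enum { PAGE = 4096, PAGES = 1024 };          /* 1024 pages > a 256-entry TLB */

int main(void) {
    char *buf = calloc(PAGES, PAGE);         /* 4 MB spread over 1024 pages */
    long sum = 0;
    for (int pass = 0; pass < 1000; pass++)
        for (int p = 0; p < PAGES; p++)
            sum += buf[(long)p * PAGE];      /* one byte per page: each access
                                                occupies its own TLB entry */
    free(buf);
    return (int)(sum & 1);
}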
More on TLBs
Assume a 256-entry TLB and 4 KB pages
  256 * 4 KB = 1 MB: the TLB can only hit for 1 MB of data
  This is called the TLB reach: the amount of memory the TLB can cover
A typical L2 cache is 6 MB
  Hence you can't have TLB hits for all of L2
  Hence consider TLB reach before L2 size when tiling?
Real CPUs have second-level TLBs
  This is getting complicated to reason about!
  Likely have to experiment to find the best tile size (HW2)
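The reach arithmetic from the slide, written out (assumed figures: 256 entries, 4 KB pages; check your own CPU's TLB sizes before trusting the bound):

#include <stdio.h>

int main(void) {
    long entries = 256, page = 4096;
    long reach = entries * page;                  /* 256 * 4 KB = 1 MB of coverage */
    long elems = reach / (long)sizeof(double);    /* doubles that fit within reach */
    printf("TLB reach = %ld KB (~%ld doubles)\n", reach / 1024, elems);
    /* A tile whose footprint exceeds this reach will miss in the TLB
       even if it still fits comfortably in L2. */
    return 0;
}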
Prefetching
Prefetching
ORIGINAL CODE:
  inst1
  inst2
  inst3
  inst4
  load X (misses cache)
  ... cache miss latency ...
  inst5 (must wait for load value)
  inst6
CODE WITH PREFETCHING:
  inst1
  prefetch X
  inst2
  inst3
  inst4
  ... cache miss latency overlaps with useful work ...
  load X (hits cache)
  inst5 (load value is ready)
  inst6
Basic idea:
  Predict which data will be needed soon (might be wrong)
  Initiate an early request for that data (like a load-to-cache)
  If effective, can be used to tolerate latency to memory
Prefetching is Difficult
Prefetching is effective only if all of these are true:
  There is spare memory bandwidth to begin with; otherwise prefetches could make things worse
  Prefetches are accurate: only useful if you prefetch data you will soon use
  Prefetches are timely: the right data arrives early enough to hide the miss latency
  Prefetched data doesn't displace other in-use data: e.g., bad if a prefetch replaces a cache block about to be used
  The latency hidden by prefetches outweighs their cost: the cost of many useless prefetches could be significant
Ineffective prefetching can hurt performance!
Hardware Prefetching
A simple hardware prefetcher:
  When one block is accessed, prefetch the adjacent block
  i.e., behaves as if blocks were twice as big
A more complex hardware prefetcher:
  Can recognize a "stream": addresses separated by a constant stride
  E.g. 1: 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, ... (stride = 0x1)
  E.g. 2: 0x100, 0x300, 0x500, 0x700, 0x900 (stride = 0x200)
  Prefetches predicted future addresses, e.g., current_address + stride*4 (see the toy model below)
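A toy software model of stride detection (an illustration only; real prefetchers are undocumented and far more elaborate): remember the last address and stride, and when two consecutive deltas match, predict current_address + stride*4 as on the slide.

#include <stdint.h>
#include <stdio.h>

static uint64_t last_addr, last_stride;

/* Returns a predicted prefetch address, or 0 if no stream is detected yet. */
uint64_t on_access(uint64_t addr) {
    uint64_t stride  = addr - last_addr;
    uint64_t predict = (stride && stride == last_stride) ? addr + stride * 4 : 0;
    last_stride = stride;
    last_addr   = addr;
    return predict;
}

int main(void) {
    uint64_t trace[] = {0x100, 0x300, 0x500, 0x700};   /* stride = 0x200 (Eg. 2) */
    for (int i = 0; i < 4; i++) {
        uint64_t p = on_access(trace[i]);
        if (p) printf("access 0x%llx -> prefetch 0x%llx\n",
                      (unsigned long long)trace[i], (unsigned long long)p);
    }
    return 0;
}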
Core 2 Hardware Prefetching
[Diagram, not to scale: memory hierarchy from CPU registers through 32 KB L1 I- and D-caches (64 B blocks), a 6 MB unified L2, ~4 GB main memory, and ~500 GB disk; prefetchers operate L2->L1 for instructions and data, and memory->L2 for data]
Includes next-block prefetching and multiple streaming prefetchers
They only prefetch within a page boundary (details are kept vague/secret)
Software Prefetching
Hardware provides special prefetch instructions:
  E.g., Intel's prefetchnta instruction
The compiler or programmer can insert them into the code:
  Can prefetch patterns that hardware wouldn't recognize (non-strided)

/* Original list traversal */
void process_list(list_t *head) {
    list_t *p = head;
    while (p) {
        process(p);
        p = p->next;
    }
}

/* With software prefetching: start fetching the next node before
   processing the current one. GCC/Clang's __builtin_prefetch stands in
   for the slide's prefetch(); it is safe even when q is NULL. */
void process_list_PF(list_t *head) {
    list_t *p = head;
    list_t *q;
    while (p) {
        q = p->next;
        __builtin_prefetch(q);   /* bring q's node toward the cache */
        process(p);
        p = q;
    }
}

Assumes process() takes long enough to hide the prefetch latency
Summary: Optimizing for a Modern Memory Hierarchy
Memory Optimization: Summary
Caches:
  Conflict misses: less of a concern due to high associativity (8-way L1, 16-way L2)
  Cache capacity: the main concern; keep the working set within on-chip cache capacity
  Focus on either L1 or L2 depending on the required working-set size
Virtual memory:
  Page misses: keep the big-picture working set within main memory capacity
  TLB misses: may want to keep working-set pages below the number of TLB entries
Prefetching:
  Try to arrange data structures and access patterns to favour sequential/strided access
  Try compiler-inserted or manually inserted prefetch instructions