ECE 454: Computer Systems Programming
Memory Performance (Part III: Virtual Memory and Prefetching)
Ding Yuan
ECE Dept., University of Toronto
http://www.eecg.toronto.edu/~yuan
Contents
- Virtual Memory (review, hopefully)
- Prefetching
IA32 Linux Memory Layout
- Stack: runtime stack (8MB limit)
- Data: statically allocated data, e.g., arrays and strings declared in code
- Heap: dynamically allocated storage, obtained by calling malloc(), calloc(), or new
- Text: executable machine instructions (read-only)
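To see this layout concretely, here is a minimal sketch (not from the slides) that prints one address from each region; on an IA32/x86-64 Linux machine the stack address should come out highest and the text address lowest:

#include <stdio.h>
#include <stdlib.h>

int global_data[4] = {1, 2, 3, 4};     /* data: statically allocated */

int main(void) {
    int local;                         /* stack: automatic variable */
    int *heap = malloc(sizeof *heap);  /* heap: dynamically allocated */

    printf("text  (code):   %p\n", (void *)main);
    printf("data  (static): %p\n", (void *)global_data);
    printf("heap  (malloc): %p\n", (void *)heap);
    printf("stack (local):  %p\n", (void *)&local);

    free(heap);
    return 0;
}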
 
Virtual Memory
- Programs refer to virtual memory addresses
  - Conceptually a very large array of bytes (4GB for IA32, 16 exabytes for 64 bits)
  - Each byte has its own address
  - The system provides an address space private to each "process"
- Allocation: the compiler and run-time system decide where different program objects are stored
  - All allocation is within a single virtual address space
- But why virtual memory? Why not physical memory?
A System Using Physical Addressing
[Figure: the CPU sends a physical address (PA) directly to main memory, which returns the data word]
- Used in "simple" embedded microcontrollers
- Problematic for larger, multi-process systems
Problem 1: How Does Everything Fit?
- 64-bit addresses can name 16 exabytes
- Physical main memory: a few gigabytes
- And there are many processes...
Problem 2: Memory Management
- Each process needs a stack, heap, .text, and .data
- With processes 1 through n sharing one physical main memory: what goes where?
Problem 3: Portability
- The same programA (stack, heap, .text, .data) may run on many machines
- Machine 1 and Machine 2 have different physical main memories: what goes where on each?
Problem 4: How To Protect?
- Process i and process j share physical main memory; each must be kept from reading or corrupting the other's data

Problem 5: How To Share?
- Conversely, process i and process j sometimes need access to the same region of physical main memory
Solution: Add a Level of Indirection
[Figure: process 1 through process n each have their own virtual memory, mapped onto the shared physical memory]
- Each process gets its own private memory space
- Solves the previous problems
A System Using Virtual Addressing
[Figure: the CPU sends a virtual address (VA) to the MMU on the CPU chip; the MMU translates it into a physical address (PA) for main memory, which returns the data word]
- MMU = Memory Management Unit
- The MMU keeps the mapping of VAs -> PAs in a "page table"
VM Turns Main Memory into a Cache
- Driven by the enormous miss penalty: disk is about 10,000x slower than DRAM
- DRAM-cache design:
  - Large page (block) size: typically 4KB
  - Fully associative: any virtual page (VP) can be placed in any physical page (PP)
    - Requires a "large" mapping function, unlike CPU caches
  - Highly sophisticated, expensive replacement algorithms
    - Too complicated and open-ended to be implemented in hardware
  - Write-back rather than write-through
MMU Needs a Big Table of Translations
[Figure: as before, but the page table of VA -> PA mappings is drawn inside the MMU]
- The MMU keeps the mapping of VAs -> PAs in a "page table"
Reduce Hardware: Page Table in Memory
[Figure: the page table is stored in main memory; the MMU fetches PTEs from it on demand]
1) Processor sends virtual address (VA) to MMU
2-3) MMU requests the page table entry (PTE) from the page table in memory
4) MMU sends physical address (PA) to cache/memory
5) Cache/memory sends the data word to the processor
If the page is not mapped in physical memory, this is called a "page fault"
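As a concrete model of steps 1-4, here is a minimal single-level page-table walk in C. This is a sketch with assumed sizes and names; real IA32 hardware uses multi-level page tables:

#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12                    /* 4KB pages */
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define NUM_PAGES  1024                  /* toy address space */

/* Hypothetical PTE: a valid bit plus the physical page number. */
typedef struct { bool valid; uint32_t ppn; } pte_t;

static pte_t page_table[NUM_PAGES];      /* lives "in memory", indexed by VPN */

/* Translate a VA the way the MMU does: split it into a virtual page
 * number (VPN) and a page offset, look the VPN up in the page table,
 * and splice the physical page number (PPN) onto the offset. */
bool translate(uint32_t va, uint32_t *pa) {
    uint32_t vpn    = va >> PAGE_SHIFT;
    uint32_t offset = va & (PAGE_SIZE - 1);
    if (vpn >= NUM_PAGES || !page_table[vpn].valid)
        return false;                    /* page fault: the OS must step in */
    *pa = (page_table[vpn].ppn << PAGE_SHIFT) | offset;
    return true;
}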
Page Fault
[Figure: the MMU raises an exception to the CPU; the OS page fault handler pages a victim page out to disk and the new page in from disk]
1) Processor sends virtual address to MMU
2-3) MMU fetches PTE from page table in memory
4) PTE absent, so MMU triggers a page fault exception
5) Handler identifies a victim page (and, if dirty, pages it out to disk)
6) Handler pages in the new page and updates the PTE in memory
7) Handler returns to the original process, restarting the faulting instruction
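Steps 5-7 can be sketched in the same toy model. Everything here is illustrative: arrays stand in for physical memory and disk, and a trivial FIFO stands in for the sophisticated replacement policy a real OS would use:

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define NUM_PAGES  1024                           /* virtual pages */
#define NUM_FRAMES 16                             /* physical frames: far fewer */

typedef struct { bool valid; bool dirty; uint32_t ppn; } pte_t;

static pte_t    page_table[NUM_PAGES];
static uint8_t  phys_mem[NUM_FRAMES][PAGE_SIZE];
static uint8_t  disk[NUM_PAGES][PAGE_SIZE];       /* backing store */
static uint32_t frame_owner[NUM_FRAMES];          /* which VPN holds each frame */
static uint32_t hand;                             /* trivial FIFO "policy" */

void handle_page_fault(uint32_t vpn) {
    /* Step 5: pick a victim frame; if its page is dirty, write the
     * old contents back to disk before reusing the frame. */
    uint32_t frame  = hand++ % NUM_FRAMES;
    uint32_t victim = frame_owner[frame];
    if (page_table[victim].valid) {
        if (page_table[victim].dirty)
            memcpy(disk[victim], phys_mem[frame], PAGE_SIZE);
        page_table[victim].valid = false;
    }
    /* Step 6: page in the new page and update its PTE. */
    memcpy(phys_mem[frame], disk[vpn], PAGE_SIZE);
    page_table[vpn] = (pte_t){ .valid = true, .dirty = false, .ppn = frame };
    frame_owner[frame] = vpn;
    /* Step 7: on return, hardware restarts the faulting instruction. */
}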
Speeding up Translation with a TLB
- Page table entries (PTEs) are cached in L1 like any other memory word
  - But PTEs may be evicted by other data references
  - And even a PTE hit still requires a 1-cycle delay
- Solution: Translation Lookaside Buffer (TLB)
  - A small hardware cache in the MMU
  - Caches the PTEs for a small number of pages (e.g., 256 entries)
TLB Hit
[Figure: 1) CPU sends VA to MMU; 2-3) MMU looks the VA up in the TLB and gets the PTE; 4) MMU sends PA to cache/memory; 5) the data word returns to the CPU]
- A TLB hit saves you from accessing memory for the page table
TLB Miss
[Figure: 1-2) CPU sends VA to MMU, which misses in the TLB; 3-4) MMU requests the PTE from the page table in memory and fills the TLB; 5-6) MMU sends PA to cache/memory, and the data word returns to the CPU]
- A TLB miss incurs an additional memory access (the PTE fetch); a toy model follows
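To make the hit/miss distinction concrete, here is a sketch of a small direct-mapped TLB in front of the page-table walk from the earlier sketch. The sizes and names are assumptions for illustration; real TLBs are set-associative or fully associative:

#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT  12
#define NUM_PAGES   1024
#define TLB_ENTRIES 256                     /* e.g., a 256-entry TLB */

typedef struct { bool valid; uint32_t ppn; } pte_t;
typedef struct { bool valid; uint32_t vpn; uint32_t ppn; } tlb_entry_t;

static pte_t       page_table[NUM_PAGES];   /* in "memory" */
static tlb_entry_t tlb[TLB_ENTRIES];        /* in the MMU; the VPN is the tag */
static unsigned long tlb_hits, tlb_misses;

/* Slow path: the single-level page-table walk (one extra memory access). */
static bool page_table_lookup(uint32_t vpn, uint32_t *ppn) {
    if (vpn >= NUM_PAGES || !page_table[vpn].valid)
        return false;                       /* page fault */
    *ppn = page_table[vpn].ppn;
    return true;
}

/* VA -> PA, consulting the TLB first; a miss walks the page table
 * and fills the (direct-mapped) TLB entry. */
bool translate_tlb(uint32_t va, uint32_t *pa) {
    uint32_t vpn = va >> PAGE_SHIFT;
    uint32_t idx = vpn % TLB_ENTRIES;
    if (tlb[idx].valid && tlb[idx].vpn == vpn) {
        tlb_hits++;                         /* TLB hit: no page-table access */
    } else {
        tlb_misses++;                       /* TLB miss: fetch the PTE */
        uint32_t ppn;
        if (!page_table_lookup(vpn, &ppn))
            return false;
        tlb[idx] = (tlb_entry_t){ true, vpn, ppn };
    }
    *pa = (tlb[idx].ppn << PAGE_SHIFT) | (va & ((1u << PAGE_SHIFT) - 1));
    return true;
}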
How to Program for Virtual Memory
- At any point in time, programs tend to access a set of active virtual pages called the working set
  - Programs with better temporal locality have smaller working sets
- If (working set size) > (main memory size):
  - Thrashing: a performance meltdown in which pages are swapped (copied) in and out continuously
- If (# working set pages) > (# TLB entries):
  - Will suffer TLB misses
  - Not as bad as page thrashing, but still worth avoiding
More on TLBs
- Assume a 256-entry TLB and 4KB pages
  - 256 * 4KB = 1MB: the TLB can only deliver hits for 1MB of data
  - This is called the "TLB reach": the amount of memory the TLB can cover
- A typical L2 cache is 6MB
  - Hence the TLB can't deliver hits for all of L2
  - Hence perhaps tile sizes should be chosen for TLB reach before L2 size?
- Real CPUs also have second-level TLBs
  - This is getting complicated to reason about!
  - You will likely have to experiment to find the best tile size (HW2); see the back-of-envelope sketch below
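A quick back-of-envelope sketch of that trade-off. The cache and TLB parameters are the assumed ones from this slide, and a three-operand tiled matrix kernel (A, B, C tiles of doubles) is assumed for the footprint:

#include <stdio.h>

int main(void) {
    const long page_size   = 4 * 1024;                 /* 4KB pages */
    const long tlb_entries = 256;                      /* 256-entry TLB */
    const long l2_bytes    = 6L * 1024 * 1024;         /* 6MB L2 */
    const long tlb_reach   = tlb_entries * page_size;  /* = 1MB */

    printf("TLB reach: %ld KB\n", tlb_reach / 1024);

    /* Footprint of a T x T tile of doubles for three operands. */
    for (long t = 64; t <= 1024; t *= 2) {
        long bytes = 3 * t * t * (long)sizeof(double);
        printf("tile %4ld: %6ld KB  (%s TLB reach, %s L2)\n",
               t, bytes / 1024,
               bytes <= tlb_reach ? "within" : "exceeds",
               bytes <= l2_bytes  ? "within" : "exceeds");
    }
    return 0;
}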
Prefetching
Prefetching
- Basic idea:
  - Predict which data will be needed soon (the prediction might be wrong)
  - Initiate an early request for that data (like a load-to-cache)
  - If effective, prefetching can be used to tolerate latency to memory

ORIGINAL CODE:
   inst1
   inst2
   inst3
   inst4
   load X (misses cache)        <-- cache miss latency begins here
   inst5 (must wait for load value)
   inst6

CODE WITH PREFETCHING:
   inst1
   prefetch X                   <-- cache miss latency overlaps the insts below
   inst2
   inst3
   inst4
   load X (hits cache)
   inst5 (load value is ready)
   inst6
Prefetching is Difficult
Prefetching is effective only if all of these are true:
- There is spare memory bandwidth to begin with
  - Otherwise prefetches could make things worse
- Prefetches are accurate
  - Only useful if you prefetch data you will soon use
- Prefetches are timely
  - i.e., a prefetch of the right data that doesn't arrive early enough is useless
- Prefetched data doesn't displace other in-use data
  - e.g., bad if a prefetch replaces a cache block that is about to be used
- The latency hidden by prefetches outweighs their cost
  - The cost of many useless prefetches could be significant
Ineffective prefetching can hurt performance!
Hardware Prefetching
- A simple hardware prefetcher:
  - When one block is accessed, prefetch the adjacent block
  - i.e., behaves as if blocks were twice as big
- A more complex hardware prefetcher:
  - Can recognize a "stream": addresses separated by a constant "stride"
  - Eg1: 0x1, 0x2, 0x3, 0x4, 0x5, 0x6... (stride = 0x1)
  - Eg2: 0x100, 0x300, 0x500, 0x700, 0x900... (stride = 0x200)
  - Prefetches predicted future addresses, e.g., current_address + stride*4 (modelled below)
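A toy software model of such a stride detector. This is purely illustrative: real prefetcher internals are undocumented, and every name here is made up:

#include <stdint.h>

/* Track one access stream: the last address seen, the last stride,
 * and how many times in a row that stride has repeated. */
typedef struct {
    uint64_t last_addr;
    int64_t  stride;
    int      confidence;
} stream_t;

/* Called on each demand access. Returns an address worth prefetching,
 * or 0 when there is no confident prediction yet. */
uint64_t observe_access(stream_t *s, uint64_t addr) {
    int64_t stride = (int64_t)(addr - s->last_addr);
    if (stride != 0 && stride == s->stride)
        s->confidence++;                /* same stride again: more confident */
    else
        s->confidence = 0;              /* stride changed: start over */
    s->stride    = stride;
    s->last_addr = addr;

    /* After two consecutive matching strides, run ahead of the program
     * by a few strides (the slide's current_address + stride*4). */
    if (s->confidence >= 2)
        return addr + (uint64_t)(4 * s->stride);
    return 0;
}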
Core 2 Hardware Prefetching
[Figure: memory hierarchy, not drawn to scale: CPU registers; 32KB L1 instruction and data caches; 6MB unified L2 cache; ~4GB main memory; ~500GB (?) disk; L1 and L2 use 64B blocks]
- L2->L1 data prefetching
- L2->L1 instruction prefetching
- Mem->L2 data prefetching
  - Includes next-block prefetching and multiple streaming prefetchers
  - They will only prefetch within a page boundary
  - (details are kept vague/secret)
Software Prefetching
- Hardware provides special prefetch instructions:
  - e.g., Intel's prefetchnta instruction
- The compiler or programmer can insert them into the code:
  - Can prefetch patterns that hardware wouldn't recognize (e.g., non-strided pointer chasing)

void process_list(list_t *head) {
    list_t *p = head;
    while (p) {
        process(p);
        p = p->next;
    }
}

void process_list_PF(list_t *head) {
    list_t *p = head;
    list_t *q;
    while (p) {
        q = p->next;     /* read the next pointer early... */
        prefetch(q);     /* ...and start pulling *q into the cache */
        process(p);      /* the prefetch overlaps with this work */
        p = q;
    }
}

Assumes process() is long enough to hide the prefetch latency
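The slide's prefetch() is pseudocode; a compilable version might use the GCC/Clang __builtin_prefetch intrinsic, which emits a prefetch hint and is safe even on a NULL pointer (the list_t definition and the trivial process() body below are made-up stand-ins):

typedef struct list { struct list *next; int payload; } list_t;

/* Stand-in workload; assumed to run long enough to hide the
 * prefetch latency, as the slide notes. */
static void process(list_t *p) { p->payload *= 2; }

void process_list_PF(list_t *head) {
    for (list_t *p = head, *q; p; p = q) {
        q = p->next;
        __builtin_prefetch(q);   /* hint: start bringing *q toward the cache */
        process(p);
    }
}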
Summary: Optimizing for a Modern Memory Hierarchy
Memory Optimization: Summary
- Caches
  - Conflict misses: less of a concern due to high associativity (8-way L1, 16-way L2)
  - Cache capacity: the main concern is keeping the working set within on-chip cache capacity
    - Focus on either L1 or L2 depending on the required working-set size
- Virtual memory
  - Page misses: keep the "big-picture" working set within main memory capacity
  - TLB misses: may want to keep (# working set pages) < (# TLB entries)
- Prefetching
  - Try to arrange data structures and access patterns to favour sequential/strided access
  - Try compiler- or manually-inserted prefetch instructions