ECE 454: Computer Systems Programming
Memory Performance (Part III: Virtual Memory and Prefetching)
Ding Yuan
ECE Dept., University of Toronto
http://www.eecg.toronto.edu/~yuan
Contents
- Virtual Memory (review, hopefully)
- Prefetching
IA32 Linux Memory Layout
- Stack: runtime stack (8MB limit)
- Data: statically allocated data, e.g., arrays and strings declared in code
- Heap: dynamically allocated storage, obtained by calling malloc(), calloc(), or new
- Text: executable machine instructions (read-only)
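To see this layout concretely, here is a minimal sketch (not from the slides) that prints one address from each region; on an IA32/x86-64 Linux machine the stack address should come out highest and the text address lowest:

#include <stdio.h>
#include <stdlib.h>

int global_data[4] = {1, 2, 3, 4};     /* data: statically allocated */

int main(void) {
    int local;                         /* stack: automatic variable */
    int *heap = malloc(sizeof *heap);  /* heap: dynamically allocated */

    printf("text  (code):   %p\n", (void *)main);
    printf("data  (static): %p\n", (void *)global_data);
    printf("heap  (malloc): %p\n", (void *)heap);
    printf("stack (local):  %p\n", (void *)&local);

    free(heap);
    return 0;
}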
 
Virtual Memory
- Programs refer to virtual memory addresses
  - Conceptually a very large array of bytes (4GB for IA32, 16 exabytes for 64 bits)
  - Each byte has its own address
  - The system provides an address space private to each "process"
- Allocation: the compiler and run-time system decide where different program objects are stored
  - All allocation is within a single virtual address space
- But why virtual memory? Why not physical memory?
A System Using Physical Addressing
[Figure: the CPU sends a physical address (PA) directly to main memory, which returns the data word]
- Used in "simple" embedded microcontrollers
- Problematic for larger, multi-process systems
Problem 1: How Does Everything Fit?
- 64-bit addresses can name 16 exabytes
- Physical main memory: a few gigabytes
- And there are many processes...
Problem 2: Memory Management
- Each process needs a stack, heap, .text, and .data
- With processes 1 through n sharing one physical main memory: what goes where?
Problem 3: Portability
- The same programA (stack, heap, .text, .data) may run on many machines
- Machine 1 and Machine 2 have different physical main memories: what goes where on each?
Problem 4: How To Protect?
- Process i and process j share physical main memory; each must be kept from reading or corrupting the other's data

Problem 5: How To Share?
- Conversely, process i and process j sometimes need access to the same region of physical main memory
Solution: Add a Level of Indirection
[Figure: process 1 through process n each have their own virtual memory, mapped onto the shared physical memory]
- Each process gets its own private memory space
- Solves the previous problems
A System Using Virtual Addressing
[Figure: the CPU sends a virtual address (VA) to the MMU on the CPU chip; the MMU translates it into a physical address (PA) for main memory, which returns the data word]
- MMU = Memory Management Unit
- The MMU keeps the mapping of VAs -> PAs in a "page table"
VM Turns Main Memory into a Cache
- Driven by the enormous miss penalty: disk is about 10,000x slower than DRAM
- DRAM-cache design:
  - Large page (block) size: typically 4KB
  - Fully associative: any virtual page (VP) can be placed in any physical page (PP)
    - Requires a "large" mapping function, unlike CPU caches
  - Highly sophisticated, expensive replacement algorithms
    - Too complicated and open-ended to be implemented in hardware
  - Write-back rather than write-through
MMU Needs a Big Table of Translations
[Figure: as before, but the page table of VA -> PA mappings is drawn inside the MMU]
- The MMU keeps the mapping of VAs -> PAs in a "page table"
Reduce Hardware: Page Table in Memory
[Figure: the page table is stored in main memory; the MMU fetches PTEs from it on demand]
1) Processor sends virtual address (VA) to MMU
2-3) MMU requests the page table entry (PTE) from the page table in memory
4) MMU sends physical address (PA) to cache/memory
5) Cache/memory sends the data word to the processor
If the page is not mapped in physical memory, this is called a "page fault"
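As a concrete model of steps 1-4, here is a minimal single-level page-table walk in C. This is a sketch with assumed sizes and names; real IA32 hardware uses multi-level page tables:

#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12                    /* 4KB pages */
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define NUM_PAGES  1024                  /* toy address space */

/* Hypothetical PTE: a valid bit plus the physical page number. */
typedef struct { bool valid; uint32_t ppn; } pte_t;

static pte_t page_table[NUM_PAGES];      /* lives "in memory", indexed by VPN */

/* Translate a VA the way the MMU does: split it into a virtual page
 * number (VPN) and a page offset, look the VPN up in the page table,
 * and splice the physical page number (PPN) onto the offset. */
bool translate(uint32_t va, uint32_t *pa) {
    uint32_t vpn    = va >> PAGE_SHIFT;
    uint32_t offset = va & (PAGE_SIZE - 1);
    if (vpn >= NUM_PAGES || !page_table[vpn].valid)
        return false;                    /* page fault: the OS must step in */
    *pa = (page_table[vpn].ppn << PAGE_SHIFT) | offset;
    return true;
}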
Page Fault
[Figure: the MMU raises an exception to the CPU; the OS page fault handler pages a victim page out to disk and the new page in from disk]
1) Processor sends virtual address to MMU
2-3) MMU fetches PTE from page table in memory
4) PTE absent, so MMU triggers a page fault exception
5) Handler identifies a victim page (and, if dirty, pages it out to disk)
6) Handler pages in the new page and updates the PTE in memory
7) Handler returns to the original process, restarting the faulting instruction
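Steps 5-7 can be sketched in the same toy model. Everything here is illustrative: arrays stand in for physical memory and disk, and a trivial FIFO stands in for the sophisticated replacement policy a real OS would use:

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define NUM_PAGES  1024                           /* virtual pages */
#define NUM_FRAMES 16                             /* physical frames: far fewer */

typedef struct { bool valid; bool dirty; uint32_t ppn; } pte_t;

static pte_t    page_table[NUM_PAGES];
static uint8_t  phys_mem[NUM_FRAMES][PAGE_SIZE];
static uint8_t  disk[NUM_PAGES][PAGE_SIZE];       /* backing store */
static uint32_t frame_owner[NUM_FRAMES];          /* which VPN holds each frame */
static uint32_t hand;                             /* trivial FIFO "policy" */

void handle_page_fault(uint32_t vpn) {
    /* Step 5: pick a victim frame; if its page is dirty, write the
     * old contents back to disk before reusing the frame. */
    uint32_t frame  = hand++ % NUM_FRAMES;
    uint32_t victim = frame_owner[frame];
    if (page_table[victim].valid) {
        if (page_table[victim].dirty)
            memcpy(disk[victim], phys_mem[frame], PAGE_SIZE);
        page_table[victim].valid = false;
    }
    /* Step 6: page in the new page and update its PTE. */
    memcpy(phys_mem[frame], disk[vpn], PAGE_SIZE);
    page_table[vpn] = (pte_t){ .valid = true, .dirty = false, .ppn = frame };
    frame_owner[frame] = vpn;
    /* Step 7: on return, hardware restarts the faulting instruction. */
}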
Speeding up Translation with a TLB
- Page table entries (PTEs) are cached in L1 like any other memory word
  - But PTEs may be evicted by other data references
  - And even a PTE hit still requires a 1-cycle delay
- Solution: Translation Lookaside Buffer (TLB)
  - A small hardware cache in the MMU
  - Caches the PTEs for a small number of pages (e.g., 256 entries)
TLB Hit
[Figure: 1) CPU sends VA to MMU; 2-3) MMU looks the VA up in the TLB and gets the PTE; 4) MMU sends PA to cache/memory; 5) the data word returns to the CPU]
- A TLB hit saves you from accessing memory for the page table
TLB Miss
[Figure: 1-2) CPU sends VA to MMU, which misses in the TLB; 3-4) MMU requests the PTE from the page table in memory and fills the TLB; 5-6) MMU sends PA to cache/memory, and the data word returns to the CPU]
- A TLB miss incurs an additional memory access (the PTE fetch); a toy model follows
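To make the hit/miss distinction concrete, here is a sketch of a small direct-mapped TLB in front of the page-table walk from the earlier sketch. The sizes and names are assumptions for illustration; real TLBs are set-associative or fully associative:

#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT  12
#define NUM_PAGES   1024
#define TLB_ENTRIES 256                     /* e.g., a 256-entry TLB */

typedef struct { bool valid; uint32_t ppn; } pte_t;
typedef struct { bool valid; uint32_t vpn; uint32_t ppn; } tlb_entry_t;

static pte_t       page_table[NUM_PAGES];   /* in "memory" */
static tlb_entry_t tlb[TLB_ENTRIES];        /* in the MMU; the VPN is the tag */
static unsigned long tlb_hits, tlb_misses;

/* Slow path: the single-level page-table walk (one extra memory access). */
static bool page_table_lookup(uint32_t vpn, uint32_t *ppn) {
    if (vpn >= NUM_PAGES || !page_table[vpn].valid)
        return false;                       /* page fault */
    *ppn = page_table[vpn].ppn;
    return true;
}

/* VA -> PA, consulting the TLB first; a miss walks the page table
 * and fills the (direct-mapped) TLB entry. */
bool translate_tlb(uint32_t va, uint32_t *pa) {
    uint32_t vpn = va >> PAGE_SHIFT;
    uint32_t idx = vpn % TLB_ENTRIES;
    if (tlb[idx].valid && tlb[idx].vpn == vpn) {
        tlb_hits++;                         /* TLB hit: no page-table access */
    } else {
        tlb_misses++;                       /* TLB miss: fetch the PTE */
        uint32_t ppn;
        if (!page_table_lookup(vpn, &ppn))
            return false;
        tlb[idx] = (tlb_entry_t){ true, vpn, ppn };
    }
    *pa = (tlb[idx].ppn << PAGE_SHIFT) | (va & ((1u << PAGE_SHIFT) - 1));
    return true;
}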
How to Program for Virtual Memory
- At any point in time, programs tend to access a set of active virtual pages called the working set
  - Programs with better temporal locality have smaller working sets
- If (working set size) > (main memory size):
  - Thrashing: a performance meltdown in which pages are swapped (copied) in and out continuously
- If (# working set pages) > (# TLB entries):
  - Will suffer TLB misses
  - Not as bad as page thrashing, but still worth avoiding
More on TLBs
- Assume a 256-entry TLB and 4KB pages
  - 256 * 4KB = 1MB: the TLB can only deliver hits for 1MB of data
  - This is called the "TLB reach": the amount of memory the TLB can cover
- A typical L2 cache is 6MB
  - Hence the TLB can't deliver hits for all of L2
  - Hence perhaps tile sizes should be chosen for TLB reach before L2 size?
- Real CPUs also have second-level TLBs
  - This is getting complicated to reason about!
  - You will likely have to experiment to find the best tile size (HW2); see the back-of-envelope sketch below
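A quick back-of-envelope sketch of that trade-off. The cache and TLB parameters are the assumed ones from this slide, and a three-operand tiled matrix kernel (A, B, C tiles of doubles) is assumed for the footprint:

#include <stdio.h>

int main(void) {
    const long page_size   = 4 * 1024;                 /* 4KB pages */
    const long tlb_entries = 256;                      /* 256-entry TLB */
    const long l2_bytes    = 6L * 1024 * 1024;         /* 6MB L2 */
    const long tlb_reach   = tlb_entries * page_size;  /* = 1MB */

    printf("TLB reach: %ld KB\n", tlb_reach / 1024);

    /* Footprint of a T x T tile of doubles for three operands. */
    for (long t = 64; t <= 1024; t *= 2) {
        long bytes = 3 * t * t * (long)sizeof(double);
        printf("tile %4ld: %6ld KB  (%s TLB reach, %s L2)\n",
               t, bytes / 1024,
               bytes <= tlb_reach ? "within" : "exceeds",
               bytes <= l2_bytes  ? "within" : "exceeds");
    }
    return 0;
}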
Prefetching
Prefetching
- Basic idea:
  - Predict which data will be needed soon (the prediction might be wrong)
  - Initiate an early request for that data (like a load-to-cache)
  - If effective, prefetching can be used to tolerate latency to memory

ORIGINAL CODE:
   inst1
   inst2
   inst3
   inst4
   load X (misses cache)        <-- cache miss latency begins here
   inst5 (must wait for load value)
   inst6

CODE WITH PREFETCHING:
   inst1
   prefetch X                   <-- cache miss latency overlaps the insts below
   inst2
   inst3
   inst4
   load X (hits cache)
   inst5 (load value is ready)
   inst6
Prefetching is Difficult
Prefetching is effective only if all of these are true:
- There is spare memory bandwidth to begin with
  - Otherwise prefetches could make things worse
- Prefetches are accurate
  - Only useful if you prefetch data you will soon use
- Prefetches are timely
  - i.e., a prefetch of the right data that doesn't arrive early enough is useless
- Prefetched data doesn't displace other in-use data
  - e.g., bad if a prefetch replaces a cache block that is about to be used
- The latency hidden by prefetches outweighs their cost
  - The cost of many useless prefetches could be significant
Ineffective prefetching can hurt performance!
Hardware Prefetching
- A simple hardware prefetcher:
  - When one block is accessed, prefetch the adjacent block
  - i.e., behaves as if blocks were twice as big
- A more complex hardware prefetcher:
  - Can recognize a "stream": addresses separated by a constant "stride"
  - Eg1: 0x1, 0x2, 0x3, 0x4, 0x5, 0x6... (stride = 0x1)
  - Eg2: 0x100, 0x300, 0x500, 0x700, 0x900... (stride = 0x200)
  - Prefetches predicted future addresses, e.g., current_address + stride*4 (modelled below)
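A toy software model of such a stride detector. This is purely illustrative: real prefetcher internals are undocumented, and every name here is made up:

#include <stdint.h>

/* Track one access stream: the last address seen, the last stride,
 * and how many times in a row that stride has repeated. */
typedef struct {
    uint64_t last_addr;
    int64_t  stride;
    int      confidence;
} stream_t;

/* Called on each demand access. Returns an address worth prefetching,
 * or 0 when there is no confident prediction yet. */
uint64_t observe_access(stream_t *s, uint64_t addr) {
    int64_t stride = (int64_t)(addr - s->last_addr);
    if (stride != 0 && stride == s->stride)
        s->confidence++;                /* same stride again: more confident */
    else
        s->confidence = 0;              /* stride changed: start over */
    s->stride    = stride;
    s->last_addr = addr;

    /* After two consecutive matching strides, run ahead of the program
     * by a few strides (the slide's current_address + stride*4). */
    if (s->confidence >= 2)
        return addr + (uint64_t)(4 * s->stride);
    return 0;
}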
Core 2 Hardware Prefetching
[Figure: memory hierarchy, not drawn to scale: CPU registers; 32KB L1 instruction and data caches; 6MB unified L2 cache; ~4GB main memory; ~500GB (?) disk; L1 and L2 use 64B blocks]
- L2->L1 data prefetching
- L2->L1 instruction prefetching
- Mem->L2 data prefetching
  - Includes next-block prefetching and multiple streaming prefetchers
  - They will only prefetch within a page boundary
  - (details are kept vague/secret)
Software Prefetching
- Hardware provides special prefetch instructions:
  - e.g., Intel's prefetchnta instruction
- The compiler or programmer can insert them into the code:
  - Can prefetch patterns that hardware wouldn't recognize (e.g., non-strided pointer chasing)

void process_list(list_t *head) {
    list_t *p = head;
    while (p) {
        process(p);
        p = p->next;
    }
}

void process_list_PF(list_t *head) {
    list_t *p = head;
    list_t *q;
    while (p) {
        q = p->next;     /* read the next pointer early... */
        prefetch(q);     /* ...and start pulling *q into the cache */
        process(p);      /* the prefetch overlaps with this work */
        p = q;
    }
}

Assumes process() is long enough to hide the prefetch latency
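The slide's prefetch() is pseudocode; a compilable version might use the GCC/Clang __builtin_prefetch intrinsic, which emits a prefetch hint and is safe even on a NULL pointer (the list_t definition and the trivial process() body below are made-up stand-ins):

typedef struct list { struct list *next; int payload; } list_t;

/* Stand-in workload; assumed to run long enough to hide the
 * prefetch latency, as the slide notes. */
static void process(list_t *p) { p->payload *= 2; }

void process_list_PF(list_t *head) {
    for (list_t *p = head, *q; p; p = q) {
        q = p->next;
        __builtin_prefetch(q);   /* hint: start bringing *q toward the cache */
        process(p);
    }
}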
Summary: Optimizing for a Modern Memory Hierarchy
Memory Optimization: Summary
- Caches
  - Conflict misses: less of a concern due to high associativity (8-way L1, 16-way L2)
  - Cache capacity: the main concern is keeping the working set within on-chip cache capacity
    - Focus on either L1 or L2 depending on the required working-set size
- Virtual memory
  - Page misses: keep the "big-picture" working set within main memory capacity
  - TLB misses: may want to keep (# working set pages) < (# TLB entries)
- Prefetching
  - Try to arrange data structures and access patterns to favour sequential/strided access
  - Try compiler- or manually-inserted prefetch instructions