Cache Performance Components and Memory Hierarchy

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy

5.4 Measuring Cache Performance

Components of CPU time:
- Program execution cycles (includes cache hit time)
- Memory stall cycles (mainly from cache misses)

With simplifying assumptions:

Memory stall cycles / Program
  = (Memory accesses / Program) × Miss rate × Miss penalty
  = (Instructions / Program) × (Misses / Instruction) × Miss penalty

Cache Performance Example

Given:
- I-cache miss rate = 2%
- D-cache miss rate = 4%
- Miss penalty = 100 cycles
- Base CPI (ideal cache) = 2
- Loads and stores are 36% of instructions

Miss cycles per instruction:
- I-cache: 0.02 × 100 = 2
- D-cache: 0.36 × 0.04 × 100 = 1.44

Actual CPI = 2 + 2 + 1.44 = 5.44, so the ideal-cache CPU is 5.44/2 = 2.72 times faster.
What happens if the processor is made faster?
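
These formulas are easy to turn into a quick calculator. The sketch below is ours, not from the chapter; the function name is illustrative and the constants are taken from the example above:

  #include <stdio.h>

  /* Effective CPI = base CPI + instruction-fetch stalls + data-access stalls.
     mem_frac is the fraction of instructions that are loads or stores. */
  static double effective_cpi(double base_cpi, double i_miss, double d_miss,
                              double mem_frac, double penalty)
  {
      return base_cpi + i_miss * penalty + mem_frac * d_miss * penalty;
  }

  int main(void)
  {
      double cpi = effective_cpi(2.0, 0.02, 0.04, 0.36, 100.0);
      printf("Actual CPI = %.2f (%.2fx slower than ideal)\n", cpi, cpi / 2.0);
      /* Prints: Actual CPI = 5.44 (2.72x slower than ideal) */
      return 0;
  }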

Average Access Time

Hit time is also important for performance.
Average memory access time (AMAT):

AMAT = Hit time + Miss rate × Miss penalty

Example:
- CPU with 1 ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5%
- AMAT = 1 + 0.05 × 20 = 2 ns, i.e. 2 cycles per instruction fetch
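
The AMAT formula is a one-liner; a small helper (ours, using the example's numbers) confirms the result:

  #include <stdio.h>

  /* Average memory access time, in the same unit as hit_time and penalty. */
  static double amat(double hit_time, double miss_rate, double penalty)
  {
      return hit_time + miss_rate * penalty;
  }

  int main(void)
  {
      printf("AMAT = %.1f ns\n", amat(1.0, 0.05, 20.0));  /* 2.0 ns */
      return 0;
  }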

Performance Summary

As CPU performance increases, the miss penalty becomes more significant:
- Decreasing the base CPI means a greater proportion of time is spent on memory stalls
- Increasing the clock rate means memory stalls account for more CPU cycles

Cache behavior can't be neglected when evaluating system performance.

Associative Caches

Fully associative:
- Allow a given block to go in any cache entry
- Requires all entries to be searched at once
- Comparator per entry (expensive)

n-way set associative:
- Each set contains n entries
- Block number determines which set: (block number) modulo (#sets in cache)
- Search all entries in a given set at once
- n comparators (less expensive)
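
The set lookup itself is just modular arithmetic. A minimal sketch (our names, assuming the block number is already extracted from the address):

  #include <stdio.h>

  /* Split a block number into set index and tag for a set-associative cache. */
  static unsigned set_index(unsigned block, unsigned num_sets) { return block % num_sets; }
  static unsigned tag_bits (unsigned block, unsigned num_sets) { return block / num_sets; }

  int main(void)
  {
      /* e.g. block 12 in a cache with 4 sets maps to set 0, tag 3 */
      printf("set %u, tag %u\n", set_index(12, 4), tag_bits(12, 4));
      return 0;
  }

On an access, only the entries of one set are compared against the tag, which is why n-way lookup needs only n comparators.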

Associative Cache Example

[Figure: placement of a block under direct-mapped, set-associative, and fully associative organizations.]


Spectrum of Associativity

For a cache with 8 entries: [figure showing the same 8 entries organized as 1-way (direct mapped), 2-way, 4-way, and 8-way (fully associative)]

Associativity Example

Compare 4-block caches under three organizations: direct mapped, 2-way set associative, and fully associative.
Block access sequence: 0, 8, 0, 6, 8

Direct mapped (index = block address modulo 4):

Block addr  Index  Hit/miss  Set 0   Set 1  Set 2   Set 3
0           0      miss      Mem[0]  -      -       -
8           0      miss      Mem[8]  -      -       -
0           0      miss      Mem[0]  -      -       -
6           2      miss      Mem[0]  -      Mem[6]  -
8           0      miss      Mem[8]  -      Mem[6]  -

All five accesses miss.

Associativity Example (cont.)

2-way set associative (index = block address modulo 2, LRU replacement):

Block addr  Index  Hit/miss  Set 0           Set 1
0           0      miss      Mem[0]          -
8           0      miss      Mem[0], Mem[8]  -
0           0      hit       Mem[0], Mem[8]  -
6           0      miss      Mem[0], Mem[6]  -
8           0      miss      Mem[8], Mem[6]  -

Four misses: the access to 6 evicts the least-recently used Mem[8], and the final access to 8 then evicts Mem[0].

Fully associative:

Block addr  Hit/miss  Cache content after access
0           miss      Mem[0]
8           miss      Mem[0], Mem[8]
0           hit       Mem[0], Mem[8]
6           miss      Mem[0], Mem[8], Mem[6]
8           hit       Mem[0], Mem[8], Mem[6]

Three misses.
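
The three tables can be reproduced with a small brute-force simulator. This sketch is ours, not from the chapter: it models a 4-entry cache with LRU replacement, and associativity 1, 2, and 4 yields the direct-mapped, 2-way, and fully associative results above.

  #include <stdio.h>

  #define NUM_BLOCKS 4   /* total cache entries */

  /* Simulate an LRU cache of NUM_BLOCKS entries, organized as
     (NUM_BLOCKS / ways) sets of `ways` entries each, on a trace
     of block addresses. */
  static void simulate(int ways, const int *trace, int n)
  {
      int sets = NUM_BLOCKS / ways;
      int block[NUM_BLOCKS];   /* block stored in each entry; -1 = invalid */
      int last[NUM_BLOCKS];    /* time of last use, for LRU */
      int time = 0, misses = 0;

      for (int i = 0; i < NUM_BLOCKS; i++) { block[i] = -1; last[i] = 0; }

      printf("%d-way:", ways);
      for (int t = 0; t < n; t++) {
          int b = trace[t];
          int base = (b % sets) * ways;    /* first entry of b's set */
          int hit = 0, victim = base;

          for (int w = 0; w < ways; w++) {
              int e = base + w;
              if (block[e] == b) { hit = 1; victim = e; break; }
              if (block[e] == -1 && block[victim] != -1)
                  victim = e;              /* prefer an invalid entry */
              else if (block[e] != -1 && block[victim] != -1 &&
                       last[e] < last[victim])
                  victim = e;              /* else the least recently used */
          }
          if (!hit) { block[victim] = b; misses++; }
          last[victim] = ++time;
          printf(" %s", hit ? "hit " : "miss");
      }
      printf("  -> %d misses\n", misses);
  }

  int main(void)
  {
      int trace[] = {0, 8, 0, 6, 8};
      simulate(1, trace, 5);   /* direct mapped:    5 misses */
      simulate(2, trace, 5);   /* 2-way set assoc.: 4 misses */
      simulate(4, trace, 5);   /* fully assoc.:     3 misses */
      return 0;
  }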

How Much Associativity

Increased associativity decreases the miss rate, but with diminishing returns.
Simulation of a system with a 64 KB D-cache, 16-word blocks, running SPEC2000:
- 1-way: 10.3%
- 2-way: 8.6%
- 4-way: 8.3%
- 8-way: 8.1%

Set Associative Cache Organization

[Figure: a set-associative cache — the index selects a set, tags are compared in parallel, and a multiplexor selects the data on a hit.]


Replacement Policy

Direct mapped: no choice.
Set associative:
- Prefer a non-valid entry, if there is one
- Otherwise, choose among the entries in the set

Least-recently used (LRU):
- Choose the entry unused for the longest time
- Simple for 2-way, manageable for 4-way, too hard beyond that (see the sketch after this list)

Random:
- Gives approximately the same performance as LRU for high associativity
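
For 2-way sets, "simple" means one bit per set: whenever a way is used, point the bit at the other way, which is by construction the least recently used. A minimal sketch (ours; the set count is illustrative):

  #include <stdio.h>

  #define NUM_SETS 1024                     /* illustrative cache geometry */
  static unsigned char lru[NUM_SETS];       /* lru[set] = way to evict next */

  static void touch(int set, int way)       /* call on every hit or fill */
  {
      lru[set] = (unsigned char)(1 - way);  /* the other way is now LRU */
  }

  static int victim(int set)                /* way to replace on a miss */
  {
      return lru[set];
  }

  int main(void)
  {
      touch(0, 0);                          /* way 0 used: way 1 is LRU */
      printf("evict way %d\n", victim(0));  /* prints: evict way 1 */
      return 0;
  }

Exact LRU for higher associativity needs an ordering over n ways, which is why hardware beyond 4-way typically approximates it or picks randomly.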

Multilevel Caches

Primary cache attached to the CPU:
- Small, but fast

Level-2 cache services misses from the primary cache:
- Larger and slower, but still faster than main memory

Main memory services L-2 cache misses.
Some high-end systems also include an L-3 cache.

Multilevel Cache Example

Given:
- CPU base CPI = 1, clock rate = 4 GHz (so one cycle = 0.25 ns)
- Miss rate = 2% per instruction
- Main memory access time = 100 ns

With just the primary cache:
- Miss penalty = 100 ns / 0.25 ns = 400 cycles
- Effective CPI = 1 + 0.02 × 400 = 9

Example (cont.)

Now add an L-2 cache:
- Access time = 5 ns
- Global miss rate to main memory = 0.5%

Primary miss with L-2 hit:
- Penalty = 5 ns / 0.25 ns = 20 cycles

Primary miss with L-2 miss:
- Extra penalty = 400 cycles (the main memory access)

CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
Performance ratio = 9 / 3.4 = 2.6
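
A quick numeric check of both configurations (our helper, with the example's constants):

  #include <stdio.h>

  int main(void)
  {
      double base = 1.0;            /* base CPI with an ideal cache        */
      double l1_miss = 0.02;        /* primary misses per instruction      */
      double l2_miss = 0.005;       /* global misses to memory, per instr. */
      double l2_penalty = 20.0;     /* cycles to reach L-2                 */
      double mem_penalty = 400.0;   /* cycles to reach main memory         */

      double l1_only = base + l1_miss * mem_penalty;
      double with_l2 = base + l1_miss * l2_penalty + l2_miss * mem_penalty;
      printf("L1 only: CPI %.1f; with L2: CPI %.1f; ratio %.2f\n",
             l1_only, with_l2, l1_only / with_l2);   /* 9.0, 3.4, 2.65 */
      return 0;
  }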

Multilevel Cache Considerations

Primary cache:
- Focus on minimizing hit time

L-2 cache:
- Focus on a low miss rate to avoid main memory accesses
- Hit time has less overall impact

Results:
- The L-1 cache is usually smaller than a single-level cache would be
- L-1 block size is smaller than L-2 block size

Interactions with Advanced CPUs

Out-of-order CPUs can execute instructions during a cache miss:
- Pending store stays in the load/store unit
- Dependent instructions wait in reservation stations
- Independent instructions continue

The effect of a miss depends on program data flow:
- Much harder to analyse
- Use system simulation

Interactions with Software

Misses depend on memory access patterns:
- Algorithm behavior
- Compiler optimization for memory access

Software Optimization via Blocking

Goal: maximize accesses to the data before it is replaced.
Consider the inner loops of DGEMM (i is fixed by an enclosing row loop, not shown):

  for (int j = 0; j < n; ++j)
  {
    double cij = C[i+j*n];           /* cij = C[i][j] */
    for (int k = 0; k < n; k++)
      cij += A[i+k*n] * B[k+j*n];    /* cij += A[i][k] * B[k][j] */
    C[i+j*n] = cij;                  /* C[i][j] = cij */
  }

DGEMM Access Pattern

[Figure: accesses to the C, A, and B arrays; shading distinguishes older accesses from new accesses.]

Cache Blocked DGEMM

#define BLOCKSIZE 32

void do_block(int n, int si, int sj, int sk,
              double *A, double *B, double *C)
{
  for (int i = si; i < si+BLOCKSIZE; ++i)
    for (int j = sj; j < sj+BLOCKSIZE; ++j)
    {
      double cij = C[i+j*n];           /* cij = C[i][j] */
      for (int k = sk; k < sk+BLOCKSIZE; k++)
        cij += A[i+k*n] * B[k+j*n];    /* cij += A[i][k] * B[k][j] */
      C[i+j*n] = cij;                  /* C[i][j] = cij */
    }
}

void dgemm(int n, double *A, double *B, double *C)
{
  for (int sj = 0; sj < n; sj += BLOCKSIZE)
    for (int si = 0; si < n; si += BLOCKSIZE)
      for (int sk = 0; sk < n; sk += BLOCKSIZE)
        do_block(n, si, sj, sk, A, B, C);
}
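
A minimal driver (ours, compiled together with the routines above) makes the code's assumptions explicit: n must be a multiple of BLOCKSIZE, and C must start zeroed because do_block accumulates partial sums into it across the sk blocks:

  #include <stdlib.h>

  int main(void)
  {
      int n = 128;                                   /* multiple of BLOCKSIZE */
      double *A = malloc((size_t)n * n * sizeof *A);
      double *B = malloc((size_t)n * n * sizeof *B);
      double *C = calloc((size_t)n * n, sizeof *C);  /* zeroed accumulator */
      if (!A || !B || !C) return 1;

      for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; }

      dgemm(n, A, B, C);
      /* every element of C should now equal n * 1.0 * 2.0 = 256.0 */

      free(A); free(B); free(C);
      return 0;
  }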

Blocked DGEMM Access Pattern

[Figure: the blocked version's access pattern compared with the unoptimized version.]