Trends in Implicit Parallelism and Microprocessor Architectures

Parallel Programming Platforms
Implicit Parallelism:
Trends in Microprocessor Architectures,
Limitations of Memory System Performance
Prof Vijay More, MET, IOE, BKC, Nashik
 
Scope of Parallelism
Conventional architectures comprise a
processor, a memory system, and a datapath.
 
Each of these components presents
considerable performance bottlenecks.
 
Parallelism addresses each of these components
in significant ways.
 
Different applications utilize different aspects
of parallelism since their requirements differ,
e.g.,
o data-intensive applications utilize high aggregate throughput,
o server applications utilize high aggregate network bandwidth,
o and scientific applications typically utilize high processing and memory system performance.
It is important to understand each of these
performance bottlenecks.
Implicit Parallelism: Trends in Microprocessor Architectures
Microprocessor clock speeds have recorded
impressive gains over the past two decades
(two to three orders of magnitude).
 
Higher levels of device integration have
made available a large number of
transistors.
How best to utilize these resources?

Current processors use these resources in
multiple functional units and execute
multiple instructions in the same cycle.
 
The precise manner in which these
instructions are selected and executed
provides remarkable diversity in
architectures.
 
The compiler incurs increased overhead in
finding out parallelism from the code if it
is implicit.
Current Generation Computer Systems
1 BlueGene/L (IBM / LLNL)
2 Purple (IBM / LLNL)
3 Q (HP / LANL)
4 Red Storm (Cray / SNL)
5 X-1 (Cray / ORNL)
6 Earth Simulator (NEC / Japan)
7 Linux Clusters (Various Developers / Various Sites)
8 System X (Apple / Virginia Tech)
9 Columbia (SGI / NASA)
10 MareNostrum (IBM / Spain)
Limitations of Memory System Performance
 
The memory system (and not processor speed) is
often the bottleneck for many applications.

Memory system performance is largely affected by
two parameters, latency and bandwidth.

Latency is the time from the issue of a memory
request to the time the data is available at the
processor.

Bandwidth is the rate at which data can be
transferred between the processor and the memory
system.
Memory System Performance: Bandwidth and Latency
Consider the example of a fire-hose. If the water
comes out of the hose two seconds after the
hydrant is turned on, the latency of the system is
two seconds.

Once the water starts flowing, if the hydrant
delivers water at the rate of 5 gallons/second, the
bandwidth of the system is 5 gallons/second.

If you want immediate response from the hydrant,
it is important to reduce latency.

If you want to fight big fires, you want high
bandwidth.
Memory Latency: An Example
Consider a processor operating at 1 GHz (1 ns
clock) connected to a DRAM with a latency of 100 ns
(without caches).
Assume that the processor has two multiply-add
units and is capable of executing four instructions in
each cycle of 1 ns.
The following observations can be made:
The peak processor rating is 4 GFLOPS.
Since the memory latency is 100 cycles and block
size is one word, every time a memory request is
made, the processor must wait 100 cycles before it
can process the data.
Memory Latency: An Example
On the above architecture, consider the problem of
computing a dot-product of two vectors.
A dot-product computation performs one multiply-
add on a single pair of vector elements, i.e., each
floating point operation requires one data fetch.
It follows that the peak speed of this computation is
limited to one floating point operation every 100 ns,
or a speed of 10 MFLOPS, a very small fraction of
the peak processor rating!
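The loop being analyzed can be sketched as follows. A minimal sketch, not tuned code: with a block size of one word, each multiply-add must wait on fresh operand fetches, so the 100 ns memory latency, not the 4 GFLOPS peak, bounds its speed.

```cpp
#include <cstddef>

// Dot product: one multiply-add per pair of elements, i.e., every floating
// point operation is paired with a fetch of fresh operands from memory.
double dot_product(const double* a, const double* b, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];   // with 1-word blocks, each iteration waits on DRAM
    return sum;
}
```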
FLOPS
Floating point operations per second.
Floating point operations are heavily used by
science and engineering computational
applications. This is one of the measures of
performance for a system, abbreviated as
flops, with prefixes:
mega 10^6 megaflops MFLOPS
giga 10^9 gigaflops GFLOPS
tera 10^12 teraflops TFLOPS
peta 10^15 petaflops PFLOPS
MFLOPS
Megaflops, also called millions of floating point operations
per second. The MFLOPS rating is computed by dividing the
number of operations by the execution time:
Ex. multiplication of nxn matrices
The number of operations is 2n^3 - n^2:
n^2 inner products, with n multiplications and n-1 additions
in each product.
If a 100x100 matrix multiplication is computed in 0.35
seconds, then
MFLOPS rate = (2(100)^3 - 100^2) / 0.35 ≈ 5,685,714 ops per
second ≈ 5.69 MFLOPS
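The rating can be replayed in code. The operation count below, n^2 * (2n - 1), follows directly from the decomposition stated above (n^2 inner products, each with n multiplications and n-1 additions); the 0.35 s execution time is the slide's example figure.

```cpp
// Operation count for n x n matrix multiplication:
// n*n inner products, each with n multiplications and (n - 1) additions.
long long matmul_ops(long long n) {
    return n * n * (2 * n - 1);   // = 2n^3 - n^2
}

// MFLOPS rating = number of operations / execution time / 10^6.
double mflops(long long ops, double seconds) {
    return ops / seconds / 1.0e6;
}
```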
Improving Effective Memory Latency Using Caches
Caches are small and fast memory elements
between the processor and DRAM.
Their important property is low-latency, high-bandwidth
storage.
If a piece of data is repeatedly used, the effective
latency of this memory system can be reduced by
the cache.
The fraction of data references satisfied by the
cache is called the cache hit ratio of the computation
on the system.
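A hit ratio h blends the two latencies into an effective latency seen by the processor. The numbers in the usage below (h = 0.9, 1 ns cache, 100 ns DRAM) are illustrative assumptions, not figures from the slides.

```cpp
// Effective memory latency with one cache level: hits are served at cache
// latency, misses at DRAM latency, weighted by the hit ratio.
double effective_latency_ns(double hit_ratio, double cache_ns, double dram_ns) {
    return hit_ratio * cache_ns + (1.0 - hit_ratio) * dram_ns;
}
```

With the assumed figures this evaluates to about 10.9 ns, an order of magnitude below the raw DRAM latency.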
Impact of Caches: Example
Consider the architecture from the previous example.
In this case, we introduce a cache of size 32 KB with
a latency of 1 ns or one cycle.
We use this setup to multiply two matrices A and B of
dimensions 32 × 32. We have carefully chosen these
numbers so that the cache is large enough to store
matrices A and B, as well as the result matrix C.
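A quick check of the claim that the cache is large enough. The 8-byte (double-precision) word size is an assumption; the slides do not state it.

```cpp
// Total footprint of the three n x n matrices (A, B, and the result C),
// assuming 8-byte double-precision words.
long long matmul_footprint_bytes(long long n, long long word_bytes) {
    return 3 * n * n * word_bytes;
}
```

For n = 32 and 8-byte words the footprint is 24 KB, comfortably under the 32 KB cache.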
Impact of Caches: Example (continued)
Observations:
• Fetching the two matrices into the cache
corresponds to fetching 2K words, which takes
approximately 200 μs (2000 × 100 ns).
• Multiplying two n × n matrices takes 2n^3 operations.
For our problem, this corresponds to 64K operations,
which can be performed in 16K cycles (or 16 μs) at
four instructions per cycle.
• The total time for the computation is
approximately
= time for load/store operations + time for the computation itself
= (200 + 16) μs = 216 μs
• This corresponds to a peak computation
rate of 64K FLOP / 216 μs, or 303 MFLOPS.
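The estimate above can be replayed directly, using the slide's rounded figures (2000 words at 100 ns each, 64K operations at four per 1 ns cycle):

```cpp
// Replaying the slide's estimate: load/store time plus compute time,
// then the resulting computation rate in MFLOPS.
double cached_matmul_mflops() {
    double load_us    = 2000.0 * 100.0 * 1.0e-3;  // ~200 us of load/store time
    double compute_us = 65536.0 / 4.0 * 1.0e-3;   // 16K cycles, ~16 us of compute
    return 65536.0 / (load_us + compute_us);      // FLOP per microsecond = MFLOPS
}
```

This evaluates to approximately 303 MFLOPS, far above the 10 MFLOPS of the uncached dot-product but still well below the 4 GFLOPS peak.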
Theoretical Peak MFLOPS
How many numerical operations the machine can perform in one
second without performing other tasks.
Given the time required to compute one operation, how many of
them can be performed in one second.
Ex. If 8 cycles are needed to perform one numeric multiplication
and the cycle time is 20 ns, then it requires 160 ns to perform
one multiplication. For one second:
(1 multiplication / 160 ns) × (1,000,000,000 ns / 1 sec)
= 6.25 x 10^6 multiplications / second
= 6.25 MFLOPS
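The same calculation in code, with the slide's figures (8 cycles per multiplication, 20 ns cycle time):

```cpp
// Theoretical peak rate: one operation every (cycles_per_op x cycle_ns) ns,
// sustained for a full second with no other work.
double peak_mflops(double cycles_per_op, double cycle_ns) {
    double ns_per_op = cycles_per_op * cycle_ns;  // 8 x 20 ns = 160 ns
    return 1.0e9 / ns_per_op / 1.0e6;             // ops per second, in millions
}
```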
Impact of Caches
• In our example, we had O(n^2) data accesses and
O(n^3) computation, which is a desirable situation
for caches.
Memory hierarchy (capacity and access time increase toward the
lower levels; cost per bit increases toward the upper levels):
Level 0: Registers in CPU
Level 1: Cache (SRAM)
Level 2: Main Memory (DRAM)
Level 3: Disk Storage (solid state, magnetic)
Level 4: Backup Storage (magnetic tape, optical disk)
Figure: Inclusion property and data transfer between adjacent
levels of the memory hierarchy.
CPU registers ↔ M1 (cache): access by word (4, 8 bytes) from a
cache block of 32 bytes, e.g., block a.
M1 ↔ M2 (main memory): access by block (32 bytes) from a memory
page of 32 blocks, e.g., block b from page B.
M2 ↔ M3 (disk storage): access by page (1 KByte) from a file
segment, e.g., segment F.
M4: magnetic tape unit (backup storage).
Inclusion, Coherence, and Locality Properties
Information stored in a memory hierarchy satisfies
three properties: inclusion, coherence, and locality.
The cache memory at the innermost level M1 directly
communicates with the registers of the CPU.
All information words are stored at the outermost
level Mn.
The set inclusion relationship is:
M1 ⊂ M2 ⊂ M3 ⊂ . . . . ⊂ Mn
This shows that all information items are
originally stored in the outermost level Mn.
At the time of processing, a subset of Mn is
copied into Mn-1; similarly, a subset of Mn-1 is
copied into Mn-2, and so on.
If an information word is found in Mi, then copies
of the same word can also be found in all upper
levels Mi+1, Mi+2, . . . ., Mn.
Whereas a word found in Mi+1 may not be found
in Mi.
A word miss in Mi indicates that it is also missing
from all lower levels Mi-1, Mi-2, . . . ., M1.
The highest level is the backup storage, where
everything can be found.
Information transfer between the CPU and cache is
in terms of words (8, 16 bytes).
The cache (M1) is divided into cache blocks,
called cache lines.
Each block is 32 bytes (e.g., blocks a and b in the figure).
Main memory (M2) is divided into pages
(4 KBytes each).
Each page has 128 blocks.
Pages are the unit of transfer between main
memory and disk storage.
[Transfers between adjacent levels: CPU registers ↔ cache,
per-word transfer of 4-8 bytes; cache ↔ main memory, 32-byte
block transfer; main memory ↔ disk storage, 4 KByte page
transfer; disk ↔ backup, file transfer of 512 KBytes.]
Scattered pages are organized in the form of
segments in disk memory. The size of a segment
varies depending on user needs.
Coherence Property:
Copies of the same information item at successive levels
must be consistent.
If a word is modified in the cache, it must be updated
at all higher levels; the hierarchy should be maintained
in such a way.
Frequently used information is often available at a
lower level to minimize access time.
Coherence is maintained in two ways:
i. Write-through (WT), which demands an immediate update
in Mi+1 when a word in Mi is updated.
ii. Write-back (WB), which delays the update in Mi+1
when the word in Mi is modified.
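A toy model (not from the slides) makes the difference concrete: write-through generates one next-level update per store, while write-back generates one only when a dirty block is evicted.

```cpp
#include <map>

// Toy model of one cache level M_i backed by a next level M_{i+1}.
struct Level {
    std::map<int, int> data;       // address -> value held at this level
    std::map<int, bool> dirty;     // write-back: modified but not yet propagated
    int next_level_writes = 0;     // updates that reached M_{i+1}
};

// Write-through: update M_i and M_{i+1} on every store.
void write_through(Level& m, int addr, int value) {
    m.data[addr] = value;
    ++m.next_level_writes;         // immediate update of the next level
}

// Write-back: update M_i only and mark the block dirty.
void write_back(Level& m, int addr, int value) {
    m.data[addr] = value;
    m.dirty[addr] = true;          // delayed update
}

// On eviction, a dirty block is finally propagated to M_{i+1}.
void evict(Level& m, int addr) {
    if (m.dirty[addr]) {
        ++m.next_level_writes;
        m.dirty[addr] = false;
    }
}
```

Two stores to the same word cost two next-level writes under WT but only one under WB, at the price of a window during which M_{i+1} is stale.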
Locality of Reference Property:
A particular portion of the memory address space is
accessed frequently by a program during its
execution in any time window, e.g., an innermost loop.
There are three dimensions of the locality property:
1. Temporal locality
2. Spatial locality
3. Sequential locality
1. Temporal Locality
Items referred to recently are likely to be referenced
in the near future by loops, stacks, temporary
variables, subroutines, etc.
Once a loop is entered or a subroutine is called,
a small code segment is referenced many times
repeatedly.
Thus temporal locality is clustered in recently used
areas.
2. Spatial Locality
This is the tendency of a process to access items
whose addresses are near one another; e.g., tables
and arrays involve access to areas which are
clustered together.
Program segments containing subroutines and
macros are kept together in the neighbourhood
of memory space.
3. Sequential Locality
Execution of instructions in a program follows a
sequential order unless out-of-order
instructions are encountered.
The ratio of in-order execution to out-of-order
execution is generally 5 to 1.
Hit Ratio
If the desired information is found in Mi — hit
If the desired information is not found in Mi — miss
The hit ratio hi at Mi is the probability that the
information item will be found in Mi.
The miss ratio is defined as (1 - hi).
The hit ratio of successive levels is a function of
memory capacity, program behaviour, etc.
Assume h0 = 0 and hn = 1.
Access Frequency
The access frequency to Mi is fi, where
fi = (1-h1)(1-h2) . . . . (1-hi-1) hi
is the probability of a successful access to Mi
when there are i-1 misses at the lower levels
and a hit at Mi.
Note that Σ(i=1 to n) fi = 1, and f1 = h1.
Access frequency decreases rapidly due to the
locality property, i.e.,
f1 ≫ f2 ≫ . . . . ≫ fn
which also indicates that inner memory levels
are accessed more frequently than outer levels.
Effective Access Time
Misses are called block misses at the cache and
page faults at main memory.
The time penalty for a page fault is much larger
than for a block miss.
A cache miss is 2 to 4 times as costly as a cache hit,
but a page fault is 1000 to 10000 times as costly as a
page hit.
The effective access time
Teff = Σ(i=1 to n) fi·ti
     = h1·t1 + (1-h1)h2·t2 + (1-h1)(1-h2)h3·t3 + . . . . + (1-h1)(1-h2)...(1-hn-1)·tn
Still, the effective access time depends on program
behaviour and memory hierarchy design.
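The frequencies and Teff can be evaluated numerically. The sketch below folds the product of (1-hj) terms into a running miss probability; the hit ratios and access times used in the check are illustrative, chosen to match the three-level design example later in the notes.

```cpp
#include <cstddef>
#include <vector>

// f_i = (1-h_1)(1-h_2)...(1-h_{i-1}) * h_i  and  Teff = sum of f_i * t_i.
// The last entry of h should be 1 (h_n = 1) so the frequencies sum to 1.
double effective_access_time(const std::vector<double>& h,
                             const std::vector<double>& t) {
    double miss_so_far = 1.0;   // running product of (1 - h_j) over lower levels
    double teff = 0.0;
    for (std::size_t i = 0; i < h.size(); ++i) {
        teff += miss_so_far * h[i] * t[i];   // f_i * t_i
        miss_so_far *= (1.0 - h[i]);
    }
    return teff;
}
```

With h = {0.98, 0.99, 1.0} and t = {25 ns, 1250 ns, 4 ms} this evaluates to roughly 849 ns, dominated by the rare but very slow disk accesses.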
Hierarchical Optimization
Total cost: Ctotal = Σ(i=1 to n) Ci·si
Since cost C1 > C2 > . . . . > Cn,
we need to select sizes (capacities)
s1 < s2 < . . . . < sn
The memory hierarchy should result in a Teff
(effective access time) close to t1 of M1.
The effective access time
Teff = Σ(i=1 to n) fi·ti
The optimized effective access time should be
minimized and close to t1.
Ctotal must be less than C0,
where C0 is the ceiling on the total cost.
There are tradeoffs at all levels, and this represents
a non-deterministic problem (LPP).
Example: Design a three-level memory hierarchy for the given
parameters.
Memory level   Access time    Capacity          Cost / KByte
Cache          t1 = 25 ns     s1 = 512 KBytes   c1 = $0.12
Main memory    t2 = unknown   s2 = 32 MBytes    c2 = $0.02
Disk           t3 = 4 ms      s3 = unknown      c3 = $0.00002
The goal is to achieve an effective access time t = 850 ns
with a cache hit ratio h1 = 0.98 and h2 = 0.99 in main memory.
C0 is $1500.
Effective access time
t = h1·t1 + (1-h1)h2·t2 + (1-h1)(1-h2)h3·t3
850x10^-9 = 0.98x25x10^-9 + (1-0.98)x0.99xt2 + (1-0.98)x(1-0.99)x1x4x10^-3
t2 = 25.5 ns / 0.0198 ≈ 1288 ns
1500 = 0.12x512 + 0.02x32x10^3 + 0.00002xs3
s3 ≈ 40 GBytes max
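The two constraints of the example can be solved numerically. The sketch below uses the slide's figures, and, following the slide's cost equation, treats 32 MBytes as 32x10^3 KBytes.

```cpp
// Constraint 1: 850 = h1*t1 + (1-h1)*h2*t2 + (1-h1)*(1-h2)*h3*t3 (ns);
// solve for the unknown main-memory access time t2.
double solve_t2_ns() {
    const double h1 = 0.98, h2 = 0.99, h3 = 1.0;
    const double t1 = 25.0, t3 = 4.0e6;   // ns; t3 = 4 ms
    const double target = 850.0;          // ns
    return (target - h1 * t1 - (1.0 - h1) * (1.0 - h2) * h3 * t3)
           / ((1.0 - h1) * h2);
}

// Constraint 2: 1500 = c1*s1 + c2*s2 + c3*s3 ($); solve for the
// unknown disk capacity s3 in KBytes.
double solve_s3_kbytes() {
    const double c1 = 0.12, c2 = 0.02, c3 = 0.00002;  // $ per KByte
    const double s1 = 512.0, s2 = 32.0e3;             // KBytes
    const double budget = 1500.0;
    return (budget - c1 * s1 - c2 * s2) / c3;
}
```

This gives t2 ≈ 1288 ns and s3 ≈ 39.9x10^6 KBytes, i.e., about 40 GBytes of disk at most.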
Memory design implications
Temporal locality suggests the least recently used (LRU)
replacement algorithm.
Spatial locality determines the size of unit data transfer
between adjacent memory levels.
Sequential locality affects grain packing for optimal
scheduling. Prefetching is heavily used.
Impact of Memory Bandwidth

Memory bandwidth is determined by the
bandwidth of the memory bus as well as the
memory units.

Memory bandwidth can be improved by
increasing the size of memory blocks.
Consider the same setup as before, except in this
case, the block size is 4 words instead of 1 word.
We repeat the dot-product computation in this
scenario:
• Assuming that the vectors are laid out linearly in
memory, eight FLOPs (four multiply-adds) can be
performed in 200 cycles.
• This is because a single memory access fetches
four consecutive words in the vector.
• Therefore, two accesses can fetch four elements
of each of the vectors. This corresponds to 1 FLOP
every 25 ns, for a peak speed of 40 MFLOPS.
• It is important to note that increasing the block size
does not change the latency of the system.
• Physically, this scenario can be viewed as a wide
data bus (4 words or 128 bits) connected to multiple
memory banks.
• In practice, such wide buses are expensive to
construct.
• In a more practical system, consecutive words are
sent on the memory bus on subsequent bus cycles
after the first word is retrieved.
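The peak-speed arithmetic for the two block sizes can be sketched as follows; the 100-cycle (100 ns) block fetch and one multiply-add per element pair are the slide's assumptions.

```cpp
// Peak dot-product speed as a function of block size: two block fetches
// (one per vector, 100 ns each) deliver words_per_block element pairs,
// each contributing one multiply-add (2 FLOPs).
double dot_product_mflops(int words_per_block) {
    double fetch_ns = 2.0 * 100.0;           // two block fetches
    double flops = 2.0 * words_per_block;    // multiply-adds on fetched pairs
    return flops / fetch_ns * 1.0e3;         // FLOPs per ns -> MFLOPS
}
```

One-word blocks reproduce the earlier 10 MFLOPS figure; four-word blocks give 40 MFLOPS.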
Data access patterns are one of the factors
which can be used to improve performance.
In spatial locality of reference, consecutive
data words in memory are used by
successive instructions.
If we take a data-layout centric view,
computations must be reordered to enhance
spatial locality of reference.
Consider the following code fragment:
for (i = 0; i < 1000; i++)
  column_sum[i] = 0.0;
for (i = 0; i < 1000; i++)
  for (j = 0; j < 1000; j++)
    column_sum[i] += b[j][i];
The code fragment sums columns of the matrix b into
a vector column_sum.
Example for quick reference of the data layout:
int a[2][3];
for (int i = 0; i < 2; i++) {
  for (int j = 0; j < 3; j++)
    cout << '\t' << &a[i][j];
  cout << endl;
}
-----------------------------------------------------------
0x7fff9987c700  0x7fff9987c704  0x7fff9987c708
0x7fff9987c70c  0x7fff9987c710  0x7fff9987c714
We can fix the above code as follows:
for (i = 0; i < 1000; i++)
  column_sum[i] = 0.0;
for (j = 0; j < 1000; j++)
  for (i = 0; i < 1000; i++)
    column_sum[i] += b[j][i];
In this case, the matrix is traversed in row order and
performance can be expected to be significantly better.
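The two loop orders compute identical results; only the traversal order of b differs, with the j-outer version walking consecutive addresses as in the printout above. A small equivalence check, using std::vector and a 2 × 2 matrix for brevity:

```cpp
#include <cstddef>
#include <vector>

// i outer: walks b column by column (stride-n accesses, cache-unfriendly).
std::vector<double> col_sum_i_outer(const std::vector<std::vector<double>>& b) {
    std::size_t n = b.size();
    std::vector<double> s(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            s[i] += b[j][i];
    return s;
}

// j outer: walks b row by row (unit-stride accesses, cache-friendly).
std::vector<double> col_sum_j_outer(const std::vector<std::vector<double>>& b) {
    std::size_t n = b.size();
    std::vector<double> s(n, 0.0);
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            s[i] += b[j][i];
    return s;
}
```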