Trends in Implicit Parallelism and Microprocessor Architectures

Parallel Programming Platforms
Implicit Parallelism:
Trends in Microprocessor Architectures,
Limitations of Memory System Performance
Prof Vijay More, MET, IOE, BKC, Nashik
 
Scope of Parallelism
Conventional architectures comprise a
processor, a memory system, and a datapath.
 
Each of these components presents
considerable performance bottlenecks.
 
Parallelism addresses each of these components
in significant ways.
 
Different applications utilize different aspects
of parallelism since their requirements differ,
e.g.,
o data-intensive applications utilize high aggregate throughput,
o server applications utilize high aggregate network bandwidth,
o and scientific applications typically utilize high processing and memory system performance.
It is important to understand each of these
performance bottlenecks.
Implicit Parallelism: Trends in Microprocessor Architectures
Microprocessor clock speeds have recorded
impressive gains over the past two decades
(two to three orders of magnitude).
 
Higher levels of device integration have
made available a large number of
transistors.
How best to utilize these resources?

Current processors use these resources in
multiple functional units and execute
multiple instructions in the same cycle.
 
The precise manner in which these
instructions are selected and executed
provides remarkable diversity in
architectures.
 
The compiler incurs increased overhead in
finding out parallelism from the code if it
is implicit.
Current Generation Computer Systems
1 BlueGene/L (IBM / LLNL)
2 Purple (IBM / LLNL)
3 Q (HP / LANL)
4 Red Storm (Cray / SNL)
5 X-1 (Cray / ORNL)
6 Earth Simulator (NEC / Japan)
7 Linux Clusters (Various Developers / Various Sites)
8 System X (Apple / Virginia Tech)
9 Columbia (SGI / NASA)
10 MareNostrum (IBM / Spain)
Limitations of Memory System Performance
 
The memory system (and not processor speed) is
often the bottleneck for many applications.

Memory system performance is largely affected by
two parameters, latency and bandwidth.

Latency is the time from the issue of a memory
request to the time the data is available at the
processor.

Bandwidth is the rate at which data can be
transferred between the processor and the memory
system.
Memory System Performance: Bandwidth and Latency
Consider the example of a fire-hose. If the water
comes out of the hose two seconds after the
hydrant is turned on, the latency of the system is
two seconds.

Once the water starts flowing, if the hydrant
delivers water at the rate of 5 gallons/second, the
bandwidth of the system is 5 gallons/second.

If you want immediate response from the hydrant,
it is important to reduce latency.

If you want to fight big fires, you want high
bandwidth.
Memory Latency: An Example
Consider a processor operating at 1 GHz (1 ns
clock) connected to a DRAM with a latency of 100 ns
(without caches).
Assume that the processor has two multiply-add
units and is capable of executing four instructions in
each cycle of 1 ns.
The following observations can be made:
The peak processor rating is 4 GFLOPS.
Since the memory latency is 100 cycles and block
size is one word, every time a memory request is
made, the processor must wait 100 cycles before it
can process the data.
Memory Latency: An Example
On the above architecture, consider the problem of
computing a dot-product of two vectors.
A dot-product computation performs one multiply-
add on a single pair of vector elements, i.e., each
floating point operation requires one data fetch.
It follows that the peak speed of this computation is
limited to one floating point operation every 100 ns,
or a speed of 10 MFLOPS, a very small fraction of
the peak processor rating!
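The loop being analyzed can be sketched as follows. A minimal sketch, not tuned code: with a block size of one word, each multiply-add must wait on fresh operand fetches, so the 100 ns memory latency, not the 4 GFLOPS peak, bounds its speed.

```cpp
#include <cstddef>

// Dot product: one multiply-add per pair of elements, i.e., every floating
// point operation is paired with a fetch of fresh operands from memory.
double dot_product(const double* a, const double* b, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];   // with 1-word blocks, each iteration waits on DRAM
    return sum;
}
```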
FLOPS
Floating point operations per second.
Floating point operations are heavily used by
science and engineering computational
applications. This is one of the measures of
performance for a system, abbreviated as
flops, with prefixes:
mega 10^6 megaflops MFLOPS
giga 10^9 gigaflops GFLOPS
tera 10^12 teraflops TFLOPS
peta 10^15 petaflops PFLOPS
MFLOPS
Megaflops, also called millions of floating point operations
per second. The MFLOPS rating is computed by dividing the
number of operations by the execution time:
Ex. multiplication of nxn matrices
The number of operations is 2n^3 - n^2:
n^2 inner products, with n multiplications and n-1 additions
in each product.
If a 100x100 matrix multiplication is computed in 0.35
seconds, then
MFLOPS rate = (2(100)^3 - 100^2) / 0.35 ≈ 5,685,714 ops per
second ≈ 5.69 MFLOPS
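The rating can be replayed in code. The operation count below, n^2 * (2n - 1), follows directly from the decomposition stated above (n^2 inner products, each with n multiplications and n-1 additions); the 0.35 s execution time is the slide's example figure.

```cpp
// Operation count for n x n matrix multiplication:
// n*n inner products, each with n multiplications and (n - 1) additions.
long long matmul_ops(long long n) {
    return n * n * (2 * n - 1);   // = 2n^3 - n^2
}

// MFLOPS rating = number of operations / execution time / 10^6.
double mflops(long long ops, double seconds) {
    return ops / seconds / 1.0e6;
}
```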
Improving Effective Memory Latency Using Caches
Caches are small and fast memory elements
between the processor and DRAM.
Their important property is low-latency, high-bandwidth
storage.
If a piece of data is repeatedly used, the effective
latency of this memory system can be reduced by
the cache.
The fraction of data references satisfied by the
cache is called the cache hit ratio of the computation
on the system.
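A hit ratio h blends the two latencies into an effective latency seen by the processor. The numbers in the usage below (h = 0.9, 1 ns cache, 100 ns DRAM) are illustrative assumptions, not figures from the slides.

```cpp
// Effective memory latency with one cache level: hits are served at cache
// latency, misses at DRAM latency, weighted by the hit ratio.
double effective_latency_ns(double hit_ratio, double cache_ns, double dram_ns) {
    return hit_ratio * cache_ns + (1.0 - hit_ratio) * dram_ns;
}
```

With the assumed figures this evaluates to about 10.9 ns, an order of magnitude below the raw DRAM latency.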
Impact of Caches: Example
Consider the architecture from the previous example.
In this case, we introduce a cache of size 32 KB with
a latency of 1 ns or one cycle.
We use this setup to multiply two matrices A and B of
dimensions 32 × 32. We have carefully chosen these
numbers so that the cache is large enough to store
matrices A and B, as well as the result matrix C.
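A quick check of the claim that the cache is large enough. The 8-byte (double-precision) word size is an assumption; the slides do not state it.

```cpp
// Total footprint of the three n x n matrices (A, B, and the result C),
// assuming 8-byte double-precision words.
long long matmul_footprint_bytes(long long n, long long word_bytes) {
    return 3 * n * n * word_bytes;
}
```

For n = 32 and 8-byte words the footprint is 24 KB, comfortably under the 32 KB cache.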
Impact of Caches: Example (continued)
Observations:
• Fetching the two matrices into the cache
corresponds to fetching 2K words, which takes
approximately 200 μs (2000 × 100 ns).
• Multiplying two n × n matrices takes 2n^3 operations.
For our problem, this corresponds to 64K operations,
which can be performed in 16K cycles (or 16 μs) at
four instructions per cycle.
• The total time for the computation is
approximately
= time for load/store operations + time for the computation itself
= (200 + 16) μs = 216 μs
• This corresponds to a peak computation
rate of 64K FLOP / 216 μs, or 303 MFLOPS.
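The estimate above can be replayed directly, using the slide's rounded figures (2000 words at 100 ns each, 64K operations at four per 1 ns cycle):

```cpp
// Replaying the slide's estimate: load/store time plus compute time,
// then the resulting computation rate in MFLOPS.
double cached_matmul_mflops() {
    double load_us    = 2000.0 * 100.0 * 1.0e-3;  // ~200 us of load/store time
    double compute_us = 65536.0 / 4.0 * 1.0e-3;   // 16K cycles, ~16 us of compute
    return 65536.0 / (load_us + compute_us);      // FLOP per microsecond = MFLOPS
}
```

This evaluates to approximately 303 MFLOPS, far above the 10 MFLOPS of the uncached dot-product but still well below the 4 GFLOPS peak.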
Theoretical Peak MFLOPS
How many numerical operations the machine can perform in one
second without performing other tasks.
Given the time required to compute one operation, how many of
them can be performed in one second.
Ex. If 8 cycles are needed to perform one numeric multiplication
and the cycle time is 20 ns, then it requires 160 ns to perform
one multiplication. For one second:
(1 multiplication / 160 ns) × (1,000,000,000 ns / 1 sec)
= 6.25 x 10^6 multiplications / second
= 6.25 MFLOPS
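The same calculation in code, with the slide's figures (8 cycles per multiplication, 20 ns cycle time):

```cpp
// Theoretical peak rate: one operation every (cycles_per_op x cycle_ns) ns,
// sustained for a full second with no other work.
double peak_mflops(double cycles_per_op, double cycle_ns) {
    double ns_per_op = cycles_per_op * cycle_ns;  // 8 x 20 ns = 160 ns
    return 1.0e9 / ns_per_op / 1.0e6;             // ops per second, in millions
}
```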
Impact of Caches
• In our example, we had O(n^2) data accesses and
O(n^3) computation, which is a desirable situation
for caches.
Memory hierarchy (capacity and access time increase toward the
lower levels; cost per bit increases toward the upper levels):
Level 0: Registers in CPU
Level 1: Cache (SRAM)
Level 2: Main Memory (DRAM)
Level 3: Disk Storage (solid state, magnetic)
Level 4: Backup Storage (magnetic tape, optical disk)
Figure: Inclusion property and data transfer between adjacent
levels of the memory hierarchy.
CPU registers ↔ M1 (cache): access by word (4, 8 bytes) from a
cache block of 32 bytes, e.g., block a.
M1 ↔ M2 (main memory): access by block (32 bytes) from a memory
page of 32 blocks, e.g., block b from page B.
M2 ↔ M3 (disk storage): access by page (1 KByte) from a file
segment, e.g., segment F.
M4: magnetic tape unit (backup storage).
Inclusion, Coherence, and Locality Properties
Information stored in a memory hierarchy satisfies
three properties: inclusion, coherence, and locality.
The cache memory at the innermost level M1 directly
communicates with the registers of the CPU.
All information words are stored at the outermost
level Mn.
The set inclusion relationship is:
M1 ⊂ M2 ⊂ M3 ⊂ . . . . ⊂ Mn
This shows that all information items are
originally stored in the outermost level Mn.
At the time of processing, a subset of Mn is
copied into Mn-1; similarly, a subset of Mn-1 is
copied into Mn-2, and so on.
If an information word is found in Mi, then copies
of the same word can also be found in all upper
levels Mi+1, Mi+2, . . . ., Mn.
Whereas a word found in Mi+1 may not be found
in Mi.
A word miss in Mi indicates that it is also missing
from all lower levels Mi-1, Mi-2, . . . ., M1.
The highest level is the backup storage, where
everything can be found.
Information transfer between the CPU and cache is
in terms of words (8, 16 bytes).
The cache (M1) is divided into cache blocks,
called cache lines.
Each block is 32 bytes (e.g., blocks a and b in the figure).
Main memory (M2) is divided into pages
(4 KBytes each).
Each page has 128 blocks.
Pages are the unit of transfer between main
memory and disk storage.
[Transfers between adjacent levels: CPU registers ↔ cache,
per-word transfer of 4-8 bytes; cache ↔ main memory, 32-byte
block transfer; main memory ↔ disk storage, 4 KByte page
transfer; disk ↔ backup, file transfer of 512 KBytes.]
Scattered pages are organized in the form of
segments in disk memory. The size of a segment
varies depending on user needs.
Coherence Property:
Copies of the same information item at successive levels
must be consistent.
If a word is modified in the cache, it must be updated
at all higher levels; the hierarchy should be maintained
in such a way.
Frequently used information is often available at a
lower level to minimize access time.
Coherence is maintained in two ways:
i. Write-through (WT), which demands an immediate update
in Mi+1 when a word in Mi is updated.
ii. Write-back (WB), which delays the update in Mi+1
when the word in Mi is modified.
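A toy model (not from the slides) makes the difference concrete: write-through generates one next-level update per store, while write-back generates one only when a dirty block is evicted.

```cpp
#include <map>

// Toy model of one cache level M_i backed by a next level M_{i+1}.
struct Level {
    std::map<int, int> data;       // address -> value held at this level
    std::map<int, bool> dirty;     // write-back: modified but not yet propagated
    int next_level_writes = 0;     // updates that reached M_{i+1}
};

// Write-through: update M_i and M_{i+1} on every store.
void write_through(Level& m, int addr, int value) {
    m.data[addr] = value;
    ++m.next_level_writes;         // immediate update of the next level
}

// Write-back: update M_i only and mark the block dirty.
void write_back(Level& m, int addr, int value) {
    m.data[addr] = value;
    m.dirty[addr] = true;          // delayed update
}

// On eviction, a dirty block is finally propagated to M_{i+1}.
void evict(Level& m, int addr) {
    if (m.dirty[addr]) {
        ++m.next_level_writes;
        m.dirty[addr] = false;
    }
}
```

Two stores to the same word cost two next-level writes under WT but only one under WB, at the price of a window during which M_{i+1} is stale.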
Locality of Reference Property:
A particular portion of the memory address space is
accessed frequently by a program during its
execution in any time window, e.g., an innermost loop.
There are three dimensions of the locality property:
1. Temporal locality
2. Spatial locality
3. Sequential locality
1. Temporal Locality
Items referred to recently are likely to be referenced
in the near future by loops, stacks, temporary
variables, subroutines, etc.
Once a loop is entered or a subroutine is called,
a small code segment is referenced many times
repeatedly.
Thus temporal locality is clustered in recently used
areas.
2. Spatial Locality
This is the tendency of a process to access items
whose addresses are near one another; e.g., tables
and arrays involve access to areas which are
clustered together.
Program segments containing subroutines and
macros are kept together in the neighbourhood
of memory space.
3. Sequential Locality
Execution of instructions in a program follows a
sequential order unless out-of-order
instructions are encountered.
The ratio of in-order execution to out-of-order
execution is generally 5 to 1.
Hit Ratio
If the desired information is found in Mi — hit
If the desired information is not found in Mi — miss
The hit ratio hi at Mi is the probability that the
information item will be found in Mi.
The miss ratio is defined as (1 - hi).
The hit ratio of successive levels is a function of
memory capacity, program behaviour, etc.
Assume h0 = 0 and hn = 1.
Access Frequency
The access frequency to Mi is fi, where
fi = (1-h1)(1-h2) . . . . (1-hi-1) hi
is the probability of a successful access to Mi
when there are i-1 misses at the lower levels
and a hit at Mi.
Note that Σ(i=1 to n) fi = 1, and f1 = h1.
Access frequency decreases rapidly due to the
locality property, i.e.,
f1 ≫ f2 ≫ . . . . ≫ fn
which also indicates that inner memory levels
are accessed more frequently than outer levels.
Effective Access Time
Misses are called block misses at the cache and
page faults at main memory.
The time penalty for a page fault is much larger
than for a block miss.
A cache miss is 2 to 4 times as costly as a cache hit,
but a page fault is 1000 to 10000 times as costly as a
page hit.
The effective access time
Teff = Σ(i=1 to n) fi·ti
     = h1·t1 + (1-h1)h2·t2 + (1-h1)(1-h2)h3·t3 + . . . . + (1-h1)(1-h2)...(1-hn-1)·tn
Still, the effective access time depends on program
behaviour and memory hierarchy design.
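The frequencies and Teff can be evaluated numerically. The sketch below folds the product of (1-hj) terms into a running miss probability; the hit ratios and access times used in the check are illustrative, chosen to match the three-level design example later in the notes.

```cpp
#include <cstddef>
#include <vector>

// f_i = (1-h_1)(1-h_2)...(1-h_{i-1}) * h_i  and  Teff = sum of f_i * t_i.
// The last entry of h should be 1 (h_n = 1) so the frequencies sum to 1.
double effective_access_time(const std::vector<double>& h,
                             const std::vector<double>& t) {
    double miss_so_far = 1.0;   // running product of (1 - h_j) over lower levels
    double teff = 0.0;
    for (std::size_t i = 0; i < h.size(); ++i) {
        teff += miss_so_far * h[i] * t[i];   // f_i * t_i
        miss_so_far *= (1.0 - h[i]);
    }
    return teff;
}
```

With h = {0.98, 0.99, 1.0} and t = {25 ns, 1250 ns, 4 ms} this evaluates to roughly 849 ns, dominated by the rare but very slow disk accesses.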
Hierarchical Optimization
Total cost: Ctotal = Σ(i=1 to n) Ci·si
Since cost C1 > C2 > . . . . > Cn,
we need to select sizes (capacities)
s1 < s2 < . . . . < sn
The memory hierarchy should result in a Teff
(effective access time) close to t1 of M1.
The effective access time
Teff = Σ(i=1 to n) fi·ti
The optimized effective access time should be
minimized and close to t1.
Ctotal must be less than C0,
where C0 is the ceiling on the total cost.
There are tradeoffs at all levels, and this represents
a non-deterministic problem (LPP).
Example: Design a three-level memory hierarchy for the given
parameters.
Memory level   Access time    Capacity          Cost / KByte
Cache          t1 = 25 ns     s1 = 512 KBytes   c1 = $0.12
Main memory    t2 = unknown   s2 = 32 MBytes    c2 = $0.02
Disk           t3 = 4 ms      s3 = unknown      c3 = $0.00002
The goal is to achieve an effective access time t = 850 ns
with a cache hit ratio h1 = 0.98 and h2 = 0.99 in main memory.
C0 is $1500.
Effective access time
t = h1·t1 + (1-h1)h2·t2 + (1-h1)(1-h2)h3·t3
850x10^-9 = 0.98x25x10^-9 + (1-0.98)x0.99xt2 + (1-0.98)x(1-0.99)x1x4x10^-3
t2 = 25.5 ns / 0.0198 ≈ 1288 ns
1500 = 0.12x512 + 0.02x32x10^3 + 0.00002xs3
s3 ≈ 40 GBytes max
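The two constraints of the example can be solved numerically. The sketch below uses the slide's figures, and, following the slide's cost equation, treats 32 MBytes as 32x10^3 KBytes.

```cpp
// Constraint 1: 850 = h1*t1 + (1-h1)*h2*t2 + (1-h1)*(1-h2)*h3*t3 (ns);
// solve for the unknown main-memory access time t2.
double solve_t2_ns() {
    const double h1 = 0.98, h2 = 0.99, h3 = 1.0;
    const double t1 = 25.0, t3 = 4.0e6;   // ns; t3 = 4 ms
    const double target = 850.0;          // ns
    return (target - h1 * t1 - (1.0 - h1) * (1.0 - h2) * h3 * t3)
           / ((1.0 - h1) * h2);
}

// Constraint 2: 1500 = c1*s1 + c2*s2 + c3*s3 ($); solve for the
// unknown disk capacity s3 in KBytes.
double solve_s3_kbytes() {
    const double c1 = 0.12, c2 = 0.02, c3 = 0.00002;  // $ per KByte
    const double s1 = 512.0, s2 = 32.0e3;             // KBytes
    const double budget = 1500.0;
    return (budget - c1 * s1 - c2 * s2) / c3;
}
```

This gives t2 ≈ 1288 ns and s3 ≈ 39.9x10^6 KBytes, i.e., about 40 GBytes of disk at most.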
Memory design implications
Temporal locality suggests the least recently used (LRU)
replacement algorithm.
Spatial locality determines the size of unit data transfer
between adjacent memory levels.
Sequential locality affects grain packing for optimal
scheduling. Prefetching is heavily used.
Impact of Memory Bandwidth

Memory bandwidth is determined by the
bandwidth of the memory bus as well as the
memory units.

Memory bandwidth can be improved by
increasing the size of memory blocks.
Consider the same setup as before, except in this
case, the block size is 4 words instead of 1 word.
We repeat the dot-product computation in this
scenario:
• Assuming that the vectors are laid out linearly in
memory, eight FLOPs (four multiply-adds) can be
performed in 200 cycles.
• This is because a single memory access fetches
four consecutive words in the vector.
• Therefore, two accesses can fetch four elements
of each of the vectors. This corresponds to 1 FLOP
every 25 ns, for a peak speed of 40 MFLOPS.
• It is important to note that increasing the block size
does not change the latency of the system.
• Physically, this scenario can be viewed as a wide
data bus (4 words or 128 bits) connected to multiple
memory banks.
• In practice, such wide buses are expensive to
construct.
• In a more practical system, consecutive words are
sent on the memory bus on subsequent bus cycles
after the first word is retrieved.
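The peak-speed arithmetic for the two block sizes can be sketched as follows; the 100-cycle (100 ns) block fetch and one multiply-add per element pair are the slide's assumptions.

```cpp
// Peak dot-product speed as a function of block size: two block fetches
// (one per vector, 100 ns each) deliver words_per_block element pairs,
// each contributing one multiply-add (2 FLOPs).
double dot_product_mflops(int words_per_block) {
    double fetch_ns = 2.0 * 100.0;           // two block fetches
    double flops = 2.0 * words_per_block;    // multiply-adds on fetched pairs
    return flops / fetch_ns * 1.0e3;         // FLOPs per ns -> MFLOPS
}
```

One-word blocks reproduce the earlier 10 MFLOPS figure; four-word blocks give 40 MFLOPS.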
Data access patterns are one of the factors
which can be used to improve performance.
In spatial locality of reference, consecutive
data words in memory are used by
successive instructions.
If we take a data-layout centric view,
computations must be reordered to enhance
spatial locality of reference.
Consider the following code fragment:
for (i = 0; i < 1000; i++)
  column_sum[i] = 0.0;
for (i = 0; i < 1000; i++)
  for (j = 0; j < 1000; j++)
    column_sum[i] += b[j][i];
The code fragment sums columns of the matrix b into
a vector column_sum.
Example for quick reference of the data layout:
int a[2][3];
for (int i = 0; i < 2; i++) {
  for (int j = 0; j < 3; j++)
    cout << '\t' << &a[i][j];
  cout << endl;
}
-----------------------------------------------------------
0x7fff9987c700  0x7fff9987c704  0x7fff9987c708
0x7fff9987c70c  0x7fff9987c710  0x7fff9987c714
We can fix the above code as follows:
for (i = 0; i < 1000; i++)
  column_sum[i] = 0.0;
for (j = 0; j < 1000; j++)
  for (i = 0; i < 1000; i++)
    column_sum[i] += b[j][i];
In this case, the matrix is traversed in row order and
performance can be expected to be significantly better.
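The two loop orders compute identical results; only the traversal order of b differs, with the j-outer version walking consecutive addresses as in the printout above. A small equivalence check, using std::vector and a 2 × 2 matrix for brevity:

```cpp
#include <cstddef>
#include <vector>

// i outer: walks b column by column (stride-n accesses, cache-unfriendly).
std::vector<double> col_sum_i_outer(const std::vector<std::vector<double>>& b) {
    std::size_t n = b.size();
    std::vector<double> s(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            s[i] += b[j][i];
    return s;
}

// j outer: walks b row by row (unit-stride accesses, cache-friendly).
std::vector<double> col_sum_j_outer(const std::vector<std::vector<double>>& b) {
    std::size_t n = b.size();
    std::vector<double> s(n, 0.0);
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            s[i] += b[j][i];
    return s;
}
```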