Pipelines in Computer Architecture

OOO Pipelines

Smruti R. Sarangi

Contents

•

In-order Pipelines

•

Out-of-order Pipelines: Motivation

•

Out-of-order Pipelines: Basics

•

Branch Prediction

Pipelines

•

What do we know up till now:

In-order Pipelines

Instructions enter the pipeline in program order

Inst.

Fetch

Operand

Fetch

Execute

Memory

Access

Register

Writeback

5-Stage Pipeline

Problems with Inorder Pipelines

•

Hazards

•

Structural Hazards



 Two instructions vie for the same resource

NOT possible in simple 5 stage pipelines

•

Data Hazards



 Producer instruction is not ready when the consumer needs

the data

•

Control Hazards



 Instructions are fetched from the wrong path of the

branch

Inst.

Fetch

Operand

Fetch

Execute

Memory

Access

Register

Writeback

Data Hazards in In-order Pipelines

Instruction

Fetch

Operand

Fetch

Execute

Memory

Access

Register

Writeback

ld r4, 4[r0]

add r5, r4, 1

Cycle N

Cycle N+1

Need data

now

Earliest it can

be generated

Hazard

Solution: Stall the Pipeline

Instruction

Fetch

Operand

Fetch

Execute

Memory

Access

Register

Writeback

ld r4, 4[r0]

add r5, r4, 1

Cycle N

Cycle N+1

ld r4, 4[r0]

Cycle N+2

Control Hazards

Inst.

Fetch

Operand

Fetch

Execute

Memory

Access

Register

Writeback

beq .label

add r1, r2, r3

beq .label

sub r5, r6, r7

We know the

status of the

branch now

Cancel these

instructions

Two instruction slots are wasted

Two instruction slots are wasted

Contents

•

In-order Pipelines

•

Out-of-order Pipelines: Motivation

•

Out-of-order Pipelines: Basics

•

Branch Prediction

Performance Equation

•

Is Computer A faster that computer B

•

Wrong Answers:

•

More is the clock speed, faster is the computer

•

More is the RAM, faster is the computer

•

What does it mean for computer A to be faster than computer B

•

Short Answer:

NOTHING

Performance is always with respect to a program. You can say that a certain

program runs faster on computer A as compared to computer B.

Performance Equation - II

•

Let us loosely refer to the reciprocal of the time per program as the

performance

So, what does performance depend on …

•

#instructions in the program

•

Depends on the compiler

•

Frequency

•

Depends on the technology and the architecture

•

IPC

•

Depends on the architecture, and the compiler

How to improve performance?

•

There are

factors

•

IPC, instructions, and frequency

•

#instructions is dependent on the compiler



 not on the architecture

•

Let us look at

IPC

and

frequency

•

IPC

•

What is the IPC of an in-order pipeline?

1 if there are no stalls, otherwise < 1

What about frequency?

•

What is frequency dependent on …

•

 Frequency = 1 / clock period

•

Clock Period:

•

pipeline stage

is expected to take 1 clock cycle

•

Clock period = maximum latency of the pipeline stages

•

How to reduce the clock period

•

Make each stage of the pipeline

smaller

by

increasing

 the number of pipeline

stages

•

Use faster transistors

Limits to Increasing Frequency

•

Assume that we have the fastest possible transistors

•

Can we increase the frequency to 100 GHz

Reasons

Limits to increasing frequency - II

•

What does it mean to have a very high frequency?

•

Before answering keep these facts in mind:

Thumb

Rule



 power



 frequency

Thermo-

dynamics



 power



 Temperature

We need to increase the number of pipeline stages



more hazards, more forwarding paths

How many pipeline stages can we have?

•

We are limited by the latch delay

•

Even with an infinite number of stages, the minimum clock period will

be equal to the latch delay

Logic

Latch

Latch

Logic

Latch

Few stages

More stages

Even more stages

Pipeline Stages vs IPC

•

CPI = 1/ IPC

CPI = CPI

ideal

 + stall_rate * stall_penalty

•

The stall rate will remain more or less constant per instruction

 with the number of pipeline stages

•

The stall penalty (in terms of cycles) will however

increase

•

This will lead to a

loss

 in IPC

Summary: Why we cannot increase

frequency?

Since we cannot increase frequency …

Increase IPC

Increase IPC

•

Issue more instructions per cycle

•

2, 4, or 8 instructions

•

Make it a

superscalar

processor



 A processor that can execute

multiple instructions per cycle

In-order Superscalar Processor

•

Have multiple pipelines.

Inst.

Fetch

Operand

Fetch

Execute

Memory

Access

Register

Writeback

Inst.

Fetch

Operand

Fetch

Execute

Memory

Access

Register

Writeback

Inst.

Fetch

Operand

Fetch

Execute

Memory

Access

Register

Writeback

Instruction i

Instruction i+1

Instruction i+2

In-order Superscalar Processor - II

•

There can be

dependences

 between instructions

•

Have O(n

) forwarding paths for a n-issue processor

•

Complicated logic for detecting dependences, hazards, and

forwarding

•

Still might not be

enough

...

•

To get the peak IPC (= n) in an n-issue pipeline, we need to ensure

that there are no stalls

•

There will be no stalls if there are no

taken branches

, and no

data

dependences

between instructions.

•

Programs typically do

not have

such long

sequences

 of instructions

without dependences

What to do ...

•

Don’t follow program order

Consider

Too many dependences

mov r1, 1

add r3, r1, r2

add r4, r3, r2

mov r5, 1

add r6, r5, 1

add r8, r7, r6

Execute out of order

mov r1, 1

add r3, r1, r2

add r4, r3, r2

mov r5, 1

add r6, r5, 1

add r8, r7, r6

Execute on a 2-issue OOO processor

Execution on a 2-issue OOO Processor

mov r1, 1

add r3, r1, r2

add r4, r3, r2

mov r5, 1

add r6, r5, 1

add r8, r7, r6

cycle 1

cycle 2

cycle 3

issue slot 1

issue slot 2

Basic OOO idea

Contents

•

In-order Pipelines

•

Out-of-order Pipelines: Motivation

•

Out-of-order Pipelines: Basics

•

Branch Prediction

Revisit the Example

mov r1, 1

add r3, r1, r2

add r4, r3, r2

mov r5, 1

add r6, r5, 1

add r8, r7, r6

Pool of Instructions

Issue ready and

mutually independent

instructions

Pool of Instructions

•

Needs to be

large enough

such that the requisite number of mutually

independent instructions can be found

•

Typical instruction pool sizes: 64 to 128

•

How do we create a large pool of instructions in a program with

branches? We need to be sure that the instructions are on the

 correct

path

for (i = 1; i < m; i++) {

for (j = 1; j < i; j ++ ) {

if (j %2 == 0) continue;

....

Problems with creating an Instruction Pool

Typically 1 in 5 instructions is a

branch

Predict the direction of the

branches, and their targets

Dependences between Instructions

•

Program Order Dependence

mov r1, 1

mov r2, 2

•

One instruction appears after the other in program order

The program order is the order of instructions that is perceived by a

single cycle in-order processor executing the program.

Data Dependences

•

RAW



 Read after Write Dependence (True dependence)

mov r1, 1

      add r3, r1, r2

•

It is a producer-consumer dependence.

•

 The earlier instruction produces a value, and the later instruction

reads it.

Data Dependences - II

•

WAW



 Write after Write Dependence (Output dependence)

mov r1, 1

      add r1, r4, r2

•

Two instructions write to the same location

•

The later instruction needs to take effect after the former

Data Dependences - III

•

WAR



 Write after Read Dependence (Anti dependence)

add r1, r2, r3

add r2, r5, r6

•

Earlier instruction reads, later instruction writes

•

The later instruction needs to execute after the earlier instruction has

read its values

Control Dependences

•

The

add

 instruction is

control dependent

on the

branch

beq

instruction

•

If the branch is taken then only the

add

 instruction will

execute

, not

otherwise

beq .label

.....

.label

add r1, r2, r3

Basic Results

In-order processors respect the program order dependences. Thus,

they automatically respect all data and control dependences.

OOO processors respect only data and control dependences.

Can output and anti dependences be removed?

•

Don’t you think that these dependences are there because

we have a finite number of registers.

•

What if we had an infinite number of registers?

mov r1, 1

      add r1, r4, r2

add r1, r2, r3

add r2, r5, r6

Solution: Infinite number of registers

mov r1, 1

add  r1, r2, r3

add  r4, r1, 1

mov r2, 5

add r6, r2, r8

mov r1, 8

add r9, r1, r2

mov v1, 1

add  v11, v2, v3

add  v4, v11, 1

mov v12, 5

add v6, v12, v8

mov v21, 8

add v9, v21, v12

Code with architectural registers

Code with virtual registers

Renaming

Program with real

(architectural) registers

Program with virtual

registers

RAW dependences

WAR dependences

WAW dependences

RAW dependences

Higher instruction level parallelism (ILP)

Where are we now ...

Fetch + Decode +

Rename + Reg. Read

Memory

Pool of

Instructions

Execution Units

Issue

Write back results

Issue with writeback

To an outsider should it matter if the processor is in-order or OOO

Answer is

NO

Processor

Instructions

Register File

Memory

Update State

Assume that there is an exception or

interrupt

•

Languages like Java have dedicated functions that are called if there is a

divide-by-zero in the code.

•

The question is:

•

Would the sub instruction have executed when we enter the exception handler

mov r4, 10

mov r2, 0

div r3, r1, r2

sub r4, r4, 1

Divide by 0

Divide by zero

handler

Precise Exceptions

•

Assume that the exception handler decides to do nothing and

return

back

•

After this the

sub

instruction should be executed

•

This is exactly what will happen in an

in-order processor

•

In an OOO processor there is a possibility that the

sub

inst. can

execute out of order

Regular Instructions

÷ by 0

Exception

handler

sub

inst.

Precise Exceptions - II

•

To an external observer

•

The execution should always be correct

•

Even in the presence of interrupts and exceptions

Regular Instructions

÷ by 0

Exception

handler

sub

inst.

Regular Instructions

÷ by 0

Exception

handler

sub

inst.

Correct

Wrong

Precise Exceptions - III

•

We thus need

precise exceptions

•

Assume that the dynamic instructions in a program (ordered in

program order

) are: ins

, ins

, ins

 ... ins

•

Assume that the processor starts the exception/interrupt

handler

after it has just finished writing the

results

 of instruction: ins

•

Then instructions: ins

 ... ins

 should have executed

completely

and

written their

results

 to the memory/register file

•

AND, ins

k+1

 onwards should appear to not have started their

execution

at all

•

Such an exception or interrupt is

precise

Precise Exceptions in OOO Processor

•

Now, in an OOO processor how can exceptions be

precise

•

We need a mechanism to ensure that instructions appear to execute

in order

to not an external observer

•

Anything can happen inside the processor

•

But an external observer should not be able to make out if the processor is in-

order or OOO

Contents

•

In-order Pipelines

•

Out-of-order Pipelines: Motivation

•

Out-of-order Pipelines: Basics

•

Branch Prediction

Timing

Fetch i

th

 instruction

Fetch (i+1)

th

 instruction

Fetch (i+2)

th

 instruction

Predict (i+1)

th

 instruction

Predict (i+2)

th

 instruction

Predict (i+3)

th

 instruction

There is no time to look at an instruction

Branch Prediction

•

There are several things to predict:

1.

Predict if an instruction is a branch or not

2.

If it is a branch, predict its outcome

3.

If the branch is taken, predict its target

•

General structure of a predictor

Predictor

Program Counter (PC)

Prediction

Is an instruction a branch or not?

•

Insights

•

Given a PC, the status of the instruction (branch or not) does not change

•

Can we use this information?

•

Approach

•

The last time that we had seen a branch

remember

 its PC

•

Next time we see a PC,

check

 if we have seen it before

•

Also remember the type of the branch:

•

Unconditional

•

Conditional

•

Function call

•

Return

Make a structure in hardware to remember ...

PC

32 - n

Type of the instruction

 entries

Instruction Status

Table (IST)

Basic Method of Operation

•

When we see a branch instruction

•

Access its corresponding entry in the table

•

Record the branch type

•

When we see an instruction:

•

Check the corresponding entry of the table

•

Read the status of the instruction

Instruction Status Table (IST)

•

If we choose the least significant 10 bits of the PC address, we will

have 1024 entries in the IST

•

For a 32 bit PC, we cannot have a 2

 entry

table

•

Too big

•

Too slow

•

Too much of power

•

Too much of area

•

We need to manage with a

smaller

 table

Destructive Interference

•

Two instructions

can map to the

same entry

Instruction A

Instruction B

Defined as

destructive

inteference

Branch Aliasing

•

Possible for a

branch

 and a

non-branch

 instruction to map to the

same entry

•

Possible for two

branches

 to map to the same entry

•

Possible for two

non-branches

 to map to the same entry

•

Need for

disambiguation

•

Augment each entry of the IST

Disambiguation Between Addresses

PC

32 - n

Type of the instruction

 entries

Branch

type

32 – n

Disambiguation

•

We can keep the status of only

branches

 in the IST

•

However, for every

instruction

 we need to check the IST

•

If there is no entry, then we need to predict that it is not a branch

•

 Can be

wrong

•

What to do



•

Why does an IST work?

•

Mainly because most pieces of code exhibit

temporal locality

•

In a given window of time

instructions

 tend to repeat themselves

Predict the Direction of the Branch

•

If a branch is unconditional



 no need to predict (taken)

•

If a branch is a call/return



 no need to predict (taken)

•

If a branch is conditional



 need to predict

Predictor

Program Counter (PC)

Taken or Not Taken

Example Code

void foo() {

for

 (i=0; i < 5; i++) {

...

Assembly

.foo:

mov   r0, 0

.loop:

cmp   r0, 5

beq   .exit

add    r0, r0, 1

b .loop

Predict

Simple Bimodal Predictor

PC

32 - n

Outcome of the branch

 entries

Predictor Table

Bimodal Predictor - II

•

Each

entry

 saves the last

recorded

 outcome of the

branch

•

taken or not-taken

•

What is the problem:

•

The

beq

 instruction is evaluated 6 times

•

Assume the default is

not-taken

•

First 5 times



 correctly predicted

not-taken

•

th

 and last time



 incorrectly predicted to be

taken

•

Now assume we call the function foo() once again

•

The first time



 we will predict

taken

. This is wrong

•

Again the last time will be wrong

•

Correct prediction rate: 2/6

Increasing Accuracy

•

What is the

problem

•

The

beq

 instruction is fundamentally biased towards:

not-taken

•

One exception at the end of the

for

loop cannot change its inherent

behavior

•

Let us add some

hysteresis

•

Instead of having a single bit in each entry, have a 2 bit counter

Saturating Counter

Increment

Decrement

Saturating Counters

•

Algorithm for updates:

•

If a branch is

taken

increment

the saturating counter

•

If it is

not taken

decrement

the counter

•

For prediction:

•

00 and 01



 not taken

•

10 and 11



 taken

Predict: Not Taken

Predict: Taken

Bimodal Predictor with Saturating Counters

PC

32 - n

Outcome of the branch

 entries

Predictor Table

2-bit saturating

counter

Logic

Will it help?

•

st

 Time: Predict not-taken.

Correct

. Starts with 01. Moves to 00

•

nd

 Time: Predict not-taken.

Correct

. Remains at 00

•

...

•

th

 Time: Predict not-taken.

Wrong

. Starts with 00. Moves to 01

.foo:

mov   r0, 0

.loop:

cmp   r0, 5

beq   .exit

add    r0, r0, 1

b .loop

Misprediction rate: Down from 2/6 to 1/6

Can we do better?

void foo() {

for

 (i=0; i < 5; i++) {

...

if (i == 4) {

foobar();

Assembly

What if we had?

.foo:

mov   r0, 0

.loop:

cmp   r0, 5

beq   .exit

add    r0, r0, 1

cmp   r0, 4

bne    .loop

call     .foobar

b .loop

Global History Register (GHR)

•

Let us have a

shift register

that records the

history

 of the last

branches

•

We record: 1



 taken branch, 0



 not taken branch

•

Let us consider a 2-bit

GHR

•

We have two conditional branches in the running example

•

beq .exit

•

bne .loop

Status of the GHR

•

Beginning of 2

nd

 Iteration

•

•

Beginning of 4

th

 Iteration

•

•

Beginning of 5

th

 iteration

•

•

Beginning of 6

th

 iteration

•

.foo:

mov   r0, 0

.loop:

cmp   r0, 5

beq   .exit

add    r0, r0, 1

cmp   r0, 4

bne    .loop

call     .foobar

b .loop

Gap Predictor

PC

32 - n

Outcome of the branch

n+k

 entries

k-bit GHR

Logic

PHT: Pattern History Table

Gap Predictor

•

We create 2

 times more

entries

 in the predictor table

•

Captures

 the global branch history

•

In this case, we can remove the

misprediction

after the 5

th

 iteration

(6

th

 time the instruction

beq .exit

is evaluated)

•

The accuracy is expected to increase.

•

In general we can create different

combinations

of:

•

The PC bits and the GHR’s branch history

Another way of combining information:

GShare

PC

32 - n

Outcome of the branch

max(n,k)

 entries

k-bit GHR

Logic

XOR

Can we design a class of predictors?

•



 Global, P



 Pattern

•

All four combinations are possible

•

Gag

•

Gap

•

Pag

•

Pap

Gag Predictor

Outcome of the branch

 entries

k-bit GHR

Logic

Pag predictor

PC

Outcome of the branch

 entries

k-bit GHR

Logic

n1

k-bit GHR

Pap predictor

PC

32 - n

Outcome of the branch

n+k

 entries

k-bit GHR

Logic

n1

k-bit GHR

Tournament Predictor

•

Different predictors have different

accuracies

 for different

programs

•

How to know which predictor is best for which program?

•

What to do?

•

Answer: Use two

predictors

, and

choose

 between them

Predictor 1

Predictor 2

Choose

PC

n bits

Choose

Predictor 1

Predictor 2

Prediction

Operation of a Tournament Predictor

•

Prediction

•

Find the entry in the

chooser table

•

Choose the predictor

•

Use its prediction

•

Training

•

Train both the

predictors

•

Train the entry in the chooser array if we chose the wrong predictor

•

Modification: We can either use 1 bit in each entry of the

chooser

table or use a 2-bit

saturating counter

Other methods of prediction

•

Reduce

aliasing

•

Incorporate a few tag bits in each

entry

 of the predictor

•

Have multiple predictors for different

subsets

 of branches

•

Include a bias bit with every branch (its most likely direction), and just predict

if we need to agree with the bias bit or not (agree predictor)

•

Ensure that all entries of the PHT are uniformly used

•

Better

use

of bits

•

Separate

high confidence

and

low confidence

 branches. Dedicate more bits to

low confidence branches

•

Examples of other predictors:

•

Bi-mode, Agree, Skew, YAGS, TAGE

Prediction and Compression

•

Consider a sequence of

<PC,outcome>

pairs

•

Let’s say compress this sequence using a standard program: zip, rar

•

Is the prediction accuracy related to the compression ratio (size of

compressed file/ size of original file)

•

YES

•

Take a course on information theory

•

Read about the Fano’s inequality

•

See this

link

Predict the Branch Target

Predictor

PC of the taken branch

Target

Use the IST. Let us call it the Branch Target

Buffer (BTB)

PC

32 - n

Type of the instruction

and target

 entries

Branch

type

32 – n

target

Calls and Returns

foo

foobar

foobarbar

call

call

ret

ret

0xFC0

0xFF4

foobarbarbar

ret

call

0xBF8

How to predict return addresses?

Solution: Use a stack

Return address stack (RAS)

foo

foobar

foobarbar

call

call

ret

ret

0xFC0

0xFF4

foobarbarbar

ret

call

0xBF8

0xFC0

0xFF4

0xBF8

Stack

Summary: The Branch Prediction System

PC

BTB

Type & Target

Branch

Predictor

PC

RAS

Choose

the

next PC

next PC

Slide Note

Embed Share

Download

Explore the concepts of in-order and out-of-order pipelines, along with the basics of branch prediction. Learn about the challenges and solutions related to data hazards, control hazards, and performance equations in computer architecture. Discover how to optimize pipeline efficiency for faster computing processes.

cristophe Follow

Uploaded on Oct 05, 2024 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

Download Presentation

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript

OOO Pipelines Smruti R. Sarangi

Contents In-order Pipelines Out-of-order Pipelines: Motivation Out-of-order Pipelines: Basics Branch Prediction

Pipelines What do we know up till now: In-order Pipelines Instructions enter the pipeline in program order Register Writeback Inst. Fetch Operand Fetch Memory Access Execute 5-Stage Pipeline

Problems with Inorder Pipelines Register Writeback Inst. Fetch Operand Fetch Memory Access Execute Hazards Structural Hazards Two instructions vie for the same resource (NOT possible in simple 5 stage pipelines) Data Hazards Producer instruction is not ready when the consumer needs the data Control Hazards Instructions are fetched from the wrong path of the branch

Data Hazards in In-order Pipelines Register Writeback Instruction Fetch Operand Fetch Memory Access Execute add r5, r4, 1 ld r4, 4[r0] Need data now Earliest it can be generated Cycle N Cycle N+1 Hazard clock

Solution: Stall the Pipeline Register Writeback Instruction Fetch Operand Fetch Memory Access Execute ld r4, 4[r0] add r5, r4, 1 ld r4, 4[r0] Need data now Here is the data Cycle N Cycle N+1 Cycle N+2 clock

Control Hazards Register Writeback Inst. Fetch Operand Fetch Memory Access Execute beq .label add r1, r2, r3 sub r5, r6, r7 beq .label Cancel these instructions We know the status of the branch now Two instruction slots are wasted

Contents In-order Pipelines Out-of-order Pipelines: Motivation Out-of-order Pipelines: Basics Branch Prediction

Performance Equation Is Computer A faster that computer B Wrong Answers: More is the clock speed, faster is the computer More is the RAM, faster is the computer What does it mean for computer A to be faster than computer B Short Answer: NOTHING Performance is always with respect to a program. You can say that a certain program runs faster on computer A as compared to computer B.

Performance Equation - II #???????? #??????? = #???????? = = ??? ???? *#????? #?????? * #?????? #??????? #????? ??? ???? #?????/??????? #????? (assume just 1 program) Let us loosely refer to the reciprocal of the time per program as the performance

So, what does performance depend on #instructions in the program Depends on the compiler Frequency Depends on the technology and the architecture IPC Depends on the architecture, and the compiler

How to improve performance? There are 3 factors: IPC, instructions, and frequency #instructions is dependent on the compiler not on the architecture Let us look at IPC and frequency IPC What is the IPC of an in-order pipeline? 1 if there are no stalls, otherwise < 1

What about frequency? What is frequency dependent on Frequency = 1 / clock period Clock Period: 1 pipeline stage is expected to take 1 clock cycle Clock period = maximum latency of the pipeline stages How to reduce the clock period Make each stage of the pipeline smaller by increasing the number of pipeline stages Use faster transistors

Limits to Increasing Frequency Assume that we have the fastest possible transistors Can we increase the frequency to 100 GHz Reasons

Limits to increasing frequency - II What does it mean to have a very high frequency? Before answering keep these facts in mind: P power f frequency Thumb Rule ? ?3 1 P power T Temperature Thermo- dynamics 2 T ? We need to increase the number of pipeline stages more hazards, more forwarding paths 3

How many pipeline stages can we have? Logic Logic Latch Latch Latch Even more stages More stages Few stages We are limited by the latch delay Even with an infinite number of stages, the minimum clock period will be equal to the latch delay

Pipeline Stages vs IPC CPI = 1/ IPC CPI = CPIideal + stall_rate * stall_penalty The stall rate will remain more or less constant per instruction with the number of pipeline stages The stall penalty (in terms of cycles) will however increase This will lead to a loss in IPC

Summary: Why we cannot increase frequency? Power Temperature Effect of the Latch Delay Stall penalties will increase

Since we cannot increase frequency Increase IPC

Increase IPC Issue more instructions per cycle 2, 4, or 8 instructions Make it a superscalar processor A processor that can execute multiple instructions per cycle

In-order Superscalar Processor Have multiple pipelines. Register Writeback Inst. Fetch Operand Fetch Memory Access Instruction i Execute Register Writeback Inst. Fetch Operand Fetch Memory Access Instruction i+1 Execute Register Writeback Inst. Fetch Operand Fetch Memory Access Instruction i+2 Execute

In-order Superscalar Processor - II There can be dependences between instructions Have O(n2) forwarding paths for a n-issue processor Complicated logic for detecting dependences, hazards, and forwarding Still might not be enough ... To get the peak IPC (= n) in an n-issue pipeline, we need to ensure that there are no stalls There will be no stalls if there are no taken branches, and no data dependences between instructions. Programs typically do not have such long sequences of instructions without dependences

What to do ... Don t follow program order Consider mov r1, 1 add r3, r1, r2 add r4, r3, r2 mov r5, 1 add r6, r5, 1 add r8, r7, r6 Too many dependences

Execute out of order Execute on a 2-issue OOO processor mov r1, 1 add r3, r1, r2 add r4, r3, r2 mov r5, 1 add r6, r5, 1 add r8, r7, r6

Execution on a 2-issue OOO Processor issue slot 1 issue slot 2 cycle 1 cycle 2 cycle 3 mov r1, 1 add r3, r1, r2 add r4, r3, r2 mov r5, 1 add r6, r5, 1 add r8, r7, r6

Basic OOO idea Create a pool of instructions Find instructions that are mutually independent and have all their operands ready Execute them out of order

Contents In-order Pipelines Out-of-order Pipelines: Motivation Out-of-order Pipelines: Basics Branch Prediction

Revisit the Example Pool of Instructions mov r1, 1 add r3, r1, r2 add r4, r3, r2 mov r5, 1 add r8, r7, r6 add r6, r5, 1 Issue ready and mutually independent instructions

Pool of Instructions Needs to be large enough such that the requisite number of mutually independent instructions can be found Typical instruction pool sizes: 64 to 128 How do we create a large pool of instructions in a program with branches? We need to be sure that the instructions are on the correct path for (i = 1; i < m; i++) { for (j = 1; j < i; j ++ ) { if (j %2 == 0) continue; .... } }

Problems with creating an Instruction Pool Typically 1 in 5 instructions is a branch Predict the direction of the branches, and their targets

Dependences between Instructions Program Order Dependence mov r1, 1 mov r2, 2 One instruction appears after the other in program order The program order is the order of instructions that is perceived by a single cycle in-order processor executing the program.

Data Dependences RAW Read after Write Dependence (True dependence) mov r1, 1 add r3, r1, r2 It is a producer-consumer dependence. The earlier instruction produces a value, and the later instruction reads it.

Data Dependences - II WAW Write after Write Dependence (Output dependence) mov r1, 1 add r1, r4, r2 Two instructions write to the same location The later instruction needs to take effect after the former

Data Dependences - III WAR Write after Read Dependence (Anti dependence) add r1, r2, r3 add r2, r5, r6 Earlier instruction reads, later instruction writes The later instruction needs to execute after the earlier instruction has read its values

Control Dependences beq .label ..... .label add r1, r2, r3 The add instruction is control dependent on the branch(beq) instruction If the branch is taken then only the add instruction will execute, not otherwise

Basic Results In-order processors respect the program order dependences. Thus, they automatically respect all data and control dependences. OOO processors respect only data and control dependences.

Can output and anti dependences be removed? add r1, r2, r3 add r2, r5, r6 mov r1, 1 add r1, r4, r2 Don t you think that these dependences are there because we have a finite number of registers. What if we had an infinite number of registers?

Solution: Infinite number of registers mov r1, 1 add r1, r2, r3 add r4, r1, 1 mov r2, 5 add r6, r2, r8 mov r1, 8 add r9, r1, r2 mov v1, 1 add v11, v2, v3 add v4, v11, 1 mov v12, 5 add v6, v12, v8 mov v21, 8 add v9, v21, v12 Code with architectural registers Code with virtual registers

Renaming Program with virtual registers Program with real (architectural) registers RAW dependences RAW dependences WAR dependences WAW dependences Higher instruction level parallelism (ILP)

Where are we now ... Pool of Instructions Fetch + Decode + Rename + Reg. Read Issue Memory Execution Units Write back results

Issue with writeback Register File Update State Instructions Processor Memory To an outsider should it matter if the processor is in-order or OOO Answer is NO

Assume that there is an exception or interrupt Divide by 0 mov r4, 10 mov r2, 0 div r3, r1, r2 sub r4, r4, 1 Divide by zero handler Languages like Java have dedicated functions that are called if there is a divide-by-zero in the code. The question is: Would the sub instruction have executed when we enter the exception handler

Precise Exceptions sub inst. Exception handler by 0 Regular Instructions Assume that the exception handler decides to do nothing and return back After this the sub instruction should be executed This is exactly what will happen in an in-order processor In an OOO processor there is a possibility that the sub inst. can execute out of order

Precise Exceptions - II Correct sub inst. Exception handler by 0 Regular Instructions Wrong sub inst. Exception handler by 0 Regular Instructions To an external observer The execution should always be correct Even in the presence of interrupts and exceptions

Precise Exceptions - III We thus need precise exceptions Assume that the dynamic instructions in a program (ordered in program order) are: ins1, ins2, ins3 ... insn Assume that the processor starts the exception/interrupt handler after it has just finished writing the results of instruction: insk Then instructions: ins1 ... insk should have executed completely and written their results to the memory/register file AND, insk+1 onwards should appear to not have started their execution at all Such an exception or interrupt is precise

Precise Exceptions in OOO Processor Now, in an OOO processor how can exceptions be precise? We need a mechanism to ensure that instructions appear to execute in order to not an external observer Anything can happen inside the processor But an external observer should not be able to make out if the processor is in- order or OOO

Contents In-order Pipelines Out-of-order Pipelines: Motivation Out-of-order Pipelines: Basics Branch Prediction

Timing Fetch ith instruction Fetch (i+1)th instruction Predict (i+1)th instruction Fetch (i+2)th instruction Predict (i+2)th instruction Predict (i+3)th instruction There is no time to look at an instruction

Branch Prediction There are several things to predict: 1. Predict if an instruction is a branch or not 2. If it is a branch, predict its outcome 3. If the branch is taken, predict its target General structure of a predictor Program Counter (PC) Prediction Predictor

Is an instruction a branch or not? Insights Given a PC, the status of the instruction (branch or not) does not change Can we use this information? Approach The last time that we had seen a branch remember its PC Next time we see a PC, check if we have seen it before Also remember the type of the branch: Unconditional Conditional Function call Return

Pipelines in Computer Architecture

Download Presentation

Presentation Transcript

Related

More Related Content