Pipelines in Computer Architecture

OOO Pipelines
Smruti R. Sarangi
Contents
In-order Pipelines
Out-of-order Pipelines: Motivation
Out-of-order Pipelines: Basics
Branch Prediction
Pipelines
What do we know up till now:
In-order Pipelines
Instructions enter the pipeline in program order
Inst.
Fetch
Operand
Fetch
Execute
Memory
Access
Register
Writeback
 
5-Stage Pipeline
Problems with Inorder Pipelines
Hazards
Structural Hazards 
 Two instructions vie for the same resource
                     (
NOT possible in simple 5 stage pipelines
)
Data Hazards 
 Producer instruction is not ready when the consumer needs
   
the data
Control Hazards 
 Instructions are fetched from the wrong path of the
branch
Inst.
Fetch
Operand
Fetch
Execute
Memory
Access
Register
Writeback
 
Data Hazards in In-order Pipelines
Instruction
Fetch
Operand
Fetch
Execute
Memory
Access
Register
Writeback
 
ld r4, 4[r0]
add r5, r4, 1
Cycle N
Cycle N+1
Need data
now
Earliest it can
be generated
Hazard
Solution: Stall the Pipeline
Instruction
Fetch
Operand
Fetch
Execute
Memory
Access
Register
Writeback
 
ld r4, 4[r0]
add r5, r4, 1
Cycle N
Cycle N+1
ld r4, 4[r0]
Cycle N+2
Control Hazards
Inst.
Fetch
Operand
Fetch
Execute
Memory
Access
Register
Writeback
 
beq .label
add r1, r2, r3
beq .label
sub r5, r6, r7
We know the
status of the
branch now
Cancel these
instructions
Two instruction slots are wasted
Two instruction slots are wasted
Contents
In-order Pipelines
Out-of-order Pipelines: Motivation
Out-of-order Pipelines: Basics
Branch Prediction
Performance Equation
Is Computer A faster that computer B
Wrong Answers:
More is the clock speed, faster is the computer
More is the RAM, faster is the computer
What does it mean for computer A to be faster than computer B
Short Answer: 
NOTHING
Performance is always with respect to a program. You can say that a certain
program runs faster on computer A as compared to computer B.
Performance Equation - II
Let us loosely refer to the reciprocal of the time per program as the
performance
So, what does performance depend on …
#instructions in the program
Depends on the compiler
Frequency
Depends on the technology and the architecture
IPC
Depends on the architecture, and the compiler
How to improve performance?
There are 
3
 
factors
:
IPC, instructions, and frequency
#instructions is dependent on the compiler 
 not on the architecture
Let us look at 
IPC
 and 
frequency
IPC
What is the IPC of an in-order pipeline?
1 if there are no stalls, otherwise < 1
What about frequency?
What is frequency dependent on …
 Frequency = 1 / clock period
Clock Period:
1 
pipeline stage 
is expected to take 1 clock cycle
Clock period = maximum latency of the pipeline stages
How to reduce the clock period
Make each stage of the pipeline 
smaller
 by 
increasing
 the number of pipeline
stages
Use faster transistors
Limits to Increasing Frequency
Assume that we have the fastest possible transistors
Can we increase the frequency to 100 GHz
Reasons
Limits to increasing frequency - II
What does it mean to have a very high frequency?
Before answering keep these facts in mind:
Thumb
Rule
 
P 
 power
f 
 frequency
Thermo-
dynamics
P 
 power
T 
 Temperature
We need to increase the number of pipeline stages 
more hazards, more forwarding paths
1
2
3
How many pipeline stages can we have?
We are limited by the latch delay
Even with an infinite number of stages, the minimum clock period will
be equal to the latch delay
Logic
Latch
Latch
Logic
Latch
Few stages
More stages
Even more stages
Pipeline Stages vs IPC
CPI = 1/ IPC
CPI = CPI
ideal
 + stall_rate * stall_penalty
The stall rate will remain more or less constant per instruction
 with the number of pipeline stages
The stall penalty (in terms of cycles) will however 
increase
This will lead to a 
loss
 in IPC 
Summary: Why we cannot increase
frequency?
Since we cannot increase frequency …
Increase IPC
Increase IPC
 
Issue more instructions per cycle
2, 4, or 8 instructions
Make it a 
superscalar
 
processor 
 A processor that can execute
multiple instructions per cycle
In-order Superscalar Processor
Have multiple pipelines.
Inst.
Fetch
Operand
Fetch
Execute
Memory
Access
Register
Writeback
 
Inst.
Fetch
Operand
Fetch
Execute
Memory
Access
Register
Writeback
 
Inst.
Fetch
Operand
Fetch
Execute
Memory
Access
Register
Writeback
 
Instruction i 
Instruction i+1 
Instruction i+2 
In-order Superscalar Processor - II
There can be 
dependences
 between instructions
Have O(n
2
) forwarding paths for a n-issue processor
Complicated logic for detecting dependences, hazards, and
forwarding
Still might not be 
enough
 ...
To get the peak IPC (= n) in an n-issue pipeline, we need to ensure
that there are no stalls
There will be no stalls if there are no 
taken branches
, and no 
data
dependences 
between instructions.
Programs typically do 
not have 
such long 
sequences
 of instructions
without dependences
What to do ...
Don’t follow program order
Consider
Too many dependences
mov r1, 1
add r3, r1, r2
add r4, r3, r2
mov r5, 1
add r6, r5, 1
add r8, r7, r6
Execute out of order
mov r1, 1
add r3, r1, r2
add r4, r3, r2
 
mov r5, 1
add r6, r5, 1
add r8, r7, r6
Execute on a 2-issue OOO processor
Execution on a 2-issue OOO Processor
 
mov r1, 1
add r3, r1, r2
add r4, r3, r2
mov r5, 1
add r6, r5, 1
add r8, r7, r6
cycle 1
cycle 2
cycle 3
issue slot 1
issue slot 2
Basic OOO idea
Contents
In-order Pipelines
Out-of-order Pipelines: Motivation
Out-of-order Pipelines: Basics
Branch Prediction
Revisit the Example
mov r1, 1
add r3, r1, r2
add r4, r3, r2
mov r5, 1
add r6, r5, 1
add r8, r7, r6
Pool of Instructions
Issue ready and
mutually independent
instructions
Pool of Instructions
Needs to be 
large enough 
such that the requisite number of mutually
independent instructions can be found
Typical instruction pool sizes: 64 to 128
How do we create a large pool of instructions in a program with
branches? We need to be sure that the instructions are on the
 correct
path
for (i = 1; i < m; i++) {
 
for (j = 1; j < i; j ++ ) {
  
if (j %2 == 0) continue;
  
....
 
}
}
Problems with creating an Instruction Pool
Typically 1 in 5 instructions is a
branch
Predict the direction of the
branches, and their targets
Dependences between Instructions
Program Order Dependence
mov r1, 1
mov r2, 2
One instruction appears after the other in program order
The program order is the order of instructions that is perceived by a
single cycle in-order processor executing the program.
Data Dependences
RAW 
 Read after Write Dependence (True dependence)
mov r1, 1
      add r3, r1, r2
It is a producer-consumer dependence.
 The earlier instruction produces a value, and the later instruction
reads it.
Data Dependences - II
WAW 
 Write after Write Dependence (Output dependence)
mov r1, 1
      add r1, r4, r2
Two instructions write to the same location
The later instruction needs to take effect after the former
Data Dependences - III
WAR 
 Write after Read Dependence (Anti dependence)
add r1, r2, r3
add r2, r5, r6
Earlier instruction reads, later instruction writes
The later instruction needs to execute after the earlier instruction has
read its values
Control Dependences
The 
add
 instruction is 
control dependent 
on the 
branch
(
beq
)
instruction
If the branch is taken then only the 
add
 instruction will 
execute
, not
otherwise
beq .label
.....
.label
 
add r1, r2, r3
Basic Results
In-order processors respect the program order dependences. Thus,
they automatically respect all data and control dependences.
OOO processors respect only data and control dependences.
Can output and anti dependences be removed?
Don’t you think that these dependences are there because
we have a finite number of registers.
What if we had an infinite number of registers?
mov r1, 1
      add r1, r4, r2
add r1, r2, r3
add r2, r5, r6
Solution: Infinite number of registers
mov r1, 1
add  r1, r2, r3
add  r4, r1, 1
mov r2, 5
add r6, r2, r8
mov r1, 8
add r9, r1, r2
mov v1, 1
add  v11, v2, v3
add  v4, v11, 1
mov v12, 5
add v6, v12, v8
mov v21, 8
add v9, v21, v12
Code with architectural registers
Code with virtual registers
Renaming
Program with real
(architectural) registers
Program with virtual
registers
RAW dependences
WAR dependences
WAW dependences
RAW dependences
Higher instruction level parallelism (ILP)
Where are we now ...
 
Fetch + Decode +
Rename + Reg. Read
Memory
Pool of
Instructions
Execution Units
Issue
Write back results
Issue with writeback
To an outsider should it matter if the processor is in-order or OOO
Answer is 
NO
Processor
Instructions
Register File
Memory
Update State
Assume that there is an exception or
interrupt
Languages like Java have dedicated functions that are called if there is a
divide-by-zero in the code.
The question is:
Would the sub instruction have executed when we enter the exception handler
mov r4, 10
mov r2, 0
div r3, r1, r2
sub r4, r4, 1
Divide by 0
Divide by zero
handler
Precise Exceptions
Assume that the exception handler decides to do nothing and 
return
back
After this the 
sub
 
instruction should be executed
This is exactly what will happen in an 
in-order processor
In an OOO processor there is a possibility that the 
sub
 
inst. can
execute out of order
Regular Instructions
÷ by 0
Exception
handler
sub
inst.
Precise Exceptions - II
To an external observer
The execution should always be correct
Even in the presence of interrupts and exceptions
Regular Instructions
÷ by 0
Exception
handler
sub
inst.
Regular Instructions
÷ by 0
Exception
handler
sub
inst.
Correct
Wrong
Precise Exceptions - III
We thus need 
precise exceptions
Assume that the dynamic instructions in a program (ordered in
program order
) are: ins
1
, ins
2
, ins
3
 ... ins
n
Assume that the processor starts the exception/interrupt 
handler
after it has just finished writing the 
results
 of instruction: ins
k
Then instructions: ins
1
 ... ins
k
 should have executed 
completely
 and
written their 
results
 to the memory/register file
AND, ins
k+1
 onwards should appear to not have started their
execution 
at all
Such an exception or interrupt is 
precise
Precise Exceptions in OOO Processor
Now, in an OOO processor how can exceptions be 
precise
?
We need a mechanism to ensure that instructions appear to execute
in order 
to not an external observer
Anything can happen inside the processor
But an external observer should not be able to make out if the processor is in-
order or OOO
Contents
In-order Pipelines
Out-of-order Pipelines: Motivation
Out-of-order Pipelines: Basics
Branch Prediction
Timing
 
Fetch i
th
 instruction
Fetch (i+1)
th
 instruction
Fetch (i+2)
th
 instruction
Predict (i+1)
th
 instruction
Predict (i+2)
th
 instruction
Predict (i+3)
th
 instruction
There is no time to look at an instruction
Branch Prediction
There are several things to predict:
1.
Predict if an instruction is a branch or not
2.
If it is a branch, predict its outcome
3.
If the branch is taken, predict its target
General structure of a predictor
Predictor
Program Counter (PC)
Prediction
Is an instruction a branch or not?
Insights
Given a PC, the status of the instruction (branch or not) does not change
Can we use this information?
Approach
The last time that we had seen a branch 
remember
 its PC
Next time we see a PC, 
check
 if we have seen it before
Also remember the type of the branch:
Unconditional
Conditional
Function call
Return
Make a structure in hardware to remember ...
PC
32 - n
n
Type of the instruction
2
n
 entries
Instruction Status
Table (IST)
Basic Method of Operation
When we see a branch instruction
Access its corresponding entry in the table
Record the branch type
When we see an instruction:
Check the corresponding entry of the table
Read the status of the instruction
Instruction Status Table (IST)
If we choose the least significant 10 bits of the PC address, we will
have 1024 entries in the IST
For a 32 bit PC, we cannot have a 2
32
 entry 
table
.
Too big
Too slow
Too much of power
Too much of area
We need to manage with a 
smaller
 table
Destructive Interference
Two instructions
can map to the
same entry
Instruction A
Instruction B
Defined as 
destructive
inteference
Branch Aliasing
Possible for a 
branch
 and a 
non-branch
 instruction to map to the
same entry
Possible for two 
branches
 to map to the same entry
Possible for two 
non-branches
 to map to the same entry
Need for 
disambiguation
 
Augment each entry of the IST
Disambiguation Between Addresses
PC
32 - n
n
Type of the instruction
2
n
 entries
Branch
type
32 – n
Disambiguation
We can keep the status of only 
branches
 in the IST
However, for every 
instruction
 we need to check the IST
If there is no entry, then we need to predict that it is not a branch
 Can be 
wrong
What to do 
Why does an IST work?
Mainly because most pieces of code exhibit 
temporal locality
In a given window of time 
instructions
 tend to repeat themselves
Predict the Direction of the Branch
If a branch is unconditional 
 no need to predict (taken)
If a branch is a call/return 
 no need to predict (taken)
If a branch is conditional 
 need to predict
Predictor
Program Counter (PC)
Taken or Not Taken
Example Code
void foo() {
 
for
 (i=0; i < 5; i++) {
  
...
 
}
}
C
Assembly
.foo:
 
mov   r0, 0
.loop:
 
cmp   r0, 5
 
beq   .exit
 
add    r0, r0, 1
 
b .loop
 
  
Predict
Simple Bimodal Predictor
 
PC
32 - n
n
Outcome of the branch
2
n
 entries
Predictor Table
Bimodal Predictor - II
Each 
entry
 saves the last 
recorded
 outcome of the 
branch
taken or not-taken
What is the problem:
The 
beq
 instruction is evaluated 6 times
Assume the default is 
not-taken
First 5 times 
 correctly predicted 
not-taken
6
th
 and last time 
 incorrectly predicted to be 
taken
Now assume we call the function foo() once again
The first time 
 we will predict 
taken
. This is wrong
Again the last time will be wrong
Correct prediction rate: 2/6
Increasing Accuracy
What is the 
problem
?
The 
beq
 instruction is fundamentally biased towards: 
not-taken
One exception at the end of the 
for
 
loop cannot change its inherent 
behavior
Let us add some 
hysteresis
Instead of having a single bit in each entry, have a 2 bit counter
Saturating Counter
00
01
10
11
 
Increment
 
Decrement
Saturating Counters
Algorithm for updates:
If a branch is 
taken
, 
increment 
the saturating counter
If it is 
not taken
, 
decrement 
the counter
For prediction:
00 and 01 
 not taken
10 and 11 
 taken
00
01
10
11
Predict: Not Taken
Predict: Taken
Bimodal Predictor with Saturating Counters
PC
32 - n
n
Outcome of the branch
2
n
 entries
Predictor Table
2-bit saturating
counter
Logic
Will it help?
1
st
 Time: Predict not-taken. 
Correct
. Starts with 01. Moves to 00
2
nd
 Time: Predict not-taken. 
Correct
. Remains at 00
...
6
th
 Time: Predict not-taken. 
Wrong
. Starts with 00. Moves to 01
.foo:
 
mov   r0, 0
.loop:
 
cmp   r0, 5
 
beq   .exit
 
add    r0, r0, 1
 
b .loop
 
  
Misprediction rate: Down from 2/6 to 1/6
Can we do better?
void foo() {
 
for
 (i=0; i < 5; i++) {
  
...
  
if (i == 4) {
   
foobar();
  
}
 
}
}
C
Assembly
What if we had?
.foo:
 
mov   r0, 0
.loop:
 
cmp   r0, 5
 
beq   .exit
 
add    r0, r0, 1
 
cmp   r0, 4
 
bne    .loop
 
call     .foobar
 
b .loop
 
  
Global History Register (GHR)
Let us have a 
shift register 
that records the 
history
 of the last 
n
branches
We record: 1 
 taken branch, 0 
 not taken branch
Let us consider a 2-bit 
GHR
We have two conditional branches in the running example
beq .exit
bne .loop
Status of the GHR
Beginning of 2
nd
 Iteration
01
Beginning of 4
th
 Iteration
01
Beginning of 5
th
 iteration
01
Beginning of 6
th
 iteration
00
.foo:
 
mov   r0, 0
.loop:
 
cmp   r0, 5
 
beq   .exit
 
add    r0, r0, 1
 
cmp   r0, 4
 
bne    .loop
 
call     .foobar
 
b .loop
 
  
Gap Predictor
PC
32 - n
n
Outcome of the branch
2
n+k
 entries
k-bit GHR
Logic
n
k
PHT: Pattern History Table
Gap Predictor
We create 2
k
 times more 
entries
 in the predictor table
Captures
 the global branch history
In this case, we can remove the 
misprediction 
after the 5
th
 iteration
(6
th
 time the instruction 
beq .exit 
is evaluated)
The accuracy is expected to increase.
In general we can create different 
combinations
 of:
The PC bits and the GHR’s branch history
Another way of combining information:
GShare
PC
32 - n
n
Outcome of the branch
2
max(n,k)
 entries
k-bit GHR
Logic
XOR
Can we design a class of predictors?
G 
 Global, P 
 Pattern
All four combinations are possible
Gag
Gap
Pag
Pap
Gag Predictor
Outcome of the branch
2
k
 entries
k-bit GHR
Logic
k
Pag predictor
PC
Outcome of the branch
2
k
 entries
k-bit GHR
Logic
n1
k-bit GHR
k
Pap predictor
PC
32 - n
n
Outcome of the branch
2
n+k
 entries
k-bit GHR
Logic
n1
k-bit GHR
n
k
Tournament Predictor
Different predictors have different 
accuracies
 for different 
programs
How to know which predictor is best for which program?
What to do?
Answer: Use two 
predictors
, and 
choose
 between them
Predictor 1
Predictor 2
Choose
 
PC
n bits
Choose
Predictor 1
Predictor 2
Prediction
Operation of a Tournament Predictor
Prediction
Find the entry in the 
chooser table
Choose the predictor
Use its prediction
Training
Train both the 
predictors
Train the entry in the chooser array if we chose the wrong predictor
Modification: We can either use 1 bit in each entry of the 
chooser
table or use a 2-bit 
saturating counter
Other methods of prediction
Reduce 
aliasing
Incorporate a few tag bits in each 
entry
 of the predictor
Have multiple predictors for different 
subsets
 of branches
Include a bias bit with every branch (its most likely direction), and just predict
if we need to agree with the bias bit or not (agree predictor)
Ensure that all entries of the PHT are uniformly used
Better 
use 
of bits
Separate 
high confidence 
and 
low confidence
 branches. Dedicate more bits to
low confidence branches
Examples of other predictors:
Bi-mode, Agree, Skew, YAGS, TAGE
Prediction and Compression
Consider a sequence of 
<PC,outcome> 
pairs
Let’s say compress this sequence using a standard program: zip, rar
Is the prediction accuracy related to the compression ratio (size of
compressed file/ size of original file)
YES
Take a course on information theory
Read about the Fano’s inequality
See this 
link
Predict the Branch Target
 
Predictor
PC of the taken branch
Target
Use the IST. Let us call it the Branch Target
Buffer (BTB)
PC
32 - n
n
Type of the instruction
and target
2
n
 entries
Branch
type
32 – n
target
Calls and Returns
foo
foobar
foobarbar
call
call
ret
ret
0xFC0
0xFF4
foobarbarbar
ret
call
0xBF8
How to predict return addresses?
Solution: Use a stack
Return address stack (RAS)
foo
foobar
foobarbar
call
call
ret
ret
0xFC0
0xFF4
foobarbarbar
ret
call
0xBF8
0xFC0
0xFF4
0xBF8
Stack
Summary: The Branch Prediction System
PC
BTB
Type & Target
Branch
Predictor
PC
RAS
Choose
the
next PC
next PC
Slide Note
Embed
Share

Explore the concepts of in-order and out-of-order pipelines, along with the basics of branch prediction. Learn about the challenges and solutions related to data hazards, control hazards, and performance equations in computer architecture. Discover how to optimize pipeline efficiency for faster computing processes.

  • Computer Architecture
  • Pipelines
  • Data Hazards
  • Control Hazards
  • Branch Prediction

Uploaded on Oct 05, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. OOO Pipelines Smruti R. Sarangi

  2. Contents In-order Pipelines Out-of-order Pipelines: Motivation Out-of-order Pipelines: Basics Branch Prediction

  3. Pipelines What do we know up till now: In-order Pipelines Instructions enter the pipeline in program order Register Writeback Inst. Fetch Operand Fetch Memory Access Execute 5-Stage Pipeline

  4. Problems with Inorder Pipelines Register Writeback Inst. Fetch Operand Fetch Memory Access Execute Hazards Structural Hazards Two instructions vie for the same resource (NOT possible in simple 5 stage pipelines) Data Hazards Producer instruction is not ready when the consumer needs the data Control Hazards Instructions are fetched from the wrong path of the branch

  5. Data Hazards in In-order Pipelines Register Writeback Instruction Fetch Operand Fetch Memory Access Execute add r5, r4, 1 ld r4, 4[r0] Need data now Earliest it can be generated Cycle N Cycle N+1 Hazard clock

  6. Solution: Stall the Pipeline Register Writeback Instruction Fetch Operand Fetch Memory Access Execute ld r4, 4[r0] add r5, r4, 1 ld r4, 4[r0] Need data now Here is the data Cycle N Cycle N+1 Cycle N+2 clock

  7. Control Hazards Register Writeback Inst. Fetch Operand Fetch Memory Access Execute beq .label add r1, r2, r3 sub r5, r6, r7 beq .label Cancel these instructions We know the status of the branch now Two instruction slots are wasted

  8. Contents In-order Pipelines Out-of-order Pipelines: Motivation Out-of-order Pipelines: Basics Branch Prediction

  9. Performance Equation Is Computer A faster that computer B Wrong Answers: More is the clock speed, faster is the computer More is the RAM, faster is the computer What does it mean for computer A to be faster than computer B Short Answer: NOTHING Performance is always with respect to a program. You can say that a certain program runs faster on computer A as compared to computer B.

  10. Performance Equation - II #???????? #??????? = #???????? = = ??? ???? *#????? #?????? * #?????? #??????? #????? ??? ???? #?????/??????? #????? (assume just 1 program) Let us loosely refer to the reciprocal of the time per program as the performance

  11. So, what does performance depend on #instructions in the program Depends on the compiler Frequency Depends on the technology and the architecture IPC Depends on the architecture, and the compiler

  12. How to improve performance? There are 3 factors: IPC, instructions, and frequency #instructions is dependent on the compiler not on the architecture Let us look at IPC and frequency IPC What is the IPC of an in-order pipeline? 1 if there are no stalls, otherwise < 1

  13. What about frequency? What is frequency dependent on Frequency = 1 / clock period Clock Period: 1 pipeline stage is expected to take 1 clock cycle Clock period = maximum latency of the pipeline stages How to reduce the clock period Make each stage of the pipeline smaller by increasing the number of pipeline stages Use faster transistors

  14. Limits to Increasing Frequency Assume that we have the fastest possible transistors Can we increase the frequency to 100 GHz Reasons

  15. Limits to increasing frequency - II What does it mean to have a very high frequency? Before answering keep these facts in mind: P power f frequency Thumb Rule ? ?3 1 P power T Temperature Thermo- dynamics 2 T ? We need to increase the number of pipeline stages more hazards, more forwarding paths 3

  16. How many pipeline stages can we have? Logic Logic Latch Latch Latch Even more stages More stages Few stages We are limited by the latch delay Even with an infinite number of stages, the minimum clock period will be equal to the latch delay

  17. Pipeline Stages vs IPC CPI = 1/ IPC CPI = CPIideal + stall_rate * stall_penalty The stall rate will remain more or less constant per instruction with the number of pipeline stages The stall penalty (in terms of cycles) will however increase This will lead to a loss in IPC

  18. Summary: Why we cannot increase frequency? Power Temperature Effect of the Latch Delay Stall penalties will increase

  19. Since we cannot increase frequency Increase IPC

  20. Increase IPC Issue more instructions per cycle 2, 4, or 8 instructions Make it a superscalar processor A processor that can execute multiple instructions per cycle

  21. In-order Superscalar Processor Have multiple pipelines. Register Writeback Inst. Fetch Operand Fetch Memory Access Instruction i Execute Register Writeback Inst. Fetch Operand Fetch Memory Access Instruction i+1 Execute Register Writeback Inst. Fetch Operand Fetch Memory Access Instruction i+2 Execute

  22. In-order Superscalar Processor - II There can be dependences between instructions Have O(n2) forwarding paths for a n-issue processor Complicated logic for detecting dependences, hazards, and forwarding Still might not be enough ... To get the peak IPC (= n) in an n-issue pipeline, we need to ensure that there are no stalls There will be no stalls if there are no taken branches, and no data dependences between instructions. Programs typically do not have such long sequences of instructions without dependences

  23. What to do ... Don t follow program order Consider mov r1, 1 add r3, r1, r2 add r4, r3, r2 mov r5, 1 add r6, r5, 1 add r8, r7, r6 Too many dependences

  24. Execute out of order Execute on a 2-issue OOO processor mov r1, 1 add r3, r1, r2 add r4, r3, r2 mov r5, 1 add r6, r5, 1 add r8, r7, r6

  25. Execution on a 2-issue OOO Processor issue slot 1 issue slot 2 cycle 1 cycle 2 cycle 3 mov r1, 1 add r3, r1, r2 add r4, r3, r2 mov r5, 1 add r6, r5, 1 add r8, r7, r6

  26. Basic OOO idea Create a pool of instructions Find instructions that are mutually independent and have all their operands ready Execute them out of order

  27. Contents In-order Pipelines Out-of-order Pipelines: Motivation Out-of-order Pipelines: Basics Branch Prediction

  28. Revisit the Example Pool of Instructions mov r1, 1 add r3, r1, r2 add r4, r3, r2 mov r5, 1 add r8, r7, r6 add r6, r5, 1 Issue ready and mutually independent instructions

  29. Pool of Instructions Needs to be large enough such that the requisite number of mutually independent instructions can be found Typical instruction pool sizes: 64 to 128 How do we create a large pool of instructions in a program with branches? We need to be sure that the instructions are on the correct path for (i = 1; i < m; i++) { for (j = 1; j < i; j ++ ) { if (j %2 == 0) continue; .... } }

  30. Problems with creating an Instruction Pool Typically 1 in 5 instructions is a branch Predict the direction of the branches, and their targets

  31. Dependences between Instructions Program Order Dependence mov r1, 1 mov r2, 2 One instruction appears after the other in program order The program order is the order of instructions that is perceived by a single cycle in-order processor executing the program.

  32. Data Dependences RAW Read after Write Dependence (True dependence) mov r1, 1 add r3, r1, r2 It is a producer-consumer dependence. The earlier instruction produces a value, and the later instruction reads it.

  33. Data Dependences - II WAW Write after Write Dependence (Output dependence) mov r1, 1 add r1, r4, r2 Two instructions write to the same location The later instruction needs to take effect after the former

  34. Data Dependences - III WAR Write after Read Dependence (Anti dependence) add r1, r2, r3 add r2, r5, r6 Earlier instruction reads, later instruction writes The later instruction needs to execute after the earlier instruction has read its values

  35. Control Dependences beq .label ..... .label add r1, r2, r3 The add instruction is control dependent on the branch(beq) instruction If the branch is taken then only the add instruction will execute, not otherwise

  36. Basic Results In-order processors respect the program order dependences. Thus, they automatically respect all data and control dependences. OOO processors respect only data and control dependences.

  37. Can output and anti dependences be removed? add r1, r2, r3 add r2, r5, r6 mov r1, 1 add r1, r4, r2 Don t you think that these dependences are there because we have a finite number of registers. What if we had an infinite number of registers?

  38. Solution: Infinite number of registers mov r1, 1 add r1, r2, r3 add r4, r1, 1 mov r2, 5 add r6, r2, r8 mov r1, 8 add r9, r1, r2 mov v1, 1 add v11, v2, v3 add v4, v11, 1 mov v12, 5 add v6, v12, v8 mov v21, 8 add v9, v21, v12 Code with architectural registers Code with virtual registers

  39. Renaming Program with virtual registers Program with real (architectural) registers RAW dependences RAW dependences WAR dependences WAW dependences Higher instruction level parallelism (ILP)

  40. Where are we now ... Pool of Instructions Fetch + Decode + Rename + Reg. Read Issue Memory Execution Units Write back results

  41. Issue with writeback Register File Update State Instructions Processor Memory To an outsider should it matter if the processor is in-order or OOO Answer is NO

  42. Assume that there is an exception or interrupt Divide by 0 mov r4, 10 mov r2, 0 div r3, r1, r2 sub r4, r4, 1 Divide by zero handler Languages like Java have dedicated functions that are called if there is a divide-by-zero in the code. The question is: Would the sub instruction have executed when we enter the exception handler

  43. Precise Exceptions sub inst. Exception handler by 0 Regular Instructions Assume that the exception handler decides to do nothing and return back After this the sub instruction should be executed This is exactly what will happen in an in-order processor In an OOO processor there is a possibility that the sub inst. can execute out of order

  44. Precise Exceptions - II Correct sub inst. Exception handler by 0 Regular Instructions Wrong sub inst. Exception handler by 0 Regular Instructions To an external observer The execution should always be correct Even in the presence of interrupts and exceptions

  45. Precise Exceptions - III We thus need precise exceptions Assume that the dynamic instructions in a program (ordered in program order) are: ins1, ins2, ins3 ... insn Assume that the processor starts the exception/interrupt handler after it has just finished writing the results of instruction: insk Then instructions: ins1 ... insk should have executed completely and written their results to the memory/register file AND, insk+1 onwards should appear to not have started their execution at all Such an exception or interrupt is precise

  46. Precise Exceptions in OOO Processor Now, in an OOO processor how can exceptions be precise? We need a mechanism to ensure that instructions appear to execute in order to not an external observer Anything can happen inside the processor But an external observer should not be able to make out if the processor is in- order or OOO

  47. Contents In-order Pipelines Out-of-order Pipelines: Motivation Out-of-order Pipelines: Basics Branch Prediction

  48. Timing Fetch ith instruction Fetch (i+1)th instruction Predict (i+1)th instruction Fetch (i+2)th instruction Predict (i+2)th instruction Predict (i+3)th instruction There is no time to look at an instruction

  49. Branch Prediction There are several things to predict: 1. Predict if an instruction is a branch or not 2. If it is a branch, predict its outcome 3. If the branch is taken, predict its target General structure of a predictor Program Counter (PC) Prediction Predictor

  50. Is an instruction a branch or not? Insights Given a PC, the status of the instruction (branch or not) does not change Can we use this information? Approach The last time that we had seen a branch remember its PC Next time we see a PC, check if we have seen it before Also remember the type of the branch: Unconditional Conditional Function call Return

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#