ECEC 194: High Performance Computer Architecture
EE194/Comp140 Mark Hempstead
Spring 2016, Tufts University
Instructor: Prof. Mark Hempstead (mark@ece.tufts.edu)
Lecture 3: Review of Branch Prediction, Memory Hierarchy and Caches
More pipelining and stalls
Consider a random assembly program:
load r2=mem[r1]
add r5=r3+r2
add r8=r6+r7
The architecture says that instructions are executed in order. The 2nd instruction uses the r2 written by the first instruction.
Pipelined instructions
Pipelining cheats! It launches instructions before the previous ones finish.
Hazards occur when one computation uses the results of another – it exposes our sleight of hand.
EE 193 Joel Grodstein
Same problem with branches
load r2=mem[r1]
if (r2==0) goto L1
sub r8=r6-r7
L1: add r8=r8+r1
Same problem as before: we don't know whether to fetch the "sub" or the "add" until we know r2.
The simplest solution is to stall, just like before.
What prediction looks like
load r2=mem[r1]
if (r2==0) goto L1
sub r8=r6-r7
L1: add r8=r8+r1
Let's predict that r2 != 0, so we don't take the branch, and we will execute r8=r6-r7. We start executing it right away, before we know if we are really supposed to.
What could go wrong with this? What if we hit cycle 5 and find that r2=0? We must undo the sub, and do the add instead.
Easy, right? What if the load took longer, so we already wrote r8?
Now we have a cache miss and it takes longer for the load to finish. We hit cycle 5 and find that r2=0. We want to cancel the "sub" – but it already wrote r8! We'll talk about how to fix this later.
Branch prediction
Guessing if a branch will be taken is called branch prediction.
Bottom line: it's hard to fix a wrong prediction, so you'd better try to get it right most of the time.
Now we'll look at branch prediction:
Hardware guesses the branch outcome
Start fetching from the guessed address
Somehow fix things up on a mis-predict
People branch-predict too
Buy an engagement ring, predicting a "yes" answer.
If you guessed wrong, then throw away the ring?
Pick which way to drive to work based on what the traffic was like yesterday.
If today's backups are different, then make excuses when you get to work.
Branch prediction is hard
"Making predictions is hard, especially when they're about things that haven't happened yet." – Yogi Berra
Corollary: branch prediction is hard.
Branch Characteristics
Integer benchmarks: 14-16% of instructions are conditional branches. FP: 3-12%.
On average:
67% of conditional branches are "taken"
60% of forward branches are taken
85% of backward branches are taken. Why? They tend to be the termination condition of a loop.
Simplest of predictors
Really simple predictor: 67% of conditional branches are "taken", so assume branches are always taken.
Sounds nice, but being right 67% of the time isn't nearly good enough; the penalty for being wrong is usually far too high.
Ditto for "assume not taken."
Learn from past, predict the future
Record the past in a hardware structure.
Direction predictor (DIRP): maps a conditional-branch PC to a taken/not-taken (T/N) decision.
Individual conditional branches are often biased or weakly biased; 90%+ one way or the other is considered "biased". Why? Loop back edges, and checks for uncommon conditions.
Branch history table (BHT): the simplest predictor. The PC indexes a table of bits (0 = Not taken, 1 = Taken), with no tags. Essentially: the branch will go the same way it went last time.
What about aliasing – two PCs with the same lower bits? No problem, it's just a prediction!
[Figure: PC bits [9:2] form the BHT index; the BHT entry gives the taken/not-taken prediction. Not shown: some way to update the BHT after a mis-predict.]
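As a concrete illustration (not from the slides; entry count and PC hashing are illustrative assumptions), a tagless 1-bit BHT can be sketched in a few lines: the low PC bits index a table of single bits, and each branch is predicted to go the way it went last time.

```python
# Sketch of a tagless 1-bit branch history table (BHT).
class OneBitBHT:
    def __init__(self, entries=1024):
        self.table = [0] * entries          # 0 = not taken, 1 = taken
        self.entries = entries

    def index(self, pc):
        # Word-aligned PCs: drop the low 2 bits, keep the next bits.
        return (pc >> 2) % self.entries     # aliasing possible -- it's just a prediction

    def predict(self, pc):
        return self.table[self.index(pc)] == 1

    def update(self, pc, taken):
        self.table[self.index(pc)] = 1 if taken else 0

bht = OneBitBHT()
pc = 0x4000
outcomes = [True, True, True, False] * 3    # inner-loop pattern (T,T,T,N)
mispredicts = 0
for taken in outcomes:
    if bht.predict(pc) != taken:
        mispredicts += 1
    bht.update(pc, taken)
```

On the repeating (T,T,T,N) inner-loop pattern, this predictor mispredicts twice per loop execution (once at each end of the loop).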
Branch History Table (BHT)
Branch history table (BHT): the simplest direction predictor. The PC indexes a table of bits (0 = N, 1 = T), no tags. It predicts the branch to go the same way it went last time.
Problem: the inner-loop branch below.
for (i=0;i<100;i++)
   for (j=0;j<3;j++)
      // whatever
It will be wrong twice per inner loop: once each at the beginning and the end. The branch predictor "changes its mind too quickly".
Two-Bit Saturating Counters (2bc)
Two-bit saturating counters (2bc) [Smith 1981]: replace each single-bit predictor with a 2-bit counter, (0,1,2,3) = (N,n,t,T).
This adds "hysteresis": it forces the predictor to mis-predict twice before "changing its mind".
Result: one mispredict per loop execution (rather than two), which improves our inner-loop problem.
ECEC 621 Mark Hempstead
A 2-bit FSM means a prediction must miss twice before the prediction changes. N-bit predictors are possible, but beyond 2 bits there is not much benefit.
[Figure: 4-state FSM with states 11 (strongly taken), 10 (weakly taken), 01 (weakly not taken), 00 (strongly not taken); "Taken" outcomes move the state toward 11, "Not taken" outcomes move it toward 00.]
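A minimal sketch of one such counter (the 0..3 state encoding follows the FSM above; the starting state in the example is an assumption) shows the hysteresis at work:

```python
# One 2-bit saturating counter: 0=strongly NT, 1=weakly NT,
# 2=weakly T, 3=strongly T.
class TwoBitCounter:
    def __init__(self, state=1):
        self.state = state

    def predict(self):
        return self.state >= 2              # states 2 and 3 predict taken

    def update(self, taken):
        # Move one step toward the outcome, saturating at the ends.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

c = TwoBitCounter(state=3)                  # assume trained strongly-taken
mispredicts = 0
for taken in [True, True, False] * 4:       # j<3 inner loop: (T,T,N) repeating
    if c.predict() != taken:
        mispredicts += 1
    c.update(taken)
```

The loop-exit mispredict knocks the counter only down to "weakly taken", so the next loop's taken branches still predict correctly: one mispredict per loop execution instead of two.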
Intuition on how well this works
Simple predictors work well on branches that repeat the same behavior in streaks; they work better when the streaks are longer.
They don't work well for random or data-dependent branches. Data dependence can be common in some applications, but it's evil for branch prediction.
Accuracy of Different Schemes
[Figure: frequency of mispredictions (0% to 20% scale) on the SPEC benchmarks nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, li, for three schemes: a 4,096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1,024-entry (2,2) correlating predictor. Misprediction rates range from 0-1% up to 11% depending on benchmark and scheme.]
Branch prediction is hard
"Making predictions is hard, especially when they're about things that haven't happened yet." – Yogi Berra
Corollary: we're not going to find a perfect branch predictor.
Hybrid Branch Predictors
Tournament predictors: adaptively combine local and global predictors. Different schemes work better for different branches.
[Figure: a local predictor and a global predictor (each could be a 2-bit BHT) both feed a chooser predictor, which selects the final taken/not-taken prediction.]
Branch prediction costs
The BTB and BHT are big arrays that run every single cycle.
They are on critical paths, and thus must run fast.
They can be big power hogs.
They will never work well if there are data-dependent branches.
For good single-stream performance, good branch prediction is essential.
More costs
"When you come to a fork in the road, take it." – Yogi Berra
The central idea of branch prediction is that doing nothing is bad. Don't stall. Instead, speculatively execute a (hopefully pretty good) prediction.
But executing instructions takes energy, and if you have to kill them then the energy was wasted.
Unless your predictors are always right, branch prediction + speculative execution always wastes some energy in flushed instructions. Nonetheless, the tradeoff is usually a good one.
How far should we go?
 
How many transistors should we spend on branch
prediction?
The BP itself costs power and area, but does no
computation.
A BP is necessary for good single-stream performance
We already saw diminishing returns on bigger BHTs
The difficult-to-predict branches are often data
dependent; a better/bigger algorithm won't help much
It would be really nice if we just didn't care about
single-stream performance
But we do – usually.
Takeaways
Speculation is necessary to get reasonable single-stream performance; waiting for all branches to be resolved is too slow.
Branch prediction is needed for speculation to work well.
Good branch prediction is expensive and power hungry.
It's not nice to fool your branch predictor: irregular (e.g., data-dependent) branches make your code run really slow.
BACKUP
Interrupts
Interrupts are hard!
Types of interrupts
What kinds of things can go wrong in a program?
Floating-point errors: divide by 0, underflow, overflow
Integer errors: divide by 0
Memory errors: access unaligned data, illegal address, virtual memory fault
Instruction errors: illegal instruction, reserved instruction
Others: user-requested breakpoint, I/O error, …
What do we do about it?
Terminate the program. Easy for the architect, not
so nice for the user
It means your program crashes, with no ability to
recover & keep going.
Resume after fixing the problem. Harder for the
architect, but it’s usually what the user wants.
Absolutely needed for, e.g., page faults & for user traps
for single stepping.
Precise interrupts
An interrupt is precise if, after an instruction takes an interrupt, the machine state is such that:
All earlier instructions are completed
The interrupting instruction seemingly never happened
All later instructions seemingly never happened
All interrupts are taken in program order
Seems easy, right? It's not!
Interrupts can happen backwards
Cycle #3: the "Axx" takes an illegal-instruction interrupt, since there is no "Axx" instruction.
Cycle #4: the LDW takes an unaligned-memory interrupt, since R2=0x1000 and you cannot read a word from address 0x1003.
But the later instruction raised its interrupt first! How do you solve this?
Exceptions are only handled in WB.
Interrupts can still happen backwards!
The ADD.D and ADD are in separate pipelines; the floating-point pipe has more EX stages than the integer pipe does.
Cycle #6: the ADD finishes and writes the register file.
Cycle #7: the ADD.D takes an overflow interrupt.
But the earlier instruction already changed the RF!
Branch-delay slots
Consider the code:
loop: stuff…
      BEQ R2, R4, #loop
      LD R9, 30(R8)
The LD is a delay-slot instruction. MIPS always executes it after the branch, even if the branch is taken!
What if the LD faults after the BEQ is taken? We must rerun it and then jump to "loop". But you're not re-executing the BEQ, so there's no reason to jump to loop.
Suddenly restarting after an interrupt is harder.
Summary of Exceptions
Precise interrupts are a headache! Delayed branches make them a bigger headache.
Preview: OOO makes them a still-bigger headache – more ways to have something write back earlier than the exception.
Some machines punt on the problem:
Precise exceptions only for the integer pipe
A special "precise mode" used for debugging (10x slower)
Parting quote
"Democracy is the worst form of government there is… except for all of the others." – Winston Churchill
Corollary: the same is true for imprecise interrupts.
Backup
ECEC 621 Mark Hempstead
Branch prediction methods
When is information about branches
gathered/applied?
When the machine is designed
When the program is compiled (“compile-time”)
When a “training run” of the program is executed
(“profile-based”)
As the program is executing (“dynamic”)
Section 3.3 in textbook
EE194/Comp140 Mark Hempstead
Interrupt Taxonomy (C.4)
Synchronous vs. asynchronous (HW error, I/O)
User request (exception?) vs. coerced
User maskable vs. nonmaskable (ignorable)
Within vs. between instructions
Resume vs. terminate
The difficult exceptions are resumable interrupts within instructions: save the state, correct the cause, restore the state, continue execution.
Interrupts on Instruction Phases
Exceptions can occur in many different phases. However, exceptions are only handled in WB. Why?
load  IF  ID  EX  MEM  WB
add       IF  ID  EX   MEM  WB
Tournament Predictors
A tournament predictor using, say, 4K 2-bit counters indexed by local branch address. It chooses between:
Global predictor: 4K entries indexed by the history of the last 12 branches (2^12 = 4K); each entry is a standard 2-bit predictor.
Local predictor: a local history table of 1024 10-bit entries recording the last 10 branches, indexed by branch address. The pattern of the last 10 occurrences of that particular branch is used to index a table of 1K entries with 3-bit saturating counters.
Comparing Predictors (Fig. 2.8)
The advantage of a tournament predictor is its ability to select the right predictor for a particular branch. This is particularly crucial for integer benchmarks: a typical tournament predictor will select the global predictor almost 40% of the time for the SPEC integer benchmarks and less than 15% of the time for the SPEC FP benchmarks.
Core i7 Branch Predictor
Two levels of predictors: a simple 1st-level predictor and a slower, more accurate 2nd-level predictor.
Each predictor has three components:
Simple 2-bit predictor (Appendix C)
Global history predictor
Loop exit predictor
Why predict? Speculative Execution
Execute beyond branch boundaries before the branch is resolved.
Correct speculation: avoid the stall, the result is computed early; performance++
Incorrect speculation: abort/squash incorrect instructions (complexity+); undo any incorrect state changes (complexity++)
The performance gain is weighed vs. the penalty. Speculation accuracy = branch prediction accuracy.
Branch Direction Prediction
Basic idea: hope that the future behavior of the branch is correlated to its past behavior (loops, error-checking conditionals).
For a single branch PC:
Simplest possible idea: keep 1 bit around to indicate taken or not-taken
2nd simplest idea: keep 2 bits around, as a saturating counter
Store a "cache" of counters for recent PC addresses; use the LSBs of the PC as the index into this cache.
Dynamic Branch Prediction
 Why does prediction work?
Underlying algorithm has regularities
Data that is being operated on has regularities
Instruction sequence has redundancies that are artifacts of
way that humans/compilers think about problems
Is dynamic branch prediction better than static branch
prediction?
Seems to be
There are a small number of important branches in programs
which have dynamic behavior
ECEC 621 Mark Hempstead
Branch Prediction Buffer (branch history table, BHT)
A small memory indexed with the low bits of the branch instruction's address. Why the low bits?
Implementation:
A separate memory accessed during the IF phase
Or 2 bits attached to each block in the instruction cache. Caveats: you cannot separately size the I-cache and BHT, and what about multiple branches in a cache line?
Does this help our simple 5-stage pipeline?
[Figure: 12 bits of the PC index 2^12 = 4K entries, each giving a taken/not-taken prediction.]
2-level Predictor with Global History
Global history: one Branch History Register (BHR).
Per-address/set history: a per-address/set Branch History Table holds many BHRs.
Concatenate the global k-bit history (a shift register) with PC bits to index the PHT.
[Figure: PC bits concatenated with k history bits index the pattern history table, which outputs taken/not-taken.]
How to take an exception
1. Force a trap instruction on the next IF
2. Squash younger instructions (turn off all writes (register/memory) for the faulting instruction and all instructions that follow it)
3. Save all processor state after the trap begins
   PC-chain, PSW, condition codes, trap condition
   The PC-chain is the length of the branch delay plus 1
4. Perform the trap/exception code, then restart where we left off
Extra-credit problem
Pronounce your own name.
Pronounce “Donald Knuth.”
Which of the following would make the best
president?
Donald Trump
Donald Knuth
Donald Duck
Worth 1 point on the midterm.
Branch Prediction Performance
Parameters: branch: 20%, load: 20%, store: 10%, other: 50%; 75% of branches are taken.
Dynamic branch prediction: branches predicted with 95% accuracy.
CPI = 1 + 20% * 5% * 2 = 1.02
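The CPI calculation can be spelled out directly: only the mis-predicted branches pay the penalty (2 cycles in the formula above).

```python
# CPI with dynamic branch prediction: base CPI of 1, plus the
# mis-prediction penalty paid by the fraction of branches guessed wrong.
branch_frac = 0.20       # 20% of instructions are branches
mispredict_rate = 1 - 0.95
penalty = 2              # cycles lost per mis-prediction
cpi = 1 + branch_frac * mispredict_rate * penalty   # = 1.02
```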
Dynamic Branch Prediction Components
Step #1: is it a branch? Easy after decode...
Step #2: is the branch taken or not taken? Direction predictor (applies to conditional branches only); predicts taken/not-taken.
Step #3: if the branch is taken, where does it go? Branch target predictor. Easy after decode…
[Figure: pipeline with the branch predictor (BP) next to the I$ at fetch; the regfile and D$ come later in the pipe.]
Two-bit Saturating Counters, take 2
A slightly different version works better for some benchmarks.
[Figure: a variant 4-state FSM over the same states 11 (strongly taken), 10 (weakly taken), 01 (weakly not taken), 00 (strongly not taken), with the taken/not-taken transitions wired slightly differently than the counter above.]
Branch Target Buffer (BTB)
It's little use predicting taken/not-taken early in ID if you don't know the target address until late in EX. So we record past branch targets in a hardware structure.
Branch target buffer (BTB): "Last time the branch at X was taken, it went to address Y. So, in the future, if address X is fetched, fetch address Y next."
This works well because most control instructions use direct targets: the target is encoded in the instruction itself, so it's the same "taken" target every time.
What about indirect targets? The target is held in a register, so it can be different each time.
Two indirect call idioms:
Dynamically linked functions (DLLs): target always the same
Dynamically dispatched (virtual) functions: hard but uncommon
Also two indirect unconditional jump idioms:
Switches: hard but uncommon
Function returns: hard and common but…
Implementation
A small RAM: address = PC, data = target-PC.
Access at fetch in parallel with instruction memory:
predicted-target = BTB[hash(PC)]
Updated whenever target != predicted-target:
BTB[hash(PC)] = correct-target
Very similar to the BHT, but each entry is a full address width and not just one taken/not bit.
Aliasing? No problem, this is only a prediction.
[Figure: PC bits [9:2] index the BTB, which yields the predicted target address.]
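A minimal sketch of this structure (entry count and hash are illustrative assumptions, not from the slides):

```python
# Direct-mapped branch target buffer: hashed PC -> predicted target PC.
class BTB:
    def __init__(self, entries=4096):
        self.entries = entries
        self.target = [0] * entries

    def hash(self, pc):
        return (pc >> 2) % self.entries     # drop alignment bits, wrap

    def predict(self, pc):
        # Unlike the BHT, the payload is a full target address.
        return self.target[self.hash(pc)]

    def update(self, pc, correct_target):
        # Written whenever the actual target differs from the prediction.
        if self.target[self.hash(pc)] != correct_target:
            self.target[self.hash(pc)] = correct_target

btb = BTB()
btb.update(0x4000, 0x5000)        # the branch at 0x4000 went to 0x5000
next_fetch = btb.predict(0x4000)  # next fetch of 0x4000 predicts 0x5000
```

Aliasing (two PCs hashing to the same entry) just produces a wrong prediction, which the normal mis-predict recovery fixes.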
Return Address Stack (RAS)
Return address stack (RAS):
Call instruction? RAS[TopOfStack++] = PC+4
Return instruction? Predicted-target = RAS[--TopOfStack]
Q: how can you tell if an insn is a call/return before decoding it? Accessing the RAS on every instruction would waste power.
Answer: another predictor (or put them in the BTB marked as "return"). Or, pre-decode bits in the instruction memory, written when the instruction is first executed.
[Figure: PC+4 is pushed onto the RAS on calls; on returns the RAS supplies the predicted target, alongside the BTB's tag-compare path.]
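The push/pop rule above can be sketched directly (the fixed depth and wrap-around on overflow are assumptions; real designs vary):

```python
# Return address stack: calls push PC+4, returns pop the predicted target.
class RAS:
    def __init__(self, depth=16):
        self.stack = [0] * depth
        self.top = 0
        self.depth = depth

    def on_call(self, pc):
        self.stack[self.top % self.depth] = pc + 4
        self.top += 1

    def on_return(self):
        self.top -= 1
        return self.stack[self.top % self.depth]

ras = RAS()
ras.on_call(0x1000)        # call at 0x1000: push 0x1004
ras.on_call(0x2000)        # nested call: push 0x2004
first = ras.on_return()    # predicts 0x2004 (inner return)
second = ras.on_return()   # predicts 0x1004 (outer return)
```

Because calls and returns nest, a stack matches the target pattern that defeats the BTB ("hard and common" indirect returns).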
Putting It All Together
The BTB and branch direction predictor are accessed during fetch. If the branch prediction is correct, then there is no penalty for branches!
[Figure: the fetch PC accesses the BTB (tag compare selects the target), the RAS (for return targets), and the BHT (taken/not-taken) in parallel to form the predicted next PC; PC+4 is the fall-through.]
Branch Prediction Performance
Dynamic branch prediction; 20% of instructions are branches.
Simple predictor: branches predicted with 75% accuracy.
CPI = 1 + (20% * 25% * 2) = 1.1
More advanced predictor: 95% accuracy.
CPI = 1 + (20% * 5% * 2) = 1.02
Branch mis-predictions are still a big problem though:
Pipelines are long: a typical mis-prediction penalty is 10+ cycles
For cores that do more per cycle, mis-predictions are more costly (later)
Many other methods exist for building a better branch predictor, including correlating predictors and global history tables (Chapter 3).
BHT Accuracy
Mispredicts happen because either:
We made the wrong guess for that branch
We got the branch history of the wrong branch due to aliasing (a 4K-entry table is shown here)
[Figure: BHT misprediction rates for the integer and floating-point benchmarks; put another way, prediction accuracy is between 99% and 82%.]
ADVANCED BRANCH PREDICTION (3.3)
Correlating Predictors
The 2-bit scheme only looks at a branch's own history to predict its behavior. What if we use other branches to predict it as well?
if (aa==2) aa=0;   // Branch #1
if (bb==2) bb=0;   // Branch #2
if (aa!=bb) {..}   // Branch #3
Does branch #3 depend on the outcomes of #1 and #2? Yes: if #1 and #2 are both taken, then #3 will not be taken.
Sometimes the past can predict the future with certainty! This is called "global history."
Another example of global history
Consider our inner loop again. The branch pattern is (T,T,T,N) repeating.
Consider the following global rule:
If the last branches were "T,T,T" then predict not taken
Else predict taken
This works perfectly – the trick is having the predictor figure out that rule.
Let's make a predictor that looks at both the history at this branch, as well as the recent outcomes of all branches (including this one).
Branch History Register
A simple shift register: shift in branch outcomes as they occur (1 => branch was taken, 0 => branch was not-taken).
A k-bit BHR => 2^k patterns. Now we know the outcome of the last k branches.
Use these patterns to address into the Pattern History Table. Effectively the PHT is a separate BHT for each global pattern.
Generic Adaptive Branch Prediction (Correlating Predictor)
Two-level BP requires two main components:
Branch history register (BHR): recent outcomes of branches (the last k branches encountered)
Pattern History Table (PHT): branch behavior for recent occurrences of the specific pattern of these k branches
In effect, we concatenate the BHR with branch PC bits. (A (2,2) predictor is drawn here.)
[Figure: 12 bits of the PC index PHTs of 2^12 = 4K entries each; a 2-bit BHR selects among the PHTs; the output is taken/not-taken.]
Pattern History Table
Has 2^k entries; each usually uses a 2-bit counter for the prediction.
The BHR is used to address the PHT. I.e., a (k,2) PHT is an array of 2^k BHTs, each of which has a 2-bit counter.
(2,1) correlated Predictor
Correlated (two-level) predictor [Patt 1991]: exploits the observation that branch outcomes are correlated, and maintains a separate prediction per (PC, BHR) pair.
Branch history register (BHR): recent branch outcomes.
Simple working example: assume the program has one branch.
BHT: one 1-bit DIRP entry
BHT + 2-bit BHR: 2^2 = 4 1-bit DIRP entries
(2,1) correlated Predictor
What went wrong? Look at the grey squares; they log our mistakes.
For NN, NT, TN we only made one mistake. But TT kept making mistakes. Why? We have both T,T,T and also T,T,N. Every time we went from one to the other, we got the wrong answer and had to re-train our predictor.
How do we fix this simply? Go to a (3,1) predictor.
(3,1) correlated predictor
3-bit BHR: 2^3 DIRP entries per pattern.
One-bit BHT entries (not 2-bit saturating counters).
(3,1) correlated predictor
With 3 bits of history, the history is now enough to fully predict the branch result. Any column has only one grey square; once we train the predictor for a given history, it never makes another mistake.
ECEC 621 Mark Hempstead
Hardware Costs of 2-level predictors
(m,n) predictor: m bits of global history, n-bit predictors
Storage = 2^m * n * (number of prediction entries) bits
Say you have m bits of history (m=2) and n bits of predictor per entry (n=2)
(2,2) predictor with 1K prediction entries:
2^2 * 2 * 1024 = 8K bits
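The cost formula is easy to check directly (a throwaway sketch of the slide's arithmetic):

```python
# Storage of an (m,n) two-level predictor:
# 2^m history combinations * n bits per counter * number of entries.
def predictor_bits(m, n, entries):
    return (2 ** m) * n * entries

print(predictor_bits(2, 2, 1024))  # the slide's (2,2), 1K-entry example
```

This prints 8192, i.e. the 8K bits on the slide.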
EE194/Comp140 Mark Hempstead
Outline
Control Hazards and Branch Prediction
Interrupts
Memory Hierarchy
Cache review
Presentation Transcript


  1. ECEC 194: High Performance Computer Architecture Spring 2016 Tufts University Instructor: Prof. Mark Hempstead mark@ece.tufts.edu Lecture 3: Review of Branch Prediction, Memory Hierarchy and Caches EE194/Comp140 Mark Hempstead 1

  2. More pipelining and stalls Consider a random assembly program load r2=mem[r1] add r5=r3+r2 add r8=r6+r7 The architecture says that instructions are executed in order. The 2nd instruction uses the r2 written by the first instruction EE194/Comp140 Mark Hempstead 2

  3. Pipelined instructions [pipeline diagram: load r2=mem[r1], add r5=r3+r2, add r8=r6+r7, store mem[r9]=r10, each advancing one stage per cycle through fetch, register read, execute/cache access, and register write] Pipelining cheats! It launches instructions before the previous ones finish. Hazards occur when one computation uses the results of another – it exposes our sleight of hand EE 193 Joel Grodstein 3

  4. Same problem with branches load r2=mem[r1] if (r2==0) goto L1 sub r8=r6-r7 L1: add r8=r8+r1 [pipeline diagram: the instruction after the branch cannot proceed until the load writes r2] Same problem as before: we don't know whether to fetch the "sub" or the "add" until you know r2. The simplest solution is to stall, just like before EE 193 Joel Grodstein 4

  5. What prediction looks like load r2=mem[r1] if (r2==0) goto L1 sub r8=r6-r7 L1: add r8=r8+r1 [pipeline diagram: the sub is fetched and executed immediately after the branch, before r2 is known] Let's predict that r2≠0, so we don't take the branch, and we will execute r8=r6-r7 We start executing it right away, before we know if we are really supposed to. What could go wrong with this? EE 193 Joel Grodstein 5

  6. What prediction looks like load r2=mem[r1] if (r2==0) goto L1 sub r8=r6-r7 L1: add r8=r8+r1 [pipeline diagram: same as the previous slide, with the sub in flight behind the branch] What could go wrong with this? What if we hit cycle 5 and find that r2=0? We must undo the sub, and do the add instead EE 193 Joel Grodstein 6

  7. What prediction looks like load r2=mem[r1] if (r2==0) goto L1 sub r8=r6-r7 L1: add r8=r8+r1 [pipeline diagram: this time the taken-path add r8=r8+r1 follows the branch] Easy, right? What if the load took longer, so we already wrote r8? EE 193 Joel Grodstein 7

  8. What prediction looks like load r2=mem[r1] if (r2==0) goto L1 sub r8=r6-r7 L1: add r8=r8+r1 [pipeline diagram: the load misses in the cache and repeats its cache access for several cycles, while the speculative sub reaches write-back] Now, we have a cache miss and it takes longer for the load to finish. Problems with this scenario? We hit cycle 5 and find that r2=0 We want to cancel the "sub" – but it already wrote r8! We'll talk about how to fix this later EE 193 Joel Grodstein 8

  9. Branch prediction Guessing if a branch will be taken is called branch prediction Bottom line: it's hard to fix a wrong prediction So you better try to get it right most of the time Now we'll look at branch prediction. Hardware guesses branch outcome Start fetching from guessed address Somehow fix things up on a mis-predict. EE194/Comp140 Mark Hempstead 9

  10. People branch-predict too Buy an engagement ring, predicting a yes answer If you guessed wrong, then throw away the ring? Pick which way to drive to work based on what the traffic was like yesterday If today's backups are different, then make excuses when you get to work EE194/Comp140 Mark Hempstead 10

  11. Branch prediction is hard "Making predictions is hard, especially when they're about things that haven't happened yet." – Yogi Berra Corollary: branch prediction is hard EE194/Comp140 Mark Hempstead 11

  12. Branch Characteristics Integer Benchmarks: 14-16% instructions are conditional branches FP: 3-12% On Average: 67% of conditional branches are taken 60% of forward branches are taken 85% of backward branches are taken. Why? They tend to be the termination condition of a loop EE194/Comp140 Mark Hempstead 12

  13. Simplest of predictors Really simple predictor: 67% of conditional branches are taken So assume branches are always taken Sounds nice, but being right 67% of the time isn't nearly good enough The penalty for being wrong is usually far too high Ditto for "assume not taken." EE194/Comp140 Mark Hempstead 13

  14. Branch Prediction Learn from past, predict the future Record the past in a hardware structure Direction predictor (DIRP) Maps conditional-branch PC to taken/not-taken (T/N) decision Individual conditional branches often biased or weakly biased 90%+ one way or the other considered biased Why? Loop back edges, checking for uncommon conditions Branch history table (BHT): simplest predictor PC indexes table of bits (0 = Not taken, 1 = Taken), no tags Essentially: branch will go same way it went last time [diagram: PC bits [9:2] index the BHT, which yields a T-or-NT prediction; bits [31:10] and [1:0] are ignored] Not shown: some way to update the BHT after a mis-predict What about aliasing? Two PCs with the same lower bits? No problem, it's just a prediction! EE194/Comp140 Mark Hempstead 14
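A sketch of this simplest BHT in Python (illustrative names; 256 entries indexed by PC bits [9:2], no tags, matching the slide's diagram):

```python
# Simplest branch history table: PC bits [9:2] index a 256-entry table of
# single bits, no tags. Two branches whose PCs share those bits alias to
# the same entry; that can hurt accuracy but never correctness, since the
# output is only a prediction.

BHT_SIZE = 256          # 2^8 entries, indexed by PC[9:2]
bht = [0] * BHT_SIZE    # 0 = predict not-taken, 1 = predict taken

def bht_index(pc):
    return (pc >> 2) & (BHT_SIZE - 1)   # drop byte offset, keep bits [9:2]

def predict(pc):
    return bht[bht_index(pc)]

def update(pc, taken):                  # train on the actual outcome
    bht[bht_index(pc)] = 1 if taken else 0
```

For example, PCs 0x1008 and 0x11008 differ only above bit 9, so they share a BHT entry: training one changes the prediction for the other.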

  15. Branch History Table (BHT) Branch history table (BHT): simplest direction predictor PC indexes table of bits (0 = N, 1 = T), no tags Predicts the branch to go same way it went last time Problem: inner loop branch below for (i=0;i<100;i++) for (j=0;j<3;j++) // whatever It will be wrong twice per inner loop; once each at the beginning and the end. Branch predictor changes its mind too quickly [table, times 1-12: the branch outcome repeats T,T,T,N] EE194/Comp140 Mark Hempstead 15

  21. Branch History Table (BHT) Branch history table (BHT): simplest direction predictor PC indexes table of bits (0 = N, 1 = T), no tags Predicts the branch to go same way it went last time Problem: inner loop branch below for (i=0;i<100;i++) for (j=0;j<3;j++) // whatever It will be wrong twice per inner loop; once each at the beginning and the end. Branch predictor changes its mind too quickly [table, times 1-12: prediction N,T,T,T repeating; outcome T,T,T,N repeating; result Wrong, Correct, Correct, Wrong in every inner loop] EE194/Comp140 Mark Hempstead 21
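The table can be reproduced with a two-line model of the 1-bit predictor (a sketch, not the slide's hardware): predict the last outcome, then record the new one.

```python
# 1-bit "same as last time" predictor on the inner-loop branch, whose
# outcome pattern is T,T,T,N (taken three times, then the loop exit).
state = 0                              # last outcome; initially not-taken
results = []
for outcome in [1, 1, 1, 0] * 3:       # three trips through the inner loop
    results.append("Correct" if state == outcome else "Wrong")
    state = outcome                    # train: remember the last outcome
print(results)
```

Every inner loop yields Wrong, Correct, Correct, Wrong: two mispredicts per loop, exactly as in the table.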

  22. Two-Bit Saturating Counters (2bc) Two-bit saturating counters (2bc) [Smith 1981] Replace each single-bit predictor with (0,1,2,3) = (N,n,t,T) Adds hysteresis Force predictor to mis-predict twice before changing its mind One mispredict each loop execution (rather than two) + Improves our inner-loop problem. [table, time 1: state N, outcome T, result Wrong; the counter moves to n] EE194/Comp140 Mark Hempstead 22

  23. Two-bit Saturating Counters [state diagram: 11 strongly taken and 10 weakly taken predict taken; 01 weakly not taken and 00 strongly not taken predict not taken; a taken branch moves the counter toward 11, a not-taken branch toward 00] 2-bit FSMs mean prediction must miss twice before change N-bit predictors are possible, but after 2 bits not much benefit ECEC 621 Mark Hempstead 23
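The FSM can be sketched as a saturating counter (a hedged sketch; the 00/01/10/11 encoding below matches the slide, the function names are made up):

```python
# 2-bit saturating counter: 0..3 = strongly-not-taken, weakly-not-taken,
# weakly-taken, strongly-taken. The prediction flips only after two
# consecutive mispredicts.

def predict_2bc(state):
    return state >= 2                  # states 10 and 11 predict taken

def update_2bc(state, taken):          # saturate at 00 and 11
    return min(state + 1, 3) if taken else max(state - 1, 0)

# On the T,T,T,N inner-loop pattern, after warm-up the counter hovers
# between weakly and strongly taken and mispredicts only the loop exit.
state, wrong = 0, 0                    # start strongly not-taken
for outcome in [1, 1, 1, 0] * 10:      # ten trips through the inner loop
    wrong += predict_2bc(state) != bool(outcome)
    state = update_2bc(state, bool(outcome))
print(wrong)                           # 3 misses in the first loop, then 1 per loop
```

This is the hysteresis the slide describes: one mispredict per loop execution instead of two.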

  28. Two-Bit Saturating Counters (2bc) Two-bit saturating counters (2bc) [Smith 1981] Replace each single-bit predictor with (0,1,2,3) = (N,n,t,T) Adds hysteresis Force predictor to mis-predict twice before changing its mind One mispredict each loop execution (rather than two) + Improves our inner-loop problem. [table, times 1-12: counter state N,n,t,T then t,T,T,T then t,T,T,T; outcome T,T,T,N repeating; result Wrong, Wrong, Correct, Wrong in the first loop, then Correct, Correct, Correct, Wrong in every later loop] EE194/Comp140 Mark Hempstead 28

  29. Intuition on how well this works Simple predictors work well: on branches that repeat the same behavior in streaks. They work better when the streaks are longer They don't work well: for random or data-dependent branches. Data dependence can be common in some applications, but it's evil for branch prediction. EE194/Comp140 Mark Hempstead 29

  30. Accuracy of Different Schemes [bar chart: frequency of mispredictions, 0% to 18%, on spice, gcc, fpppp, nasa7, doducd, matrix300, li, tomcatv, expresso, eqntott, for three schemes: a 4,096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1,024-entry (2,2) BHT] ECEC 621 Mark Hempstead 30

  31. Branch prediction is hard "Making predictions is hard, especially when they're about things that haven't happened yet." – Yogi Berra Corollary: we're not going to find a perfect branch predictor. EE194/Comp140 Mark Hempstead 31

  32. Hybrid Branch Predictors Tournament predictors: Adaptively combine local and global predictors Different schemes work better for different branches [diagram: a local predictor and a global predictor (each could be a 2-bit BHT) feed a chooser predictor, which selects the final taken/not-taken prediction] ECEC 621 Mark Hempstead 32
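The chooser can itself be a 2-bit saturating counter per entry, trained toward whichever component predictor was right when they disagree. A hedged sketch of just that logic (names and the "high values favor local" convention are assumptions, not from the slide):

```python
# Tournament chooser: a 2-bit counter per entry selects between the local
# and global components. It only moves when exactly one component is right.

def tournament_predict(chooser, local_pred, global_pred):
    return local_pred if chooser >= 2 else global_pred   # >= 2 favors local

def train_chooser(chooser, local_pred, global_pred, taken):
    if local_pred == taken and global_pred != taken:
        return min(chooser + 1, 3)      # reward the local predictor
    if global_pred == taken and local_pred != taken:
        return max(chooser - 1, 0)      # reward the global predictor
    return chooser                      # both right or both wrong: no change
```

The component predictors are left abstract here; in a real design each would be a table like the BHTs on the earlier slides.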

  33. Branch prediction costs The BTB and BHT are big arrays that run every single cycle. They are on critical paths, and thus must run fast. They can be big power hogs. They will never work well if there are data-dependent branches. For good single-stream performance, a good branch predictor is a necessity. EE194/Comp140 Mark Hempstead 33

  34. More costs "When you come to a fork in the road, take it" – Yogi Berra The central idea of branch prediction is that doing nothing is bad. Don't stall. Instead, speculatively execute a (hopefully pretty good) prediction But executing instructions takes energy, and if you have to kill them then the energy was wasted. Unless your predictors are always right, branch prediction + speculative execution always wastes energy in flushed instructions. Nonetheless, the tradeoff is usually a good one. EE194/Comp140 Mark Hempstead 34

  35. How far should we go? How many transistors should we spend on branch prediction? The BP itself costs power and area, but does no computation. A BP is necessary for good single-stream performance We already saw diminishing returns on bigger BHTs The difficult-to-predict branches are often data dependent; a better/bigger algorithm won't help much It would be really nice if we just didn't care about single-stream performance But we do usually. EE194/Comp140 Mark Hempstead 35

  36. Takeaways Speculation is necessary to get reasonable single-stream performance Waiting for all branches to be resolved is too slow Branch prediction is needed for speculation to work well Good branch prediction is expensive and power hungry It's not nice to fool your branch predictor Irregular (e.g., data-dependent) branches make your code run really slow EE194/Comp140 Mark Hempstead 36

  37. BACKUP EE194/Comp140 Mark Hempstead 37

  38. Interrupts Interrupts are hard! EE194/Comp140 Mark Hempstead 38

  39. Types of interrupts What kinds of things can go wrong in a program? Floating-point errors: divide by 0, underflow, overflow Integer errors: divide by 0 Memory errors: access unaligned data, illegal address, virtual memory fault Instruction errors: illegal instruction, reserved instruction Others: user-requested breakpoint, I/O error, ... EE194/Comp140 Mark Hempstead 39

  40. What do we do about it? Terminate the program. Easy for the architect, not so nice for the user It means your program crashes, with no ability to recover & keep going. Resume after fixing the problem. Harder for the architect, but it's usually what the user wants. Absolutely needed for, e.g., page faults & for user traps for single stepping. EE194/Comp140 Mark Hempstead 40

  41. Precise interrupts An interrupt is precise if, after an instruction takes an interrupt, the machine state is such that All earlier instructions are completed The interrupting instruction seemingly never happened All later instructions seemingly never happened All interrupts are taken in program order Seems easy, right? It's not! EE194/Comp140 Mark Hempstead 41

  42. Interrupts can happen backwards [pipeline diagram: LDW R3, R2, #3 runs IF ID EX MEM WB in cycles 1-5; Axx R4, R3, R5 runs IF ID EX MEM WB in cycles 2-6] Cycle 3: the Axx takes an illegal-instruction interrupt, since there is no Axx instruction Cycle 4: the LDW takes an unaligned-memory interrupt, since R2=0x1000 and you cannot read a word from address 0x1003. But the later instruction raised its interrupt first! How do you solve this? Exceptions are only handled in WB EE194/Comp140 Mark Hempstead 42
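The fix on this slide (defer all exceptions to WB) can be sketched as a toy model: each instruction carries a pending-exception flag down the pipe, and because instructions reach WB in program order, the oldest fault wins regardless of which pipeline stage raised it first.

```python
# Toy model: each instruction records any fault it raised (and nothing is
# taken until WB). WB retires instructions in program order, so the LDW's
# MEM-stage fault is taken before the later Axx's ID-stage fault, even
# though the Axx's fault was raised a cycle earlier.
program_order = [
    ("LDW R3,R2,#3", "unaligned access"),    # fault raised in MEM (cycle 4)
    ("Axx R4,R3,R5", "illegal instruction"), # fault raised in ID (cycle 3)
]

def first_exception_at_wb(program_order):
    for insn, fault in program_order:        # WB sees program order
        if fault is not None:
            return insn, fault               # take the oldest fault
    return None

print(first_exception_at_wb(program_order))
```

This restores precise ordering: the LDW's interrupt is taken, and the Axx (and its fault) is squashed along with everything younger.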

  43. Interrupts can still happen backwards! [pipeline diagram: ADD.D F3,F2,F1 runs IF ID EX1 EX2 EX3 EX4 EX5 WB in cycles 1-8; ADD R4, R3, R5 runs IF ID EX MEM WB in cycles 2-6] The ADD.D and ADD are in separate pipelines; the floating-point pipe has more EX stages than does the integer pipe Cycle 6: the ADD finishes and writes the register file. Cycle 7: the ADD.D takes an overflow interrupt. But the earlier instruction already changed the RF! EE194/Comp140 Mark Hempstead 43

  44. Branch-delay slots Consider the code loop: stuff BEQ R2, R4, #loop LD R9, 30(R8) The LD is a delay-slot instruction. MIPS always executes it after the branch, even if the branch is taken! What if LD faults after BEQ is taken? Must rerun it and then jump to loop But you're not re-executing the BEQ, so there's no reason to jump to loop. Suddenly restarting after an interrupt is harder. EE194/Comp140 Mark Hempstead 44

  45. Summary of Exceptions Precise interrupts are a headache! Delayed branches make them a bigger headache. Preview: OOO makes them a still-bigger headache More ways to have something write back earlier than the exception Some machines punt on the problem Precise exceptions only for integer pipe Special precise mode used for debugging (10x slower) EE194/Comp140 Mark Hempstead 45

  46. Parting quote "Democracy is the worst form of government there is, except for all of the others." – Winston Churchill Corollary: the same is true for imprecise interrupts. EE194/Comp140 Mark Hempstead 46

  47. Backup EE194/Comp140 Mark Hempstead 47

  48. Branch prediction methods When is information about branches gathered/applied? When the machine is designed When the program is compiled ("compile-time") When a training run of the program is executed ("profile-based") As the program is executing ("dynamic") Section 3.3 in textbook ECEC 621 Mark Hempstead 48

  49. Interrupt Taxonomy (C.4) Synchronous vs. Asynchronous (HW error, I/O) User Request (exception?) vs. Coerced User maskable vs. Nonmaskable (Ignorable) Within vs. Between Instructions Resume vs. Terminate The difficult exceptions are resumable interrupts within instructions Save the state, correct the cause, restore the state, continue execution EE194/Comp140 Mark Hempstead 49

  50. Interrupt Taxonomy
  Exception Type | Sync vs. Async | User Request vs. Coerced | Maskable vs. Nonmaskable | Within vs. Between Insns | Resume vs. Terminate
  I/O Device Request | Async | Coerced | Nonmaskable | Between | Resume
  Invoke O/S | Sync | User | Nonmaskable | Between | Resume
  Tracing Instructions | Sync | User | Maskable | Between | Resume
  Breakpoint | Sync | User | Maskable | Between | Resume
  Arithmetic Overflow | Sync | Coerced | Maskable | Within | Resume
  Page Fault (not in main memory) | Sync | Coerced | Nonmaskable | Within | Resume
  Misaligned Memory | Sync | Coerced | Maskable | Within | Resume
  Memory Protection Violation | Sync | Coerced | Nonmaskable | Within | Resume
  Using Undefined Insns | Sync | Coerced | Nonmaskable | Within | Terminate
  Hardware/Power Failure | Async | Coerced | Nonmaskable | Within | Terminate
  EE194/Comp140 Mark Hempstead 50
