ECEC 194: High Performance Computer Architecture
These notes cover branch prediction, memory hierarchy, and caches in high-performance computer architecture. Topics include pipelined instructions, handling stalls, and addressing hazards with branches, along with the challenges of pipelining and of predicting branch outcomes across instruction cycles.
ECEC 194: High Performance Computer Architecture Spring 2016 Tufts University Instructor: Prof. Mark Hempstead mark@ece.tufts.edu Lecture 3: Review of Branch Prediction, Memory Hierarchy and Caches EE194/Comp140 Mark Hempstead 1
More pipelining and stalls
Consider a short assembly program:
load r2=mem[r1]
add r5=r3+r2
add r8=r6+r7
The architecture says that instructions are executed in order. The 2nd instruction uses the r2 written by the first instruction. EE194/Comp140 Mark Hempstead 2
Pipelined instructions
Program: load r2=mem[r1]; add r5=r3+r2; add r8=r6+r7; store mem[r9]=r10
[Pipeline diagram: each instruction passes through fetch, register read, execute/cache access, and register write in successive cycles. The load writes r2 only after its cache access, but the dependent add wants to read r2 before that write happens.]
Pipelining cheats! It launches instructions before the previous ones finish. Hazards occur when one computation uses the results of another; they expose our sleight of hand. EE 193 Joel Grodstein 3
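The stall logic described above can be sketched in a few lines of Python. This is a toy model, not the course's: the stage timing (fetch, read, execute, write, one cycle each) and the assumption that a register can be read in the same cycle it is written are simplifications chosen for illustration.

```python
# Toy in-order pipeline: an instruction's read stage must wait until every
# source register has been written by its producer (RAW hazard -> stall).
# Assumes write-then-read within one cycle, and one instruction issued per cycle.
def schedule(instrs):
    """instrs: list of (dest_reg, [src_regs]) in program order.
    Returns the cycle in which each instruction performs its register read."""
    write_cycle = {}              # reg -> cycle its value is written
    read_cycles = []
    cycle = 2                     # first instruction reads in cycle 2 (after fetch)
    for dest, srcs in instrs:
        # stall until every source has been written
        ready = max([write_cycle.get(r, 0) for r in srcs] + [cycle])
        read_cycles.append(ready)
        write_cycle[dest] = ready + 2   # read -> execute/mem -> write
        cycle = ready + 1               # next instruction reads one cycle later
    return read_cycles

# load r2=mem[r1]; add r5=r3+r2; add r8=r6+r7
prog = [("r2", ["r1"]), ("r5", ["r3", "r2"]), ("r8", ["r6", "r7"])]
print(schedule(prog))
```

The dependent add stalls until the load's write of r2; the second add has no dependence and slots in right behind it.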
Same problem with branches
load r2=mem[r1]
if (r2==0) goto L1
sub r8=r6-r7
L1: add r8=r8+r1
[Pipeline diagram: the branch must read r2, but the load has not yet written it, so the fetch after the branch cannot proceed.]
Same problem as before: we don't know whether to fetch the "sub" or the "add" until we know r2. The simplest solution is to stall, just like before. EE 193 Joel Grodstein 4
What prediction looks like
load r2=mem[r1]
if (r2==0) goto L1
sub r8=r6-r7
L1: add r8=r8+r1
[Pipeline diagram: the sub is fetched immediately after the branch, before the branch outcome is known.]
Let's predict that r2 ≠ 0, so we don't take the branch, and we will execute r8=r6-r7. We start executing it right away, before we know if we are really supposed to. What could go wrong with this? EE 193 Joel Grodstein 5
What prediction looks like
[Pipeline diagram: in cycle 5 the load finally delivers r2, after the speculated sub has already started executing.]
What could go wrong with this? What if we hit cycle 5 and find that r2=0? We must undo the sub, and do the add instead. EE 193 Joel Grodstein 6
What prediction looks like
load r2=mem[r1]
if (r2==0) goto L1
sub r8=r6-r7
L1: add r8=r8+r1
[Pipeline diagram: this time the taken path add r8=r8+r1 is fetched and reads r8.]
Easy, right? What if the load took longer, so we already wrote r8? EE 193 Joel Grodstein 7
What prediction looks like
[Pipeline diagram: the load misses in the cache and spends several cycles in "access cache"; meanwhile the speculated sub reaches write-back and writes r8.]
Now we have a cache miss and it takes longer for the load to finish. Problems with this scenario? We hit cycle 5 and find that r2=0. We want to cancel the "sub", but it already wrote r8! We'll talk about how to fix this later. EE 193 Joel Grodstein 8
Branch prediction
Guessing whether a branch will be taken is called branch prediction.
Bottom line: it's hard to fix a wrong prediction, so you had better get it right most of the time.
Now we'll look at branch prediction: hardware guesses the branch outcome, starts fetching from the guessed address, and somehow fixes things up on a mis-predict. EE194/Comp140 Mark Hempstead 9
People branch-predict too
Buy an engagement ring, predicting a yes answer. If you guessed wrong, then throw away the ring?
Pick which way to drive to work based on what the traffic was like yesterday. If today's backups are different, then make excuses when you get to work. EE194/Comp140 Mark Hempstead 10
Branch prediction is hard
"Making predictions is hard, especially when they're about things that haven't happened yet." - Yogi Berra
Corollary: branch prediction is hard. EE194/Comp140 Mark Hempstead 11
Branch Characteristics
Integer benchmarks: 14-16% of instructions are conditional branches. FP: 3-12%.
On average, 67% of conditional branches are taken: 60% of forward branches and 85% of backward branches. Why? Backward branches tend to be the termination condition of a loop. EE194/Comp140 Mark Hempstead 12
Simplest of predictors
A really simple predictor: 67% of conditional branches are taken, so assume branches are always taken.
Sounds nice, but being right 67% of the time isn't nearly good enough; the penalty for being wrong is usually far too high. Ditto for "assume not taken." EE194/Comp140 Mark Hempstead 13
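A back-of-envelope calculation shows why 67% accuracy hurts. The 15% branch frequency and 67% accuracy come from the slides; the 3-cycle mispredict penalty is an assumed figure for illustration.

```python
# Extra cycles per instruction lost to mispredicts:
#   branch frequency x mispredict rate x penalty (cycles).
# 15% branches and 67% accuracy are from the slides; the 3-cycle
# penalty is an assumption for this example.
def branch_cpi_overhead(branch_freq, mispredict_rate, penalty_cycles):
    return branch_freq * mispredict_rate * penalty_cycles

overhead = branch_cpi_overhead(0.15, 1 - 0.67, 3)
print(overhead)   # roughly 0.15 extra CPI -- a big hit on a CPI-near-1 machine
```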
Branch Prediction
Learn from the past, predict the future: record the past in a hardware structure.
Direction predictor (DIRP): maps a conditional-branch PC to a taken/not-taken (T/N) decision.
Individual conditional branches are often biased or weakly biased; 90%+ one way or the other is considered biased. Why? Loop back edges, and checks for uncommon conditions.
Branch history table (BHT): the simplest predictor. The PC indexes a table of single bits (0 = not taken, 1 = taken), no tags. Essentially: the branch will go the same way it went last time. Not shown: some way to update the BHT after a mis-predict.
[Diagram: PC bits [9:2] form the BHT index; bits [31:10] and [1:0] are ignored. The indexed bit supplies the prediction (T or NT).]
What about aliasing, two PCs with the same lower bits? No problem; it's just a prediction! EE194/Comp140 Mark Hempstead 14
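The indexing on the slide can be sketched in Python. The 256-entry size follows from the 8 index bits [9:2] shown in the diagram; the example PCs are made up to show aliasing.

```python
# BHT indexing per the slide: PC bits [9:2] index a 256-entry table;
# bits [1:0] are dropped because instructions are word-aligned, and
# bits [31:10] are simply ignored (no tags). Two PCs that agree in
# bits [9:2] alias to the same entry -- harmless, it's only a prediction.
def bht_index(pc):
    return (pc >> 2) & 0xFF   # 8 index bits -> 256 entries

print(bht_index(0x1000))                      # entry 0
print(bht_index(0x1000) == bht_index(0x2000)) # aliasing: same index bits
```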
Branch History Table (BHT)
Branch history table (BHT): the simplest direction predictor. The PC indexes a table of single bits (0 = N, 1 = T), no tags; it predicts that a branch will go the same way it went last time.
Problem: the inner-loop branch below.
for (i=0;i<100;i++)
  for (j=0;j<4;j++)
    // whatever
The predictor is wrong twice per inner loop, once at the beginning and once at the end: it changes its mind too quickly.
Time:       1  2  3  4  5  6  7  8  9  10 11 12
Prediction: N  T  T  T  N  T  T  T  N  T  T  T
Outcome:    T  T  T  N  T  T  T  N  T  T  T  N
Result:     W  C  C  W  W  C  C  W  W  C  C  W   (W = wrong, C = correct)
EE194/Comp140 Mark Hempstead 15
Two-Bit Saturating Counters (2bc)
Two-bit saturating counters (2bc) [Smith 1981]: replace each single-bit predictor with a counter, (0,1,2,3) = (N,n,t,T). This adds hysteresis: the predictor is forced to mis-predict twice before changing its mind. Result: one mispredict per inner-loop execution rather than two. + Improves our inner-loop problem. EE194/Comp140 Mark Hempstead 22
Two-bit Saturating Counters
[State diagram: four states. 11 "strongly taken" and 10 "weakly taken" predict taken; 01 "weakly not taken" and 00 "strongly not taken" predict not taken. A taken branch moves the counter toward 11, a not-taken branch toward 00; transitions saturate at both ends.]
The 2-bit FSM means a prediction must miss twice before it changes. N-bit predictors are possible, but beyond 2 bits there is not much benefit. ECEC 621 Mark Hempstead 23
Two-Bit Saturating Counters (2bc)
Walking the inner-loop branch through the 2bc predictor (states N, n, t, T; start in N):
Time:       1  2  3  4  5  6  7  8  9  10 11 12
State:      N  n  t  T  t  T  T  T  t  T  T  T
Prediction: N  N  T  T  T  T  T  T  T  T  T  T
Outcome:    T  T  T  N  T  T  T  N  T  T  T  N
Result:     W  W  C  W  C  C  C  W  C  C  C  W   (W = wrong, C = correct)
After the initial warm-up, the predictor mis-predicts only once per inner-loop execution (at the loop exit) instead of twice. EE194/Comp140 Mark Hempstead 24
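The same walk-through in Python. This sketch encodes the slide's FSM as a counter 0..3 = (N, n, t, T), predicting taken when the counter is 2 or more.

```python
# Two-bit saturating counter: must mispredict twice before the
# prediction flips; the counter saturates at 0 and 3.
def two_bit_predictor(outcomes, counter=0):
    results = []
    for outcome in outcomes:
        prediction = "T" if counter >= 2 else "N"
        results.append("Correct" if prediction == outcome else "Wrong")
        if outcome == "T":
            counter = min(counter + 1, 3)   # move toward strongly-taken
        else:
            counter = max(counter - 1, 0)   # move toward strongly-not-taken
    return results

print(two_bit_predictor(["T", "T", "T", "N"] * 3))
```

After warm-up, only the loop-exit branch mispredicts: one wrong per inner-loop trip instead of two.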
Intuition on how well this works
Simple predictors work well on branches that repeat the same behavior in streaks, and they work better when the streaks are longer.
They don't work well for random or data-dependent branches. Data dependence can be common in some applications, but it's evil for branch prediction. EE194/Comp140 Mark Hempstead 29
Accuracy of Different Schemes
[Chart: frequency of mispredictions (roughly 0% to 18%) across the benchmarks spice, gcc, fpppp, nasa7, doducd, matrix300, li, tomcatv, expresso, and eqntott, comparing three predictors: a 4,096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1,024-entry (2,2) correlating BHT. The unlimited-entry table is barely better than 4,096 entries: diminishing returns on table size.]
ECEC 621 Mark Hempstead 30
Branch prediction is hard
"Making predictions is hard, especially when they're about things that haven't happened yet." - Yogi Berra
Corollary: we're not going to find a perfect branch predictor. EE194/Comp140 Mark Hempstead 31
Hybrid Branch Predictors
Tournament predictors adaptively combine local and global predictors; different schemes work better for different branches.
[Diagram: a local predictor and a global predictor (each could be a 2-bit BHT) both produce predictions; a chooser predictor selects which one supplies the final taken/not-taken answer.]
ECEC 621 Mark Hempstead 32
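One plausible chooser implementation, sketched in Python. The details here are assumptions, not from the slide: a 2-bit counter per branch arbitrates between the two component predictors, training toward whichever has been more accurate.

```python
# Tournament chooser (assumed design): a 2-bit saturating counter
# selects between a local and a global prediction. When exactly one
# component is right, the counter moves toward trusting that component.
def choose_and_train(chooser, pred_local, pred_global, outcome):
    """chooser: 0..3; values < 2 trust the local predictor, >= 2 the global.
    Returns (final_prediction, updated_chooser)."""
    final = pred_local if chooser < 2 else pred_global
    local_ok, global_ok = pred_local == outcome, pred_global == outcome
    if global_ok and not local_ok:
        chooser = min(chooser + 1, 3)   # global was right, local wrong
    elif local_ok and not global_ok:
        chooser = max(chooser - 1, 0)   # local was right, global wrong
    return final, chooser

pred, ch = choose_and_train(0, "T", "N", "N")
print(pred, ch)   # local was trusted and wrong; chooser moves toward global
```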
Branch prediction costs
The BTB and BHT are big arrays that run every single cycle. They are on critical paths, and thus must run fast. They can be big power hogs, and they will never work well on data-dependent branches. Even so, for good single-stream performance a good branch predictor is essential. EE194/Comp140 Mark Hempstead 33
More costs
"When you come to a fork in the road, take it." - Yogi Berra
The central idea of branch prediction is that doing nothing is bad. Don't stall; instead, speculatively execute a (hopefully pretty good) prediction.
But executing instructions takes energy, and if you have to kill them, that energy was wasted. Unless your predictors are always right, branch prediction plus speculative execution always wastes some energy on flushed instructions. Nonetheless, the tradeoff is usually a good one. EE194/Comp140 Mark Hempstead 34
How far should we go?
How many transistors should we spend on branch prediction? The BP itself costs power and area, but does no computation. A BP is necessary for good single-stream performance, yet we already saw diminishing returns on bigger BHTs, and the difficult-to-predict branches are often data dependent; a better or bigger algorithm won't help much there.
It would be really nice if we just didn't care about single-stream performance. But we usually do. EE194/Comp140 Mark Hempstead 35
Takeaways
Speculation is necessary to get reasonable single-stream performance; waiting for all branches to be resolved is too slow.
Branch prediction is needed for speculation to work well, and good branch prediction is expensive and power hungry.
It's not nice to fool your branch predictor: irregular (e.g., data-dependent) branches make your code run really slow. EE194/Comp140 Mark Hempstead 36
BACKUP EE194/Comp140 Mark Hempstead 37
Interrupts Interrupts are hard! EE194/Comp140 Mark Hempstead 38
Types of interrupts
What kinds of things can go wrong in a program?
Floating-point errors: divide by 0, underflow, overflow
Integer errors: divide by 0
Memory errors: unaligned data access, illegal address, virtual memory fault
Instruction errors: illegal instruction, reserved instruction
Others: user-requested breakpoint, I/O error, ...
EE194/Comp140 Mark Hempstead 39
What do we do about it?
Terminate the program. Easy for the architect, not so nice for the user: your program crashes with no ability to recover and keep going.
Resume after fixing the problem. Harder for the architect, but it's usually what the user wants, and it is absolutely needed for, e.g., page faults and user traps for single stepping. EE194/Comp140 Mark Hempstead 40
Precise interrupts
An interrupt is precise if, after an instruction takes an interrupt, the machine state is such that:
All earlier instructions have completed.
The interrupting instruction seemingly never happened.
All later instructions seemingly never happened.
All interrupts are taken in program order.
Seems easy, right? It's not! EE194/Comp140 Mark Hempstead 41
Interrupts can happen backwards
Cycle:          1   2   3   4    5    6
LDW R3, R2, #3  IF  ID  EX  MEM  WB
Axx R4, R3, R5      IF  ID  EX   MEM  WB
Cycle 3: the Axx takes an illegal-instruction interrupt, since there is no Axx instruction.
Cycle 4: the LDW takes an unaligned-memory interrupt, since R2=0x1000 and you cannot read a word from address 0x1003.
But the later instruction raised its interrupt first! How do you solve this? Exceptions are only handled in WB. EE194/Comp140 Mark Hempstead 42
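The handle-in-WB rule can be sketched as follows. This is an assumed toy model, not from the slides: each in-flight instruction records any exception it raises, but the machine only takes the exception when the instruction reaches WB, and WB retires in program order.

```python
# Exceptions are recorded when raised but taken only at WB. Since WB
# is in program order, the LDW's fault is taken before the Axx's,
# even though the Axx detected its fault in an earlier cycle.
def first_exception(instrs):
    """instrs: list of (name, exception_or_None) in program order.
    Returns (name, exception) for the exception the machine takes."""
    for name, exc in instrs:        # walk in program (retirement) order
        if exc is not None:
            return name, exc
    return None

program = [("LDW", "unaligned access"), ("Axx", "illegal instruction")]
print(first_exception(program))
```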
Interrupts can still happen backwards!
Cycle:           1   2   3    4    5    6    7    8
ADD.D F3,F2,F1   IF  ID  EX1  EX2  EX3  EX4  EX5  WB
ADD   R4,R3,R5       IF  ID   EX   MEM  WB
The ADD.D and ADD are in separate pipelines; the floating-point pipe has more EX stages than the integer pipe.
Cycle 6: the ADD finishes and writes the register file.
Cycle 7: the ADD.D takes an overflow interrupt. But the earlier instruction already changed the RF! EE194/Comp140 Mark Hempstead 43
Branch-delay slots
Consider the code:
loop: stuff
BEQ R2, R4, #loop
LD R9, 30(R8)
The LD is a delay-slot instruction; MIPS always executes it after the branch, even if the branch is taken! What if the LD faults after the BEQ is taken? We must rerun it and then jump to loop. But we're not re-executing the BEQ, so there's no reason to jump to loop. Suddenly, restarting after an interrupt is harder. EE194/Comp140 Mark Hempstead 44
Summary of Exceptions
Precise interrupts are a headache! Delayed branches make them a bigger headache. Preview: OOO makes them a still-bigger headache, with more ways for something to write back earlier than the exception.
Some machines punt on the problem: precise exceptions only for the integer pipe, plus a special precise mode used for debugging (10x slower). EE194/Comp140 Mark Hempstead 45
Parting quote Democracy is the worst form of government there is except for all of the others. Winston Churchill Corollary: the same is true for imprecise interrupts. EE194/Comp140 Mark Hempstead 46
Backup EE194/Comp140 Mark Hempstead 47
Branch prediction methods
When is information about branches gathered and applied?
When the machine is designed.
When the program is compiled ("compile-time").
When a training run of the program is executed ("profile-based").
As the program is executing ("dynamic").
Section 3.3 in the textbook. ECEC 621 Mark Hempstead 48
Interrupt Taxonomy (C.4)
Synchronous vs. asynchronous (HW error, I/O)
User request (exception?) vs. coerced
User maskable (ignorable) vs. nonmaskable
Within vs. between instructions
Resume vs. terminate
The difficult exceptions are resumable interrupts within instructions: save the state, correct the cause, restore the state, continue execution. EE194/Comp140 Mark Hempstead 49
Interrupt Taxonomy
Exception type               Sync/Async  Request/Coerced  Maskable     Within/Between  Resume/Terminate
I/O device request           Async       Coerced          Nonmaskable  Between         Resume
Invoke O/S                   Sync        User request     Nonmaskable  Between         Resume
Tracing instructions         Sync        User request     Maskable     Between         Resume
Breakpoint                   Sync        User request     Maskable     Between         Resume
Arithmetic overflow          Sync        Coerced          Maskable     Within          Resume
Page fault (not in main m.)  Sync        Coerced          Nonmaskable  Within          Resume
Misaligned memory            Sync        Coerced          Maskable     Within          Resume
Mem. protection violation    Sync        Coerced          Nonmaskable  Within          Resume
Using undefined insns        Sync        Coerced          Nonmaskable  Within          Terminate
Hardware/power failure       Async       Coerced          Nonmaskable  Within          Terminate
EE194/Comp140 Mark Hempstead 50