ECEC 194: High Performance Computer Architecture
These notes cover branch prediction, memory hierarchy, and caches in high-performance computer architecture. Topics include pipelined instructions, handling stalls, and addressing hazards with branches, along with the challenges of pipelining and of predicting branch outcomes across instruction cycles.
ECEC 194: High Performance Computer Architecture Spring 2016 Tufts University Instructor: Prof. Mark Hempstead mark@ece.tufts.edu Lecture 3: Review of Branch Prediction, Memory Hierarchy and Caches EE194/Comp140 Mark Hempstead 1
More pipelining and stalls
Consider a short assembly program:
load r2=mem[r1]
add r5=r3+r2
add r8=r6+r7
The architecture says that instructions are executed in order. The 2nd instruction uses the r2 written by the first instruction. EE194/Comp140 Mark Hempstead 2
Pipelined instructions
Program: load r2=mem[r1]; add r5=r3+r2; add r8=r6+r7; store mem[r9]=r10
[Pipeline diagram: each instruction passes through fetch, register read, execute/cache access, and register write in successive cycles. The load writes r2 only after its cache access, but the dependent add wants to read r2 before that write happens.]
Pipelining cheats! It launches instructions before the previous ones finish. Hazards occur when one computation uses the results of another; they expose our sleight of hand. EE 193 Joel Grodstein 3
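The stall logic described above can be sketched in a few lines of Python. This is a toy model, not the course's: the stage timing (fetch, read, execute, write, one cycle each) and the assumption that a register can be read in the same cycle it is written are simplifications chosen for illustration.

```python
# Toy in-order pipeline: an instruction's read stage must wait until every
# source register has been written by its producer (RAW hazard -> stall).
# Assumes write-then-read within one cycle, and one instruction issued per cycle.
def schedule(instrs):
    """instrs: list of (dest_reg, [src_regs]) in program order.
    Returns the cycle in which each instruction performs its register read."""
    write_cycle = {}              # reg -> cycle its value is written
    read_cycles = []
    cycle = 2                     # first instruction reads in cycle 2 (after fetch)
    for dest, srcs in instrs:
        # stall until every source has been written
        ready = max([write_cycle.get(r, 0) for r in srcs] + [cycle])
        read_cycles.append(ready)
        write_cycle[dest] = ready + 2   # read -> execute/mem -> write
        cycle = ready + 1               # next instruction reads one cycle later
    return read_cycles

# load r2=mem[r1]; add r5=r3+r2; add r8=r6+r7
prog = [("r2", ["r1"]), ("r5", ["r3", "r2"]), ("r8", ["r6", "r7"])]
print(schedule(prog))
```

The dependent add stalls until the load's write of r2; the second add has no dependence and slots in right behind it.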
Same problem with branches
load r2=mem[r1]
if (r2==0) goto L1
sub r8=r6-r7
L1: add r8=r8+r1
[Pipeline diagram: the branch must read r2, but the load has not yet written it, so the fetch after the branch cannot proceed.]
Same problem as before: we don't know whether to fetch the "sub" or the "add" until we know r2. The simplest solution is to stall, just like before. EE 193 Joel Grodstein 4
What prediction looks like
load r2=mem[r1]
if (r2==0) goto L1
sub r8=r6-r7
L1: add r8=r8+r1
[Pipeline diagram: the sub is fetched immediately after the branch, before the branch outcome is known.]
Let's predict that r2 ≠ 0, so we don't take the branch, and we will execute r8=r6-r7. We start executing it right away, before we know if we are really supposed to. What could go wrong with this? EE 193 Joel Grodstein 5
What prediction looks like
[Pipeline diagram: in cycle 5 the load finally delivers r2, after the speculated sub has already started executing.]
What could go wrong with this? What if we hit cycle 5 and find that r2=0? We must undo the sub, and do the add instead. EE 193 Joel Grodstein 6
What prediction looks like
load r2=mem[r1]
if (r2==0) goto L1
sub r8=r6-r7
L1: add r8=r8+r1
[Pipeline diagram: this time the taken path add r8=r8+r1 is fetched and reads r8.]
Easy, right? What if the load took longer, so we already wrote r8? EE 193 Joel Grodstein 7
What prediction looks like
[Pipeline diagram: the load misses in the cache and spends several cycles in "access cache"; meanwhile the speculated sub reaches write-back and writes r8.]
Now we have a cache miss and it takes longer for the load to finish. Problems with this scenario? We hit cycle 5 and find that r2=0. We want to cancel the "sub", but it already wrote r8! We'll talk about how to fix this later. EE 193 Joel Grodstein 8
Branch prediction
Guessing whether a branch will be taken is called branch prediction.
Bottom line: it's hard to fix a wrong prediction, so you had better get it right most of the time.
Now we'll look at branch prediction: hardware guesses the branch outcome, starts fetching from the guessed address, and somehow fixes things up on a mis-predict. EE194/Comp140 Mark Hempstead 9
People branch-predict too
Buy an engagement ring, predicting a yes answer. If you guessed wrong, then throw away the ring?
Pick which way to drive to work based on what the traffic was like yesterday. If today's backups are different, then make excuses when you get to work. EE194/Comp140 Mark Hempstead 10
Branch prediction is hard
"Making predictions is hard, especially when they're about things that haven't happened yet." - Yogi Berra
Corollary: branch prediction is hard. EE194/Comp140 Mark Hempstead 11
Branch Characteristics
Integer benchmarks: 14-16% of instructions are conditional branches. FP: 3-12%.
On average, 67% of conditional branches are taken: 60% of forward branches and 85% of backward branches. Why? Backward branches tend to be the termination condition of a loop. EE194/Comp140 Mark Hempstead 12
Simplest of predictors
A really simple predictor: 67% of conditional branches are taken, so assume branches are always taken.
Sounds nice, but being right 67% of the time isn't nearly good enough; the penalty for being wrong is usually far too high. Ditto for "assume not taken." EE194/Comp140 Mark Hempstead 13
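A back-of-envelope calculation shows why 67% accuracy hurts. The 15% branch frequency and 67% accuracy come from the slides; the 3-cycle mispredict penalty is an assumed figure for illustration.

```python
# Extra cycles per instruction lost to mispredicts:
#   branch frequency x mispredict rate x penalty (cycles).
# 15% branches and 67% accuracy are from the slides; the 3-cycle
# penalty is an assumption for this example.
def branch_cpi_overhead(branch_freq, mispredict_rate, penalty_cycles):
    return branch_freq * mispredict_rate * penalty_cycles

overhead = branch_cpi_overhead(0.15, 1 - 0.67, 3)
print(overhead)   # roughly 0.15 extra CPI -- a big hit on a CPI-near-1 machine
```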
Branch Prediction
Learn from the past, predict the future: record the past in a hardware structure.
Direction predictor (DIRP): maps a conditional-branch PC to a taken/not-taken (T/N) decision.
Individual conditional branches are often biased or weakly biased; 90%+ one way or the other is considered biased. Why? Loop back edges, and checks for uncommon conditions.
Branch history table (BHT): the simplest predictor. The PC indexes a table of single bits (0 = not taken, 1 = taken), no tags. Essentially: the branch will go the same way it went last time. Not shown: some way to update the BHT after a mis-predict.
[Diagram: PC bits [9:2] form the BHT index; bits [31:10] and [1:0] are ignored. The indexed bit supplies the prediction (T or NT).]
What about aliasing, two PCs with the same lower bits? No problem; it's just a prediction! EE194/Comp140 Mark Hempstead 14
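The indexing on the slide can be sketched in Python. The 256-entry size follows from the 8 index bits [9:2] shown in the diagram; the example PCs are made up to show aliasing.

```python
# BHT indexing per the slide: PC bits [9:2] index a 256-entry table;
# bits [1:0] are dropped because instructions are word-aligned, and
# bits [31:10] are simply ignored (no tags). Two PCs that agree in
# bits [9:2] alias to the same entry -- harmless, it's only a prediction.
def bht_index(pc):
    return (pc >> 2) & 0xFF   # 8 index bits -> 256 entries

print(bht_index(0x1000))                      # entry 0
print(bht_index(0x1000) == bht_index(0x2000)) # aliasing: same index bits
```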
Branch History Table (BHT)
Branch history table (BHT): the simplest direction predictor. The PC indexes a table of single bits (0 = N, 1 = T), no tags; it predicts that a branch will go the same way it went last time.
Problem: the inner-loop branch below.
for (i=0;i<100;i++)
  for (j=0;j<4;j++)
    // whatever
The predictor is wrong twice per inner loop, once at the beginning and once at the end: it changes its mind too quickly.
Time:       1  2  3  4  5  6  7  8  9  10 11 12
Prediction: N  T  T  T  N  T  T  T  N  T  T  T
Outcome:    T  T  T  N  T  T  T  N  T  T  T  N
Result:     W  C  C  W  W  C  C  W  W  C  C  W   (W = wrong, C = correct)
EE194/Comp140 Mark Hempstead 15
Two-Bit Saturating Counters (2bc)
Two-bit saturating counters (2bc) [Smith 1981]: replace each single-bit predictor with a counter, (0,1,2,3) = (N,n,t,T). This adds hysteresis: the predictor is forced to mis-predict twice before changing its mind. Result: one mispredict per inner-loop execution rather than two. + Improves our inner-loop problem. EE194/Comp140 Mark Hempstead 22
Two-bit Saturating Counters
[State diagram: four states. 11 "strongly taken" and 10 "weakly taken" predict taken; 01 "weakly not taken" and 00 "strongly not taken" predict not taken. A taken branch moves the counter toward 11, a not-taken branch toward 00; transitions saturate at both ends.]
The 2-bit FSM means a prediction must miss twice before it changes. N-bit predictors are possible, but beyond 2 bits there is not much benefit. ECEC 621 Mark Hempstead 23
Two-Bit Saturating Counters (2bc)
Walking the inner-loop branch through the 2bc predictor (states N, n, t, T; start in N):
Time:       1  2  3  4  5  6  7  8  9  10 11 12
State:      N  n  t  T  t  T  T  T  t  T  T  T
Prediction: N  N  T  T  T  T  T  T  T  T  T  T
Outcome:    T  T  T  N  T  T  T  N  T  T  T  N
Result:     W  W  C  W  C  C  C  W  C  C  C  W   (W = wrong, C = correct)
After the initial warm-up, the predictor mis-predicts only once per inner-loop execution (at the loop exit) instead of twice. EE194/Comp140 Mark Hempstead 24
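The same walk-through in Python. This sketch encodes the slide's FSM as a counter 0..3 = (N, n, t, T), predicting taken when the counter is 2 or more.

```python
# Two-bit saturating counter: must mispredict twice before the
# prediction flips; the counter saturates at 0 and 3.
def two_bit_predictor(outcomes, counter=0):
    results = []
    for outcome in outcomes:
        prediction = "T" if counter >= 2 else "N"
        results.append("Correct" if prediction == outcome else "Wrong")
        if outcome == "T":
            counter = min(counter + 1, 3)   # move toward strongly-taken
        else:
            counter = max(counter - 1, 0)   # move toward strongly-not-taken
    return results

print(two_bit_predictor(["T", "T", "T", "N"] * 3))
```

After warm-up, only the loop-exit branch mispredicts: one wrong per inner-loop trip instead of two.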
Intuition on how well this works
Simple predictors work well on branches that repeat the same behavior in streaks, and they work better when the streaks are longer.
They don't work well for random or data-dependent branches. Data dependence can be common in some applications, but it's evil for branch prediction. EE194/Comp140 Mark Hempstead 29
Accuracy of Different Schemes
[Chart: frequency of mispredictions (roughly 0% to 18%) across the benchmarks spice, gcc, fpppp, nasa7, doducd, matrix300, li, tomcatv, expresso, and eqntott, comparing three predictors: a 4,096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1,024-entry (2,2) correlating BHT. The unlimited-entry table is barely better than 4,096 entries: diminishing returns on table size.]
ECEC 621 Mark Hempstead 30
Branch prediction is hard
"Making predictions is hard, especially when they're about things that haven't happened yet." - Yogi Berra
Corollary: we're not going to find a perfect branch predictor. EE194/Comp140 Mark Hempstead 31
Hybrid Branch Predictors
Tournament predictors adaptively combine local and global predictors; different schemes work better for different branches.
[Diagram: a local predictor and a global predictor (each could be a 2-bit BHT) both produce predictions; a chooser predictor selects which one supplies the final taken/not-taken answer.]
ECEC 621 Mark Hempstead 32
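One plausible chooser implementation, sketched in Python. The details here are assumptions, not from the slide: a 2-bit counter per branch arbitrates between the two component predictors, training toward whichever has been more accurate.

```python
# Tournament chooser (assumed design): a 2-bit saturating counter
# selects between a local and a global prediction. When exactly one
# component is right, the counter moves toward trusting that component.
def choose_and_train(chooser, pred_local, pred_global, outcome):
    """chooser: 0..3; values < 2 trust the local predictor, >= 2 the global.
    Returns (final_prediction, updated_chooser)."""
    final = pred_local if chooser < 2 else pred_global
    local_ok, global_ok = pred_local == outcome, pred_global == outcome
    if global_ok and not local_ok:
        chooser = min(chooser + 1, 3)   # global was right, local wrong
    elif local_ok and not global_ok:
        chooser = max(chooser - 1, 0)   # local was right, global wrong
    return final, chooser

pred, ch = choose_and_train(0, "T", "N", "N")
print(pred, ch)   # local was trusted and wrong; chooser moves toward global
```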
Branch prediction costs
The BTB and BHT are big arrays that run every single cycle. They are on critical paths, and thus must run fast. They can be big power hogs, and they will never work well on data-dependent branches. Even so, for good single-stream performance a good branch predictor is essential. EE194/Comp140 Mark Hempstead 33
More costs
"When you come to a fork in the road, take it." - Yogi Berra
The central idea of branch prediction is that doing nothing is bad. Don't stall; instead, speculatively execute a (hopefully pretty good) prediction.
But executing instructions takes energy, and if you have to kill them, that energy was wasted. Unless your predictors are always right, branch prediction plus speculative execution always wastes some energy on flushed instructions. Nonetheless, the tradeoff is usually a good one. EE194/Comp140 Mark Hempstead 34
How far should we go?
How many transistors should we spend on branch prediction? The BP itself costs power and area, but does no computation. A BP is necessary for good single-stream performance, yet we already saw diminishing returns on bigger BHTs, and the difficult-to-predict branches are often data dependent; a better or bigger algorithm won't help much there.
It would be really nice if we just didn't care about single-stream performance. But we usually do. EE194/Comp140 Mark Hempstead 35
Takeaways
Speculation is necessary to get reasonable single-stream performance; waiting for all branches to be resolved is too slow.
Branch prediction is needed for speculation to work well, and good branch prediction is expensive and power hungry.
It's not nice to fool your branch predictor: irregular (e.g., data-dependent) branches make your code run really slow. EE194/Comp140 Mark Hempstead 36
BACKUP EE194/Comp140 Mark Hempstead 37
Interrupts Interrupts are hard! EE194/Comp140 Mark Hempstead 38
Types of interrupts
What kinds of things can go wrong in a program?
Floating-point errors: divide by 0, underflow, overflow
Integer errors: divide by 0
Memory errors: unaligned data access, illegal address, virtual memory fault
Instruction errors: illegal instruction, reserved instruction
Others: user-requested breakpoint, I/O error, ...
EE194/Comp140 Mark Hempstead 39
What do we do about it?
Terminate the program. Easy for the architect, not so nice for the user: your program crashes with no ability to recover and keep going.
Resume after fixing the problem. Harder for the architect, but it's usually what the user wants, and it is absolutely needed for, e.g., page faults and user traps for single stepping. EE194/Comp140 Mark Hempstead 40
Precise interrupts
An interrupt is precise if, after an instruction takes an interrupt, the machine state is such that:
All earlier instructions have completed.
The interrupting instruction seemingly never happened.
All later instructions seemingly never happened.
All interrupts are taken in program order.
Seems easy, right? It's not! EE194/Comp140 Mark Hempstead 41
Interrupts can happen backwards
Cycle:          1   2   3   4    5    6
LDW R3, R2, #3  IF  ID  EX  MEM  WB
Axx R4, R3, R5      IF  ID  EX   MEM  WB
Cycle 3: the Axx takes an illegal-instruction interrupt, since there is no Axx instruction.
Cycle 4: the LDW takes an unaligned-memory interrupt, since R2=0x1000 and you cannot read a word from address 0x1003.
But the later instruction raised its interrupt first! How do you solve this? Exceptions are only handled in WB. EE194/Comp140 Mark Hempstead 42
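The handle-in-WB rule can be sketched as follows. This is an assumed toy model, not from the slides: each in-flight instruction records any exception it raises, but the machine only takes the exception when the instruction reaches WB, and WB retires in program order.

```python
# Exceptions are recorded when raised but taken only at WB. Since WB
# is in program order, the LDW's fault is taken before the Axx's,
# even though the Axx detected its fault in an earlier cycle.
def first_exception(instrs):
    """instrs: list of (name, exception_or_None) in program order.
    Returns (name, exception) for the exception the machine takes."""
    for name, exc in instrs:        # walk in program (retirement) order
        if exc is not None:
            return name, exc
    return None

program = [("LDW", "unaligned access"), ("Axx", "illegal instruction")]
print(first_exception(program))
```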
Interrupts can still happen backwards!
Cycle:           1   2   3    4    5    6    7    8
ADD.D F3,F2,F1   IF  ID  EX1  EX2  EX3  EX4  EX5  WB
ADD   R4,R3,R5       IF  ID   EX   MEM  WB
The ADD.D and ADD are in separate pipelines; the floating-point pipe has more EX stages than the integer pipe.
Cycle 6: the ADD finishes and writes the register file.
Cycle 7: the ADD.D takes an overflow interrupt. But the earlier instruction already changed the RF! EE194/Comp140 Mark Hempstead 43
Branch-delay slots
Consider the code:
loop: stuff
BEQ R2, R4, #loop
LD R9, 30(R8)
The LD is a delay-slot instruction; MIPS always executes it after the branch, even if the branch is taken! What if the LD faults after the BEQ is taken? We must rerun it and then jump to loop. But we're not re-executing the BEQ, so there's no reason to jump to loop. Suddenly, restarting after an interrupt is harder. EE194/Comp140 Mark Hempstead 44
Summary of Exceptions
Precise interrupts are a headache! Delayed branches make them a bigger headache. Preview: OOO makes them a still-bigger headache, with more ways for something to write back earlier than the exception.
Some machines punt on the problem: precise exceptions only for the integer pipe, plus a special precise mode used for debugging (10x slower). EE194/Comp140 Mark Hempstead 45
Parting quote Democracy is the worst form of government there is except for all of the others. Winston Churchill Corollary: the same is true for imprecise interrupts. EE194/Comp140 Mark Hempstead 46
Backup EE194/Comp140 Mark Hempstead 47
Branch prediction methods
When is information about branches gathered and applied?
When the machine is designed.
When the program is compiled ("compile-time").
When a training run of the program is executed ("profile-based").
As the program is executing ("dynamic").
Section 3.3 in the textbook. ECEC 621 Mark Hempstead 48
Interrupt Taxonomy (C.4)
Synchronous vs. asynchronous (HW error, I/O)
User request (exception?) vs. coerced
User maskable (ignorable) vs. nonmaskable
Within vs. between instructions
Resume vs. terminate
The difficult exceptions are resumable interrupts within instructions: save the state, correct the cause, restore the state, continue execution. EE194/Comp140 Mark Hempstead 49
Interrupt Taxonomy
Exception type               Sync/Async  Request/Coerced  Maskable     Within/Between  Resume/Terminate
I/O device request           Async       Coerced          Nonmaskable  Between         Resume
Invoke O/S                   Sync        User request     Nonmaskable  Between         Resume
Tracing instructions         Sync        User request     Maskable     Between         Resume
Breakpoint                   Sync        User request     Maskable     Between         Resume
Arithmetic overflow          Sync        Coerced          Maskable     Within          Resume
Page fault (not in main m.)  Sync        Coerced          Nonmaskable  Within          Resume
Misaligned memory            Sync        Coerced          Maskable     Within          Resume
Mem. protection violation    Sync        Coerced          Nonmaskable  Within          Resume
Using undefined insns        Sync        Coerced          Nonmaskable  Within          Terminate
Hardware/power failure       Async       Coerced          Nonmaskable  Within          Terminate
EE194/Comp140 Mark Hempstead 50