Understanding Instruction Flow Techniques in High-IPC Processors

Slide Note
Embed
Share

Explore the intricate processes involved in optimizing instruction flow within high-IPC processors, tackling challenges such as control dependences, branch speculation, and branch direction prediction. Learn about the goals, impediments, branch types, and implementations that shape the efficient execution of instructions in modern computing systems.


Uploaded on Oct 05, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Instruction Flow Techniques ECE/CS 752 Fall 2017 Prof. Mikko H. Lipasti University of Wisconsin-Madison

  2. High-IPC Processor I-cache Instruction Flow Branch Predictor FETCH Instruction Buffer DECODE Memory Integer Floating-point Media Memory Data Flow EXECUTE Reorder Buffer (ROB) Register Data Flow COMMIT D-cache Store Queue Mikko Lipasti-University of Wisconsin 2

  3. Instruction Flow Techniques Instruction Flow and its Impediments Control Dependences Control Flow Speculation Branch Speculation Mis-speculation Recovery Branch Direction Prediction Static Prediction A brief history of dynamic branch prediction Branch Target Prediction High-bandwidth Fetch High-Frequency Fetch Mikko Lipasti-University of Wisconsin 3

  4. Goal and Impediments Goal of Instruction Flow Supply processor with maximum number of useful instructions every clock cycle Impediments Branches and jumps Finite I-Cache Capacity Bandwidth restrictions Mikko Lipasti-University of Wisconsin 4

  5. Branch Types and Implementation 1. Types of Branches A. Conditional or Unconditional B. Save PC? C. How is target computed? Single target (immediate, PC+immediate) Multiple targets (register) Branch Architectures A. Condition code or condition registers B. Register 2. Mikko Lipasti-University of Wisconsin 5

  6. Whats So Bad About Branches? Effects of Branches Fragmentation of I-Cache lines Need to determine branch direction Need to determine branch target Use up execution resources Pipeline drain/fill Mikko Lipasti-University of Wisconsin 6

  7. Whats So Bad About Branches? Problem: Fetch stalls until direction is determined Solutions: Minimize delay Move instructions determining branch condition away from branch (CC architecture) Make use of delay Non-speculative: Fill delay slots with useful safe instructions Execute both paths (eager execution) Speculative: Predict branch direction Mikko Lipasti-University of Wisconsin 7

  8. Whats So Bad About Branches? Problem: Fetch stalls until branch target is determined Solutions: Minimize delay Generate branch target early Make use of delay: Predict branch target Single target Multiple targets Mikko Lipasti-University of Wisconsin 8

  9. Control Dependences main: addi r2, r0, A addi r3, r0, B addi r4, r0, C BB 1 addi r5, r0, N add r10,r0, r0 bge r10,r5, end loop: lw r20, 0(r2) lw r21, 0(r3) BB 2 bge r20,r21,T1 sw r21, 0(r4) BB 3 b T2 T1: sw r20, 0(r4) BB 4 T2: addi r10,r10,1 addi r2, r2, 4 addi r3, r3, 4 BB 5 addi r4, r4, 4 blt r10,r5, loop end: BB 1 BB 2 BB 3 BB 4 BB 5 Control Flow Graph Shows possible paths of control flow through basic blocks Mikko Lipasti-University of Wisconsin 9

  10. Control Dependences main: addi r2, r0, A addi r3, r0, B addi r4, r0, C BB 1 addi r5, r0, N add r10,r0, r0 bge r10,r5, end loop: lw r20, 0(r2) lw r21, 0(r3) BB 2 bge r20,r21,T1 sw r21, 0(r4) BB 3 b T2 T1: sw r20, 0(r4) BB 4 T2: addi r10,r10,1 addi r2, r2, 4 addi r3, r3, 4 BB 5 addi r4, r4, 4 blt r10,r5, loop end: BB 1 BB 2 BB 3 BB 4 BB 5 Control Dependence Node B is CD on Node A if A determines whether B executes If path 1 from A to exit includes B, and path 2 does not, then B is control-dependent on A Mikko Lipasti-University of Wisconsin 10

  11. Program Control Flow Implicit Sequential Control Flow Static Program Representation Control Flow Graph (CFG) Nodes = basic blocks Edges = Control flow transfers Physical Program Layout Mapping of CFG to linear program memory Implied sequential control flow Dynamic Program Execution Traversal of the CFG nodes and edges (e.g. loops) Traversal dictated by branch conditions Dynamic Control Flow Deviates from sequential control flow Disrupts sequential fetching Can stall IF stage and reduce I-fetch bandwidth

  12. Program Control Flow Dynamic traversal of static CFG Mapping CFG to linear memory

  13. Limits on Instruction Level Parallelism (ILP) 1970: Flynn Weiss and Smith [1984] Sohi and Vajapeyam [1987] Tjaden and Flynn [1970] Tjaden and Flynn [1973] Uht [1986] Smith et al. [1989] Jouppi and Wall [1988] Johnson [1991] Acosta et al. [1986] Wedig [1982] Butler et al. [1991] Melvin and Patt [1991] Wall [1991] Kuck et al. [1972] Riseman and Foster [1972] Nicolau and Fisher [1984] 1.58 1.81 1.86 (Flynn s bottleneck) 1.96 2.00 2.00 2.40 2.50 2.79 3.00 5.8 6 7 (Jouppi disagreed) 8 51 (no control dependences) 90 (Fisher s optimism) Mikko Lipasti-University of Wisconsin 13

  14. Riseman and Fosters Study 1970: Flynn 1972: Riseman/Foster 7 benchmark programs on CDC-3600 Assume infinite machines Infinite memory and instruction stack Infinite register file Infinite functional units True dependencies only at dataflow limit If bounded to single basic block, speedup is 1.72 (Flynn s bottleneck) If one can bypass n branches (hypothetically), then: Branches Bypassed Speedup 0 1 2 8 32 128 1.72 2.72 3.62 7.21 14.8 24.4 51.2 Mikko Lipasti-University of Wisconsin 14

  15. Branch Prediction 1970: Flynn 1972: Riseman/Foster 1979: Smith Predictor Riseman & Foster showed potential But no idea how to reap benefit 1979: Jim Smith patents branch prediction at Control Data Predict current branch based on past history Today: virtually all processors use branch prediction Mikko Lipasti-University of Wisconsin 15

  16. Disruption of Sequential Control Flow Fetch Instruction/Decode Buffer Decode Dispatch Buffer Dispatch Reservation Stations Issue Branch Execute Finish Reorder/ Completion Buffer Complete Store Buffer Retire Mikko Lipasti-University of Wisconsin 16

  17. Branch Prediction Target address generation Target Speculation Access register: PC, General purpose register, Link register Perform calculation: +/- offset, autoincrement, autodecrement Condition resolution Condition speculation Access register: Condition code register, General purpose register Perform calculation: Comparison of data register(s) Mikko Lipasti-University of Wisconsin 17

  18. Target Address Generation Fetch Decode Buffer PC- rel. Decode Reg. ind. Dispatch Buffer Reg. ind. with offset Dispatch Reservation Stations Issue Branch Execute Finish Completion Buffer Complete Store Buffer Retire 18 Mikko Lipasti-University of Wisconsin

  19. Condition Resolution Fetch Decode Buffer Decode CC reg. Dispatch Buffer GP reg. value comp. Dispatch Reservation Stations Issue Branch Execute Finish Completion Buffer Complete Store Buffer Retire 19 Mikko Lipasti-University of Wisconsin

  20. Branch Instruction Speculation to I-cache Prediction FA-mux Spec. target PC(seq.) = FA (fetch address) PC(seq.) Fetch Branch Predictor (using a BTB) Spec. cond. Decode Buffer BTB update (target addr. and history) Decode Dispatch Buffer Dispatch Reservation Stations Issue Branch Execute Finish Completion Buffer Mikko Lipasti-University of Wisconsin 20

  21. Branch/Jump Target Prediction 0x0348 0101 (NTNT) 0x0612 Branch inst. Information Branch target address for predict. address (most recent) Branch Target Buffer: small cache in fetch stage Previously executed branches, address, taken history, target(s) Fetch stage compares current FA against BTB If match, use prediction If predict taken, use BTB target When branch executes, BTB is updated Optimization: Size of BTB: increases hit rate Prediction algorithm: increase accuracy of prediction Mikko Lipasti-University of Wisconsin 21

  22. Branch Prediction: Condition Speculation 1. Biased Not Taken Hardware prediction Does not affect ISA Not effective for loops Software Prediction Extra bit in each branch instruction Set to 0 for not taken Set to 1 for taken Bit set by compiler or user; can use profiling Static prediction, same behavior every time Prediction based on branch offset Positive offset: predict not taken Negative offset: predict taken Prediction based on dynamic history 2. 3. 4. Mikko Lipasti-University of Wisconsin 22

  23. UCB Study [Lee and Smith, 1984] Benchmarks used 26 programs (IBM 370, DEC PDP-11, CDC 6400) 6 workloads (4 IBM, 1 DEC, 1 CDC) Used trace-driven simulation Branch types Unconditional: always taken or always not taken Subroutine call: always taken Loop control: usually taken Decision: either way, if-then-else Computed goto: always taken, with changing target Supervisor call: always taken Execute: always taken (IBM 370) IBM1 IBM2 IBM3 IBM4 DEC T 0.640 0.657 NT 0.360 0.343 Mikko Lipasti-University of Wisconsin IBM1: compiler IBM2: cobol (business app) IBM3: scientific IBM4: supervisor (OS) CDC 0.778 0.222 Avg 0.676 0.324 0.704 0.296 0.540 0.460 0.738 0.262 23

  24. Branch Prediction Function Prediction function F(X1, X2, ) X1 opcode type X2 history Prediction effectiveness based on opcode only, or history IBM1 66 64 92 93 94 95 95 IBM2 69 64 95 97 97 97 97 IBM3 71 70 87 91 91 92 92 IBM4 55 54 80 83 84 84 84 DEC 80 74 97 98 98 98 98 CDC 78 78 82 91 94 95 96 Opcode only History 0 History 1 History 2 History 3 History 4 History 5 Mikko Lipasti-University of Wisconsin 24

  25. Example Prediction Algorithm Branch inst. Information Branch target address for predict. address T T TT T T TT NT T T N T N NN N TN TN T T N N Hardware table remembers last 2 branch outcomes History of past several branches encoded by FSM Current state used to generate prediction Results: Workload Accuracy IBM1 93 IBM2 97 IBM3 91 IBM4 83 DEC 98 CDC 91 Mikko Lipasti-University of Wisconsin 25

  26. IBM Study [Nair, 1992] Branch processing on the IBM RS/6000 Separate branch functional unit Overlap of branch instructions with other instructions Zero cycle branches Two causes for branch stalls Unresolved conditions Branches downstream too close to unresolved branches Investigated optimal FSM design for branch prediction Mikko Lipasti-University of Wisconsin 26

  27. Exhaustive Search for Optimal 2-bit Predictor There are 220 possible state machines of 2-bit predictors Some machines are uninteresting, pruning them out reduces the number of state machines to 5248 For each benchmark, determine prediction accuracy for all the predictor state machines Find optimal 2-bit predictor for each application Mikko Lipasti-University of Wisconsin 27

  28. Branch Prediction Implementation (PPC 604) to I-cache Prediction FA-mux Spec. target FA (fetch address) FA Fetch Branch Predictor Decode Buffer Decode Branch Predictor Update Dispatch Buffer Dispatch Reservation Stations FPU BRN SFX SFX CFX LS Issue Branch Execute Finish Completion Buffer Mikko Lipasti-University of Wisconsin 28

  29. BTAC and BHT Design (PPC 604) FA-mux FA I-cache FAR FA FA Branch Target Address Cache (BTAC) Branch History Table (BHT) +4 BTAC update Decode Buffer BHT update Decode BHT prediction Dispatch Buffer BTAC prediction Dispatch BTAC: - 64 entries - fully associative - hit => predict taken Reservation Stations FPU SFX CFX LS BRN SFX Issue Branch BHT: - 512 entries - direct mapped - 2-bit saturating counter history based prediction - overrides BTAC prediction Execute Finish Completion Buffer Mikko Lipasti-University of Wisconsin 29

  30. Branch Speculation T NT NT NT T T (TAG 2) NT NT NT NT T T T T (TAG 1) (TAG 3) Start new correct path Must remember the alternate (non-predicted) path Eliminate incorrect path Must ensure that the mis-speculated instructions produce no side effects Mikko Lipasti-University of Wisconsin 30

  31. Mis-speculation Recovery Start new correct path 1. Update PC with computed branch target (if predicted NT) 2. Update PC with sequential instruction address (if predicted T) 3. Can begin speculation again at next branch Eliminate incorrect path 1. Use tag(s) to deallocate ROB entries occupied by speculative instructions 2. Invalidate all instructions in the decode and dispatch buffers, as well as those in reservation stations Mikko Lipasti-University of Wisconsin 31

  32. Tracking Instructions Assign branch tags Allocated in circular order Instruction carries this tag throughout processor Often: track instruction groups Instructions managed in groups, max. one branch per group ROB structured as groups Leads to some inefficiency Simpler tracking of speculative instructions Mikko Lipasti-University of Wisconsin 32

  33. BTAC and BHT Design (PPC 604) Fairly simple, 5- stage machine from 1994 Many sources for PC redirect Lots of complexity Mikko Lipasti-University of Wisconsin 33

  34. Instruction Flow Techniques Instruction Flow and its Impediments Control Dependences Control Flow Speculation Branch Speculation Mis-speculation Recovery Branch Direction Prediction Static Prediction A brief history of dynamic branch prediction Branch Target Prediction High-bandwidth Fetch High-Frequency Fetch Mikko Lipasti-University of Wisconsin 34

  35. Static Branch Prediction Single-direction Always not-taken: Intel i486 Backwards Taken/Forward Not Taken Loop-closing branches Used as backup in Pentium Pro, II, III, 4 Heuristic-based: void * p = malloc (numBytes); if (p == NULL) errorHandlingFunction( ); Mikko Lipasti-University of Wisconsin 35

  36. Static Branch Prediction Heuristic Name Loop Branch If the branch target is back to the head of a loop, predict taken. Description If a branch compares a pointer with NULL, or if two pointers are compared, predict in the direction that corresponds to the pointer being not NULL, or the two pointers not being equal. Pointer If a branch is testing that an integer is less than zero, less than or equal to zero, or equal to a constant, predict in the direction that corresponds to the test evaluating to false. Opcode If the operand of the branch instruction is a register that gets used before being redefined in the successor block, predict that the branch goes to the successor block. Guard If a branch occurs inside a loop, and neither of the targets is the loop head, then predict that the branch does not go to the successor that is the loop exit. Loop Exit Loop Header Predict that the successor block of a branch that is a loop header or a loop pre-header is taken. If a successor block contains a subroutine call, predict that the branch goes to that successor block. Call If a successor block contains a store instruction, predict that the branch does not go to that successor block. Store If a successor block contains a return from subroutine instruction, predict that the branch does not go to that successor block. Return Heuristic-based: Ball/Larus Thomas Ball and James R. Larus. Branch Prediction for Free. ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 300-313, May 1993. Mikko Lipasti-University of Wisconsin 36

  37. Static Branch Prediction Profile-based 1. Instrument program binary 2. Run with representative (?) input set 3. Recompile program a. Annotate branches with hint bits, or b. Restructure code to match predict not-taken Best performance: 75-80% accuracy Mikko Lipasti-University of Wisconsin 37

  38. Dynamic Branch Prediction Main advantages: Learn branch behavior autonomously No compiler analysis, heuristics, or profiling Adapt to changing branch behavior Program phase changes branch behavior First proposed in 1979-1980 US Patent #4,370,711, Branch predictor using random access memory, James. E. Smith Continually refined since then Mikko Lipasti-University of Wisconsin 38

  39. Smith Predictor Hardware 2mk-bit counters Branch Address Updated Counter Value m Saturating Counter Increment/Decrement Branch Prediction Branch Outcome most significant bit Jim E. Smith. A Study of Branch Prediction Strategies. International Symposium on Computer Architecture, pages 135-148, May 1981 Widely employed: Intel Pentium, PowerPC 604, PowerPC 620, etc. Mikko Lipasti-University of Wisconsin 39

  40. Two-level Branch Prediction PHT 000000 000001 000010 000011 PC = 01011010010101 010110 010100 010101 010110 010111 1 0 BHR 0110 111110 111111 1 Branch Prediction BHR adds global branch history Provides more context Can differentiate multiple instances of the same static branch Can correlate behavior across multiple static branches Mikko Lipasti-University of Wisconsin 40

  41. Two-level Prediction: Local History PHT 0000000 0000001 0000010 0000011 PC = 01011010010101 BHT 0101110 000 001 010 011 100 101 110 111 0101100 0101101 0101110 0101111 0 1 110 0111110 0111111 0 Branch Prediction Detailed local history can be useful Mikko Lipasti-University of Wisconsin 41

  42. Local History Predictor Example Loop closing branches Must identify last instance Local history dedicates PHT entry to each instance 0111 entry predicts not taken Loop closing branch s history PHT 0000 11101110111011101110 0001 0010 0011 0100 0101 0110 0111 00 1000 1001 1010 1011 11 1100 1101 11 1110 11 1111 Mikko Lipasti-University of Wisconsin 42

  43. Two-level Taxonomy 1970: Flynn 1972: Riseman/Foster Based on indices for branch history and pattern history BHR: {G,P,S}: {Global, Per-address, Set} PHT: {g,p,s}: {Global, Per-address, Set} 9 combinations: GAg, GAp, GAs, PAg, PAp, PAs, SAg, SAp and SAs Tse-Yu Yeh and Yale N. Patt. Two-Level Adaptive Branch Prediction. International Symposium on Microarchitecture, pages 51-61, November 1991. 1979: Smith Predictor 1991: Two-level prediction Mikko Lipasti-University of Wisconsin 43

  44. Index Sharing in Two-level Predictors gshare 1970: Flynn 1972: Riseman/Foster GAp 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111 BHR PC 1101 0110 1979: Smith Predictor 1101 BHR 1001 0110 PC XOR 1011 1991: Two-level prediction 1993: gshare, tournament 1001 BHR BHR PC 1001 1010 1001 1010 PC XOR 0011 Use XOR function to achieve better utilization of PHT Scott McFarling. Combining Branch Predictors. TN-36, Digital Equipment Corporation Western Research Laboratory, June 1993. Used in e.g. IBM Power 4, Alpha 21264 Mikko Lipasti-University of Wisconsin 44

  45. Sources of Mispredictions Lack of history (training time) Randomized behavior Usually due to randomized input data (benchmarks) Surprisingly few branches depend on input data values BHR capacity Correlate to branch that already shifted out E.g. loop count > BHR width PHT capacity Aliasing/interference Positive Negative Mikko Lipasti-University of Wisconsin 45

  46. Reducing Interference Compulsory aliasing (cold miss) Not important (less than 1%) Only remedy is to set appropriate initial value Also: beware indexing schemes with high training cost (e.g. very long branch history) Capacity aliasing (capacity miss) Increase PHT size Conflict aliasing (conflict miss) Change indexing scheme or partition PHT in a clever fashion Mikko Lipasti-University of Wisconsin 46

  47. Bi-Mode Predictor Branch Address choice predictor PHT0 PHT1 Global BHR XOR PHT partitioned into T/NT halves Selector chooses source Reduces negative interference, since most entries in PHT0 tend towards NT, and most entries in PHT1 tend towards T Used by ARM Cortex-A15 Final Prediction Mikko Lipasti-University of Wisconsin 47

  48. gskewed Predictor PHT0 PHT1 PHT2 Branch Address f0 Global BHR f1 f2 Majority Final Prediction Multiple PHT banks indexed by different hash functions Conflicting branch pair unlikely to conflict in more than one PHT Majority vote determines prediction Used in Alpha EV8 (ultimately cancelled) P. Michaud, A. Seznec, and R. Uhlig. Trading Conflict and Capacity Aliasing in Conditional Branch Predictors. ISCA-24, June 1997 Mikko Lipasti-University of Wisconsin 48

  49. Agree Predictor biasing bits Branch Address Global BHR XOR PHT 1 0 Prediction 1 = agree with bias bit 0 = disagree Same principle as bi-mode PHT records whether branch bias matches outcome Exploits 70-80% static predictability Used in HP PA-8700 E. Sprangle, R. S. Chappell, M. Alsup, and Y. N. Patt. The Agree Predictor: A Mechanism for Reducing Negative Branch History Interference. ISCA-24, June 1997. Mikko Lipasti-University of Wisconsin 49

  50. YAGS Predictor 1970: Flynn 1972: Riseman/Foster Branch Address choice PHT Global BHR 1979: Smith Predictor XOR T-cache NT-cache 1991: Two-level prediction 1993: gshare, tournament Partial Tag 2bC Partial Tag 2bC 1998: Cache exceptions Based on bi-mode T/NT PHTs cache only the exceptions A. N. Eden and T. N. Mudge. The YAGS Branch Prediction Scheme. MICRO, Dec 1998. = = 0 1 0 1 0 1 T/NT-cache hit? Final Prediction Mikko Lipasti-University of Wisconsin 50

Related