Advanced Computer Architecture Concepts

slide1 n.w
1 / 27
Embed
Share

Delve into the intricate world of computer architecture with topics covering pipeline stages, branch prediction, instruction execution, and more. Explore the complexities and optimizations crucial for efficient processor design.

  • Computer Architecture
  • Pipeline Stages
  • Branch Prediction
  • Instruction Execution
  • Processor Design

Uploaded on | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. TIME C2 C7 C9 C1 C4 C8 C3 C5 C6 I1 branch decode exec mem wb bubble bubble bubble bubble bubble bubble bubble bubble bubble bubble fetch decode exec mem wb I4 I5 fetch decode exec mem wb Redirected fetch

  2. TIME C2 C7 C9 C1 C4 C8 C3 C5 C6 I1 branch decode exec mem wb bubble bubble bubble bubble bubble fetch decode exec mem wb I2 I3 fetch decode exec mem wb Redirected fetch

  3. Predict PC + 4 Resolve if branch Resolve if non-branch TIME C2 C7 C9 C1 C4 C8 C3 C5 C6 I1 fetch decode exec mem wb I2 fetch decode exec mem wb I3 fetch decode exec mem wb fetch decode exec mem wb I4 I5 fetch decode exec mem wb

  4. Predict PC + 4 Resolve next PC != PC + 4 TIME C2 C7 C9 C1 C4 C8 C3 C5 C6 I1 branch decode exec mem squashed wb I2 fetch decode bubble bubble bubble I3 fetch bubble bubble bubble bubble fetch decode exec mem wb I4 I5 fetch decode exec mem wb Redirected fetch

  5. do { } while (i < 100); if (a[i] != 0) some computation i++; DOWHILE: load in r10 a[i] SKIP: beq some computation r10, r0, SKIP some computation addi r11, r11, 1 blt r11, r12, DOWHILE

  6. Instruction opcode available Calculate Taken PC C2 C1 C3 Select PC+4 or Taken PC I1 fetch cache N decode exec S F? FETCH fetch decode

  7. ITERATION 1 ITERATION 2 DOWHILE: load in r10 a[i] DOWHILE: SKIP: load in r10 a[i] beq some computation FIRST TIME SEEN LEARN NOT TAKEN PREDICT NOT TAKEN r10, r0, SKIP some computation addi blt r11, r11, 1 r11, r12, DOWHILE FIRST TIME SEEN MISPREDICTION PREDICT NOT TAKEN LEARN TAKEN SEEN BEFORE NOT TAKEN PREDICT SAME AS LAST TIME : LEARN NOT TAKEN SKIP: beq some computation r10, r0, SKIP some computation addi blt r11, r11, 1 r11, r12, DOWHILE SEEN BEFORE PREDICT TAKEN PREDICT SAME AS LAST TIME : LEARN NOT TAKEN

  8. V PC N V PC N V PC N

  9. ITERATION 1 DOWHILE: 0x100 SKIP: 0x200 before after 0 0 PC PC 0 0 1 0 0x100 PC 0 0 load in r10 a[i] beq r10, r0, SKIP some computation 0 PC 0 0 PC 0 Predict not taken (default) 1 0 0x100 PC 0 0 1 1 0x100 0x200 0 1 addi r11, r11, 1 blt r11, r12, DOWHILE 0 PC 0 0 PC 0 Predict not taken (default) ITERATION 2 DOWHILE: 0x100 SKIP: 0x200 1 0 0x100 PC 0 0 1 1 0x100 0x200 0 1 load in r10 a[i] beq r10, r0, SKIP some computation 0 PC 0 0 PC 0 Predict not taken (table) addi r11, r11, 1 blt r11, r12, DOWHILE 1 0 0x100 PC 0 0 1 1 0x100 0x200 0 1 0 PC 0 0 PC 0 Predict taken (table)

  10. Accuracy Accuracy = # correct predictions / # all Predictions 100 beq all not taken 100 blt 1 not taken at the end Predictions: Beq: all not taken default Blt: first not taken wrong (default), last taken wrong Accuracy = 100 + 98 / 200 = 99%

  11. How big this needs to be? V PC N V PC N V PC N 4G addresses, 4 bytes per instruction, aligned 1G possible branches 1G entries, each 4 bytes (PC), 2 bits (V & N) TOO LARGE

  12. How big this needs to be? V PC N V PC N V PC N But if we had 1G entries we have 1-to-1 mapping of PC to entry: V N V N 1G V N No need for PC

  13. PC 00 PC 00 V N V N h() V N V N 1G Few entries V N V N PC 00 N h() N Few entries N

  14. PC 00 T T T h() NT 00 01 10 11 T 01 01 NT NT NT 10

  15. movi r18, 3 # max i movi r19, 2 # max j movi r8, 0 DOi: movi r9, 0 DOj: some computation addi r9, r9, 1 blt r9, r19, DOj #J branch addi r8, r8, 1 blt r8, r18, DOi # I branch #i = 0 # j = 0 T T NT T T NT T T NT T (11) T(11) T(10) T(11) T (11) T(10) T(11) T (11) T(10)

  16. older younger history PC 0 0 00 h() 0 0 0

  17. movi r9, 0 DOj: DOj: DOj: some computation addi r9, r9, 1 blt r9, r19, DOj movi r9, 0 history prediction Pattern learned 0 0 1 0 0 0 PC 00 some computation addi r9, r9, 1 blt r9, r19, DOj movi r9, 0 history prediction 1 0 1 1 0 0 PC 00 some computation addi r9, r9, 1 blt r9, r19, DOj history prediction 1 1 0 1 1 0 PC 00

  18. movi r9, 0 DOj: DOj: DOj: some computation addi r9, r9, 1 blt r9, r19, DOj movi r9, 0 history prediction Pattern learned 01 1 01 0 PC 00 Learned thus far some computation addi r9, r9, 1 blt r9, r19, DOj movi r9, 0 1 0 1 1 1 0 0 1 1 0 0 1 some computation addi r9, r9, 1 blt r9, r19, DOj

  19. movi r9, 0 Learned thus far DOj: DOj: DOj: 1 0 1 some computation addi r9, r9, 1 blt r9, r19, DOj movi r9, 0 1 1 0 0 1 1 0 0 1 some computation addi r9, r9, 1 blt r9, r19, DOj movi r9, 0 history prediction correct 10 1 PC 00 some computation addi r9, r9, 1 blt r9, r19, DOj correct 11 0 PC 00

  20. Learned thus far movi r9, 0 1 0 1 DOj: DOj: DOj: some computation addi r9, r9, 1 blt r9, r19, DOj movi r9, 0 1 1 0 history prediction 0 1 1 01 1 PC 00 0 0 1 some computation addi r9, r9, 1 blt r9, r19, DOj movi r9, 0 history prediction 10 1 PC 00 some computation addi r9, r9, 1 blt r9, r19, DOj 11 0 PC 00

  21. PC bimodal Which is best for this branch? gshare

  22. PC bimodal gshare meta

  23. Fast Prediction available C2 C1 C3 Overwriting Prediction fetch decode exec fetch decode

  24. BTB PC TARGET ADDRESS V PC TARGET ADDRESS V PC TARGET ADDRESS V

  25. PC PC+4 Next PC BTB Direction Predictor

  26. Calls and returns

  27. If (error != 0) error_handle(); If (a[i] < threshold) a++; else b++; Load a[i] in r8 blt r8, r9, THEN ELSE: addi r10, r10, 1 br THEN: addi r11, r11, 1 DONE: # r9 holds threshold # b++ DONE # a++ Load a[i] in r8 cmplt c0, r8, r9 c0: addi r10, r10, 1 !c0: addi r11, r11, 1 # condition register c0 = r8 < r9

Related


More Related Content