Advanced Computer Architecture Concepts

1 / 27

Embed Share

Delve into the intricate world of computer architecture with topics covering pipeline stages, branch prediction, instruction execution, and more. Explore the complexities and optimizations crucial for efficient processor design.

brunelli_m Follow

Uploaded on Apr 19, 2025 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

Download Presentation

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript

TIME C2 C7 C9 C1 C4 C8 C3 C5 C6 I1 branch decode exec mem wb bubble bubble bubble bubble bubble bubble bubble bubble bubble bubble fetch decode exec mem wb I4 I5 fetch decode exec mem wb Redirected fetch

TIME C2 C7 C9 C1 C4 C8 C3 C5 C6 I1 branch decode exec mem wb bubble bubble bubble bubble bubble fetch decode exec mem wb I2 I3 fetch decode exec mem wb Redirected fetch

Predict PC + 4 Resolve if branch Resolve if non-branch TIME C2 C7 C9 C1 C4 C8 C3 C5 C6 I1 fetch decode exec mem wb I2 fetch decode exec mem wb I3 fetch decode exec mem wb fetch decode exec mem wb I4 I5 fetch decode exec mem wb

Predict PC + 4 Resolve next PC != PC + 4 TIME C2 C7 C9 C1 C4 C8 C3 C5 C6 I1 branch decode exec mem squashed wb I2 fetch decode bubble bubble bubble I3 fetch bubble bubble bubble bubble fetch decode exec mem wb I4 I5 fetch decode exec mem wb Redirected fetch

do { } while (i < 100); if (a[i] != 0) some computation i++; DOWHILE: load in r10 a[i] SKIP: beq some computation r10, r0, SKIP some computation addi r11, r11, 1 blt r11, r12, DOWHILE

Instruction opcode available Calculate Taken PC C2 C1 C3 Select PC+4 or Taken PC I1 fetch cache N decode exec S F? FETCH fetch decode

ITERATION 1 ITERATION 2 DOWHILE: load in r10 a[i] DOWHILE: SKIP: load in r10 a[i] beq some computation FIRST TIME SEEN LEARN NOT TAKEN PREDICT NOT TAKEN r10, r0, SKIP some computation addi blt r11, r11, 1 r11, r12, DOWHILE FIRST TIME SEEN MISPREDICTION PREDICT NOT TAKEN LEARN TAKEN SEEN BEFORE NOT TAKEN PREDICT SAME AS LAST TIME : LEARN NOT TAKEN SKIP: beq some computation r10, r0, SKIP some computation addi blt r11, r11, 1 r11, r12, DOWHILE SEEN BEFORE PREDICT TAKEN PREDICT SAME AS LAST TIME : LEARN NOT TAKEN

V PC N V PC N V PC N

ITERATION 1 DOWHILE: 0x100 SKIP: 0x200 before after 0 0 PC PC 0 0 1 0 0x100 PC 0 0 load in r10 a[i] beq r10, r0, SKIP some computation 0 PC 0 0 PC 0 Predict not taken (default) 1 0 0x100 PC 0 0 1 1 0x100 0x200 0 1 addi r11, r11, 1 blt r11, r12, DOWHILE 0 PC 0 0 PC 0 Predict not taken (default) ITERATION 2 DOWHILE: 0x100 SKIP: 0x200 1 0 0x100 PC 0 0 1 1 0x100 0x200 0 1 load in r10 a[i] beq r10, r0, SKIP some computation 0 PC 0 0 PC 0 Predict not taken (table) addi r11, r11, 1 blt r11, r12, DOWHILE 1 0 0x100 PC 0 0 1 1 0x100 0x200 0 1 0 PC 0 0 PC 0 Predict taken (table)

Accuracy Accuracy = # correct predictions / # all Predictions 100 beq all not taken 100 blt 1 not taken at the end Predictions: Beq: all not taken default Blt: first not taken wrong (default), last taken wrong Accuracy = 100 + 98 / 200 = 99%

How big this needs to be? V PC N V PC N V PC N 4G addresses, 4 bytes per instruction, aligned 1G possible branches 1G entries, each 4 bytes (PC), 2 bits (V & N) TOO LARGE

How big this needs to be? V PC N V PC N V PC N But if we had 1G entries we have 1-to-1 mapping of PC to entry: V N V N 1G V N No need for PC

PC 00 PC 00 V N V N h() V N V N 1G Few entries V N V N PC 00 N h() N Few entries N

PC 00 T T T h() NT 00 01 10 11 T 01 01 NT NT NT 10

movi r18, 3 # max i movi r19, 2 # max j movi r8, 0 DOi: movi r9, 0 DOj: some computation addi r9, r9, 1 blt r9, r19, DOj #J branch addi r8, r8, 1 blt r8, r18, DOi # I branch #i = 0 # j = 0 T T NT T T NT T T NT T (11) T(11) T(10) T(11) T (11) T(10) T(11) T (11) T(10)

older younger history PC 0 0 00 h() 0 0 0

movi r9, 0 DOj: DOj: DOj: some computation addi r9, r9, 1 blt r9, r19, DOj movi r9, 0 history prediction Pattern learned 0 0 1 0 0 0 PC 00 some computation addi r9, r9, 1 blt r9, r19, DOj movi r9, 0 history prediction 1 0 1 1 0 0 PC 00 some computation addi r9, r9, 1 blt r9, r19, DOj history prediction 1 1 0 1 1 0 PC 00

movi r9, 0 DOj: DOj: DOj: some computation addi r9, r9, 1 blt r9, r19, DOj movi r9, 0 history prediction Pattern learned 01 1 01 0 PC 00 Learned thus far some computation addi r9, r9, 1 blt r9, r19, DOj movi r9, 0 1 0 1 1 1 0 0 1 1 0 0 1 some computation addi r9, r9, 1 blt r9, r19, DOj

movi r9, 0 Learned thus far DOj: DOj: DOj: 1 0 1 some computation addi r9, r9, 1 blt r9, r19, DOj movi r9, 0 1 1 0 0 1 1 0 0 1 some computation addi r9, r9, 1 blt r9, r19, DOj movi r9, 0 history prediction correct 10 1 PC 00 some computation addi r9, r9, 1 blt r9, r19, DOj correct 11 0 PC 00

Learned thus far movi r9, 0 1 0 1 DOj: DOj: DOj: some computation addi r9, r9, 1 blt r9, r19, DOj movi r9, 0 1 1 0 history prediction 0 1 1 01 1 PC 00 0 0 1 some computation addi r9, r9, 1 blt r9, r19, DOj movi r9, 0 history prediction 10 1 PC 00 some computation addi r9, r9, 1 blt r9, r19, DOj 11 0 PC 00

PC bimodal Which is best for this branch? gshare

PC bimodal gshare meta

Fast Prediction available C2 C1 C3 Overwriting Prediction fetch decode exec fetch decode

BTB PC TARGET ADDRESS V PC TARGET ADDRESS V PC TARGET ADDRESS V

PC PC+4 Next PC BTB Direction Predictor

Calls and returns

If (error != 0) error_handle(); If (a[i] < threshold) a++; else b++; Load a[i] in r8 blt r8, r9, THEN ELSE: addi r10, r10, 1 br THEN: addi r11, r11, 1 DONE: # r9 holds threshold # b++ DONE # a++ Load a[i] in r8 cmplt c0, r8, r9 c0: addi r10, r10, 1 !c0: addi r11, r11, 1 # condition register c0 = r8 < r9

Advanced Computer Architecture Concepts

Download Presentation

Presentation Transcript

Related

More Related Content