Reorder Buffers in Computer Architecture

 
EECS 470
 
Adding a RoB
 
Lecture 7 – Winter 2024
 
 
 
Slides developed in part by Profs. Austin, Brehob, Falsafi, 
Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, and Wenisch of Carnegie
Mellon University, Purdue University, University of Michigan, University of
Pennsylvania, and University of Wisconsin.
 
Last time:
 
Covered branch predictors
Direction
Bimodal, local history, global, gshare, tournament
Address
BTB
RAS (briefly)
 
Warning: Crazy times coming
 
HW2 due on Friday 2/2 (tomorrow!)
Will do group formation and project materials on Tuesday 2/6.
G
r
o
u
p
 
f
o
r
m
a
t
i
o
n
 
w
i
l
l
 
b
e
 
d
o
n
e
 
i
n
 
c
l
a
s
s
 
o
n
 
T
h
u
r
s
d
a
y
 
2
/
8
P3 is due on Sunday 2/9
It’s a lot of work (20 hours?)
Proposal is due on Tuesday 2/13
It’s not a lot of work (1 hour?) to do the write-up, but you’ll need to meet
with your group and discuss things.
Don’t worry too much about getting this right.  You’ll be allowed to change
(we’ll meet the following Friday).  Just a line in the sand.  Will discuss on 2/6.
HW3 is due on Wednesday 2/14.  No late homework. Answers
posted right after it’s due!
Midterm is on Thursday 2/15 in the evening
Review session over that weekend (date/time TBA)
Q&A in class 2/15
Best way to study is look at old exams (posted on-line later this week)
20 minute proposal meetings on Friday 2/16 rather than lab
 
General speculation
 
Control speculation
“I think this branch will go to address 90004”
Data speculation
“I’ll guess the result of the load will be zero”
Memory conflict speculation
“I don’t think this load conflicts with any proceeding
store.”
Error speculation
“I don’t think there were any errors in this calculation”
Speculation in general
 
Need to be 100% sure on final
correctness!
So need a recovery mechanism
Must make forward progress!
Want to speed up overall performance
S
o
 
r
e
c
o
v
e
r
y
 
c
o
s
t
 
s
h
o
u
l
d
 
b
e
 
l
o
w
 
o
r
 
e
x
p
e
c
t
e
d
r
a
t
e
 
o
f
 
o
c
c
u
r
r
e
n
c
e
 
s
h
o
u
l
d
 
b
e
 
l
o
w
.
There can be a real trade-off on 
accuracy
,
cost of recovery
, and 
speedup when correct.
Should keep the worst case in mind…
M
E
M
 
Precise Interrupts and branches via the
Reorder Buffer
 
@
 
A
l
l
o
c
Allocate result storage at Tail
@
 
S
c
h
e
d
Get inputs (ROB T-to-H then ARF)
Wait until all inputs ready
@
 
W
B
Write results/fault to ROB
Indicate result is ready
@
 
C
T
Wait until inst @ Head is done
If fault, initiate handler
Else, write results to ARF
Deallocate entry from ROB
I
F
I
D
A
l
l
o
c
S
c
h
e
d
E
X
C
T
 
Head
 
Tail
PC
Dst regID
Dst value
Except?
 
Reorder Buffer (ROB)
Circular queue of spec state
May contain multiple definitions
of 
same
 register
 
In-order
 
In-order
 
Any order
A
R
F
 
Reorder Buffer Example
 
Code Sequence
 
  f1 = f2 / f3
  r3 = r2 + r3
  r4 = r3 – r2
 
Initial Conditions
 
  - reorder buffer empty
  - f2 = 3.0
  - f3 = 2.0
  - r2 = 6
  - r3 = 5
 
ROB
 
Time
 
H
 
T
 
regID: f1
result: ?
Except: ?
 
H
 
T
 
regID: f1
result: ?
Except: ?
 
regID: r3
result: ?
Except: ?
 
H
 
T
 
regID: f1
result: ?
Except: ?
 
regID: r3
result: 11
Except: N
 
regID: r4
result: ?
Except: ?
r3
 
regID: r8
result: 2
Except: n
 
regID: r8
result: 2
Except: n
 
regID: r8
result: 2
Except: n
 
Reorder Buffer Example
 
Code Sequence
 
  f1 = f2 / f3
  r3 = r2 + r3
  r4 = r3 – r2
 
Initial Conditions
 
  - reorder buffer empty
  - f2 = 3.0
  - f3 = 2.0
  - r2 = 6
  - r3 = 5
 
ROB
 
Time
 
H
 
T
 
regID: f1
result: ?
Except: ?
 
regID: r3
result: 11
Except: n
 
regID: r4
result: 5
Except: n
 
H
 
T
 
regID: f1
result: ?
Except: y
 
regID: r3
result: 11
Except: n
 
regID: r4
result: 5
Except: n
 
regID: r8
result: 2
Except: n
 
regID: r8
result: 2
Except: n
 
H
 
T
 
regID: f1
result: ?
Except: y
 
regID: r3
result: 11
Except: n
 
regID: r4
result: 5
Except: n
 
Reorder Buffer Example
 
Code Sequence
 
  f1 = f2 / f3
  r3 = r2 + r3
  r4 = r3 – r2
 
Initial Conditions
 
  - reorder buffer empty
  - f2 = 3.0
  - f3 = 2.0
  - r2 = 6
  - r3 = 5
 
ROB
 
Time
 
H
 
T
 
H
 
T
 
first inst
of fault
handler
 
There is more complexity here
 
Rename table needs to be cleared
Everything is in the ARF
Really do need to finish everything which was
before the faulting instruction in program order.
What about branches?
Would need to drain everything before the branch.
Why not just squash everything that follows it?
 
And while we’re at it…
 
Does the ROB replace the RS?
Is this a good thing?  Bad thing?
 
ROB
 
ROB
ROB is an 
in-order
 queue where instructions are placed.
Instructions 
complete
 
(retire) in-order
Instructions still 
execute
 out-of-order
Still use RS
Instructions are issued to RS and ROB at the same time
Rename is to ROB entry, not RS.
When 
execute
 done instruction leaves RS
Only when all instructions in before it in program order are
done does the instruction retire.
 
Adding a Reorder Buffer
 
Tomasulo Data Structures
(Timing Free Example, “P6 scheme”)
 
Review Questions
 
Could we make this work without the RS?
If so, why do we do that?
Why is it important to retire in order?
Why must branches wait until retirement before
they announce their mispredict?
Any other ways to do this?
 
 
More review questions
 
1.
What is the purpose of the RoB?
2.
Why do we have both a RoB and a RS?
Yes, that was pretty much on the last page…
3.
Misprediction
a)
When to we resolve a mis-prediction?
b)
What happens to the main structures (RS, RoB,
ARF, Rename Table) when we mispredict?
4.
What is the whole purpose of OoO execution?
 
And yet more review questions!
 
1.
What is the purpose of the RoB?
2.
Why do we have both a RoB and a RS?
3.
Misprediction
a)
When to we resolve a mis-prediction?
b)
What happens to the main structures (RS, RoB,
ARF, Rename Table) when we mispredict?
4.
What is the whole purpose of OoO execution?
 
When an instruction is 
dispatched
 how does
it impact each major structure?
 
Rename table?
 
ARF?
 
RoB?
 
RS?
 
When an instruction 
completes execution
 how
does it impact each major structure?
 
Rename table?
 
ARF?
 
RoB?
 
RS?
 
When an instruction 
retires
 how does it
impact each major structure?
 
Rename table?
 
ARF?
 
RoB?
 
RS?
 
Topic change
 
Why on earth are we doing this?
Why do we think it helps?
 
Homework 2 problems 5 and 6 made the
argument.
Only need to obey true data dependencies.
H
u
g
e
 
s
p
e
e
d
u
p
 
p
o
t
e
n
t
i
a
l
.
 
Optimizing CPU Performance
 
Golden Rule: t
CPU
 = N
inst
*CPI*t
CLK
Given this, what are our options
Reduce the number of instructions executed
Reduce the cycles to execute an instruction
Reduce the clock period
Our first focus: Reducing CPI
Approach: 
Instruction Level Parallelism
 (ILP)
 
Why ILP?
 
V
s
.
 
Requirements
Parallelism
Large window
Limited control deps
Eliminate “false” deps
Find run-time deps
 
How Much ILP is There?
(Chapter 3.10)
 
How Large Must the “Window” Be?
 
ALU Operation 
GOOD
, Branch 
BAD
Expected Number of Branches
Between Mispredicts
 
E(X) ~ 1/(1-p)
 
E.g., p = 95%, E(X) ~ 20 brs, 100-ish insts
 
How Accurate are Branch Predictors?
 
Impact of Physical Storage Limitations
 
Each instruction “in flight” must have storage for its
result
Really worse than this because of mispeculation…
 
Registers 
GOOD
, Memory 
BAD
 
Benefits of registers
Well described deps
Fast access
Finite resource
Memory loses these benefits
for flexibility
*p = …
*q = …
… = *p
 
?
 
“Bottom Line” for an Ambitious Design
Slide Note
Embed
Share

Explore the concept of Reorder Buffers (ROB) in computer architecture as a crucial component for precise interrupts and branch handling. Developed by a collaborative effort from notable professionals, delve into topics like branch predictors, speculation mechanisms, and practical examples of ROB operations. Gain insights into the importance of speculation accuracy, recovery mechanisms, and trade-offs for enhanced overall performance in computing systems.

  • Computer architecture
  • Reorder Buffer
  • Speculation mechanisms
  • Branch prediction
  • Performance optimization

Uploaded on Mar 09, 2024 | 5 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. EECS 470 Adding a RoB Lecture 7 Winter 2024 Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, and Wenisch of Carnegie Mellon University, Purdue University, University of Michigan, University of Pennsylvania, and University of Wisconsin.

  2. Last time: Covered branch predictors Direction Bimodal, local history, global, gshare, tournament Address BTB RAS (briefly)

  3. General speculation Control speculation I think this branch will go to address 90004 Data speculation I ll guess the result of the load will be zero Memory conflict speculation I don t think this load conflicts with any proceeding store. Error speculation I don t think there were any errors in this calculation

  4. Speculation in general Need to be 100% sure on final correctness! So need a recovery mechanism Must make forward progress! Want to speed up overall performance So recovery cost should be low or expected rate of occurrence should be low. There can be a real trade-off on accuracy, cost of recovery, and speedup when correct. Should keep the worst case in mind

  5. Precise Interrupts and branches via the Reorder Buffer @ Alloc Allocate result storage at Tail @ Sched Get inputs (ROB T-to-H then ARF) Wait until all inputs ready @ WB Write results/fault to ROB Indicate result is ready @ CT Wait until inst @ Head is done If fault, initiate handler Else, write results to ARF Deallocate entry from ROB Any order MEM EX IF ID CT Alloc Sched In-order In-order ROB ARF PC Dst regID Dst value Except? Reorder Buffer (ROB) Circular queue of spec state May contain multiple definitions of same register Head Tail

  6. Reorder Buffer Example ROB Code Sequence regID: r8 result: 2 Except: n regID: f1 result: ? Except: ? f1 = f2 / f3 r3 = r2 + r3 r4 = r3 r2 H T regID: r8 result: 2 Except: n regID: r3 result: ? Except: ? regID: f1 result: ? Except: ? Time Initial Conditions - reorder buffer empty - f2 = 3.0 - f3 = 2.0 - r2 = 6 - r3 = 5 H T regID: r8 result: 2 Except: n regID: r4 result: ? Except: ? regID: r3 result: 11 Except: N regID: f1 result: ? Except: ? r3 H T

  7. Reorder Buffer Example ROB Code Sequence regID: r8 result: 2 Except: n regID: r4 result: 5 Except: n regID: r3 result: 11 Except: n regID: f1 result: ? Except: ? f1 = f2 / f3 r3 = r2 + r3 r4 = r3 r2 H T regID: r4 result: 5 Except: n regID: r8 result: 2 Except: n regID: r3 result: 11 Except: n regID: f1 result: ? Except: y Time Initial Conditions - reorder buffer empty - f2 = 3.0 - f3 = 2.0 - r2 = 6 - r3 = 5 H T regID: r4 result: 5 Except: n regID: r3 result: 11 Except: n regID: f1 result: ? Except: y H T

  8. Reorder Buffer Example ROB Code Sequence f1 = f2 / f3 r3 = r2 + r3 r4 = r3 r2 HT first inst of fault handler Time Initial Conditions - reorder buffer empty - f2 = 3.0 - f3 = 2.0 - r2 = 6 - r3 = 5 H T

  9. There is more complexity here Rename table needs to be cleared Everything is in the ARF Really do need to finish everything which was before the faulting instruction in program order. What about branches? Would need to drain everything before the branch. Why not just squash everything that follows it?

  10. And while were at it Does the ROB replace the RS? Is this a good thing? Bad thing?

  11. ROB ROB ROB is an in-order queue where instructions are placed. Instructions complete (retire) in-order Instructions still execute out-of-order Still use RS Instructions are issued to RS and ROB at the same time Rename is to ROB entry, not RS. When execute done instruction leaves RS Only when all instructions in before it in program order are done does the instruction retire.

  12. Adding a Reorder Buffer

  13. CDB T Tomasulo Data Structures (Timing Free Example, P6 scheme ) V Map Table Reg Tag r0 r1 r2 r3 r4 Reservation Stations (RS) T FU busy op 1 2 3 4 5 ARF Reg V r0 r1 r2 r3 r4 RoB T1 T2 V1 V2 Instruction r0=r1*r2 r1=r2*r3 Branch if r1=0 r0=r1+r1 r2=r2+1 Reorder Buffer (RoB) RoB Number Dest. Reg. Value 0 1 2 3 4 5 6

  14. Review Questions Could we make this work without the RS? If so, why do we do that? Why is it important to retire in order? Why must branches wait until retirement before they announce their mispredict? Any other ways to do this?

  15. More review questions 1. What is the purpose of the RoB? 2. Why do we have both a RoB and a RS? Yes, that was pretty much on the last page 3. Misprediction a) When to we resolve a mis-prediction? b) What happens to the main structures (RS, RoB, ARF, Rename Table) when we mispredict? 4. What is the whole purpose of OoO execution?

  16. And yet more review questions! 1. What is the purpose of the RoB? 2. Why do we have both a RoB and a RS? 3. Misprediction a) When to we resolve a mis-prediction? b) What happens to the main structures (RS, RoB, ARF, Rename Table) when we mispredict? 4. What is the whole purpose of OoO execution?

  17. When an instruction is dispatched how does it impact each major structure? Rename table? ARF? RoB? RS?

  18. When an instruction completes execution how does it impact each major structure? Rename table? ARF? RoB? RS?

  19. When an instruction retires how does it impact each major structure? Rename table? ARF? RoB? RS?

  20. Topic change Why on earth are we doing this? Why do we think it helps? Homework 2 problems 5 and 6 made the argument. Only need to obey true data dependencies. Huge speedup potential.

  21. Optimizing CPU Performance Golden Rule: tCPU= Ninst*CPI*tCLK Given this, what are our options Reduce the number of instructions executed Reduce the cycles to execute an instruction Reduce the clock period Our first focus: Reducing CPI Approach: Instruction Level Parallelism (ILP)

  22. Why ILP? Requirements Parallelism Large window Limited control deps Eliminate false deps Find run-time deps Vs.

  23. How Much ILP is There? (Chapter 3.10)

  24. How Large Must the Window Be?

  25. ALU Operation GOOD, Branch BAD Expected Number of Branches Between Mispredicts E(X) ~ 1/(1-p) E.g., p = 95%, E(X) ~ 20 brs, 100-ish insts

  26. How Accurate are Branch Predictors?

  27. Impact of Physical Storage Limitations Each instruction in flight must have storage for its result Really worse than this because of mispeculation

  28. Registers GOOD, Memory BAD Benefits of registers Well described deps Fast access Finite resource Memory loses these benefits for flexibility *p = *q = ? = *p

  29. Bottom Line for an Ambitious Design

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#