Embedded Computer Architecture - Instruction Level Parallel Architectures Overview

Embedded Computer Architecture
5SAI0
Instruction Level Parallel
Architectures
Part III (final part)
 
Henk Corporaal
www.ics.ele.tue.nl/~heco/courses/ECA
TUEindhoven
2021-2022
Topics on ILP architectures
 
Introduction, Hazards (short recap)
Out-Of-Order (OoO) execution 
(H&P sect 3.4 - 3.6)
Branch prediction 
(H&P sect 3.3+3.9)
Multiple issue 
(H&P sect 3.7 – 3.8)
Superscalar
VLIW & 
EPIC
TTA 
(not in H&P)
How much ILP is there?
What can the compiler do? 
(H&P sect 3.2)
Material 
Ch 3
 (H&P or Dubois, second part)
9/20/2024
ECA  H.Corporaal
2
Recap: Single Issue RISC vs Superscalar
Recap: Single Issue RISC vs VLIW
Recap: VLIW:
general concept
A VLIW architecture with 7 FUs: three integer FUs, two load/store units and two FP FUs, fed by an integer and a floating-point register file, with instruction and data memory.
VLIW alternative: Intel EPIC Architecture IA-64
 
Explicit Parallel Instruction Computer (EPIC)
IA-64 architecture -> Itanium, first realization 2001
 
Idea: instead of packing a fixed number of operations into a large instruction, keep the original instructions but add control bits indicating whether the current instruction can be executed in parallel with the previous one.
 
Many Registers:
128  64-bit integer registers (stacked, rotating)
128  82-bit floating-point registers (rotating)
64   1-bit boolean (predicate) registers
8    64-bit branch target address registers
system control registers
See http://en.wikipedia.org/wiki/Itanium
Itanium
organization
EPIC Architecture: IA-64
 
Instructions are grouped in 128-bit bundles (of 3 instructions each):
3 * 41-bit instructions
5 template bits, indicating instruction types and stop location
Each 41-bit instruction:
starts with a 4-bit opcode, and
ends with a 6-bit guard (boolean) register-id
Supports speculative loads (follows later)
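The bundle layout (3 x 41-bit slots + 5 template bits = 128 bits) can be checked with a small unpacking sketch. The bit ordering below (template in the low bits, slots above it) matches the actual IA-64 encoding, but treat it as an illustration of the slide's field sizes rather than a full decoder:

```python
def unpack_bundle(bundle: int):
    """Split a 128-bit bundle into its 5-bit template and three 41-bit slots."""
    assert 0 <= bundle < (1 << 128)
    template = bundle & 0x1F                 # 5 template bits: type + stop location
    mask41 = (1 << 41) - 1
    slots = [(bundle >> (5 + 41 * i)) & mask41 for i in range(3)]
    return template, slots

def decode_slot(instr: int):
    """Per the slide: a 41-bit instruction starts with a 4-bit opcode
    and ends with a 6-bit guard (predicate) register id."""
    opcode = (instr >> 37) & 0xF             # top 4 bits
    guard = instr & 0x3F                     # low 6 bits
    return opcode, guard
```

For example, a bundle built as `(7 << 87) | (3 << 46) | (1 << 5) | 0b10101` unpacks to template `0b10101` and slots `[1, 3, 7]`.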
Itanium 2:
McKinley
C
hip layout:
EPIC Architecture advantages

EPIC allows for more binary compatibility than a plain VLIW:
Function-unit assignment is performed at run-time
=> the number of FUs is architecturally invisible
Lock (stall) when FU results are not available
=> FU pipeline latencies are architecturally invisible

Example (the consumer simply waits until r2 is ready):
Mul r2,r5,r6   // 2-cycle multiply
Add r1,r2,r3   // stalls until r2 is ready

Intel stopped Itanium in 2017: why?
See the course and other websites for more info
(look at related material)
Architecture options

Bind operations (where) vs. schedule operations (when):

                   Schedule: static    Schedule: dynamic
Bind: static       VLIW                TRIPS
Bind: dynamic      EPIC                Superscalar

Other design choices, like:
Visible pipeline latencies (yes/no)
Visible FU-RF data transports (yes/no)
Visible local memory (called Scratchpad) vs. Cache
... and many more ...
Architectural invisibility gives backwards binary compatibility !!
What did we talk about?

ILP = Instruction Level Parallelism = the ability to perform multiple operations (or instructions), from a single instruction stream, in parallel

VLIW = Very Long Instruction Word architecture; example instruction format (5-issue):
operation 1 | operation 2 | operation 3 | operation 4 | operation 5
VLIW
evaluation
CPU datapath with function units FU-1 ... FU-5: instruction memory, fetch and decode units, register file and data memory. The control problem: with N function units the bypassing network scales as O(N^2) and the register file as O(N)-O(N^2).
VLIW evaluation
 
Strong points of VLIW:
Scalable (add more FUs)
Flexible (an FU can be almost anything; e.g. multimedia support)
 
Weak points:
With N FUs:
Bypassing complexity:     O(N^2)
Register file complexity: O(N)
Register file size:       O(N^2)
Register file design restricts FU flexibility
 
Solution: 
.................................................. ?
Solution
Mirror the programming paradigm:
TTA: Transport Triggered Architecture
Transport
Triggered
Architecture
General organization 
of a TTA
TTA structure; datapath details
 
Note: TTAs have a lot in common with the Blocks CGRA from Lab 1
TTA hardware characteristics
 
Modular: building blocks easy to reuse
Very flexible and scalable
easy inclusion of Special Function Units (SFUs)
Very low complexity
> 50% reduction on # register ports
reduced bypass complexity (no associative matching)
up to 80 % reduction in bypass connectivity
trivial decoding
reduced register pressure
easy register file partitioning (a single port is enough!)
TTA software characteristics

Add r3,r1,r2 becomes three moves:
r1 -> add.o1;
r2 -> add.o2;
add.r -> r3

That does not look like an improvement !?!
Program TTAs

How to do data operations?
1. Transport of operands to FU
   Operand move(s)
   Trigger move
2. Transport of results from FU
   Result move(s)

How to do control flow?
1. Jump:   #jump-address -> pc
2. Branch: #displacement -> pcd
3. Call:   pc -> r; #call-address -> pcd
 
Example: Add r3,r1,r2 becomes

r1 -> Oint      // operand move to integer unit
r2 -> Tadd      // trigger move to integer unit
......          // addition operation in progress
Rint -> r3      // result move from integer unit
Scheduling example

VLIW:
add r1,r1,r2
sub r4,r1,95

TTA:
r1 -> add.o1, r2 -> add.o2
add.r -> sub.o1, 95 -> sub.o2
sub.r -> r4

(Datapath: integer RF, immediate unit, two integer ALUs, load/store unit)
TTA Instruction format

A TTA instruction consists of multiple move fields (move 1 ... move 4).

General MOVE field:
g:   guard specifier
i:   immediate specifier
src: source
dst: destination

How to use immediates? Small (6 bits): g | 1 | imm | dst. Long (32 bits): g | 0 | Ir-1 | dst, with the immediate in a separate register field.
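A sketch of packing and unpacking one MOVE field. The slide names the four sub-fields but not their widths, so the widths below are assumptions chosen only for illustration:

```python
# Hypothetical field widths -- the slide specifies the fields (g, i, src, dst)
# but not their sizes, so these are illustrative only.
G_BITS, I_BITS, SRC_BITS, DST_BITS = 3, 1, 6, 6

def encode_move(g, i, src, dst):
    """Pack one MOVE field: guard specifier, immediate flag, source, destination."""
    for val, bits in ((g, G_BITS), (i, I_BITS), (src, SRC_BITS), (dst, DST_BITS)):
        assert 0 <= val < (1 << bits)
    return (((g << I_BITS | i) << SRC_BITS | src) << DST_BITS) | dst

def decode_move(word):
    """Inverse of encode_move: recover (g, i, src, dst)."""
    dst = word & ((1 << DST_BITS) - 1); word >>= DST_BITS
    src = word & ((1 << SRC_BITS) - 1); word >>= SRC_BITS
    i = word & ((1 << I_BITS) - 1);     word >>= I_BITS
    return word, i, src, dst
```

Decoding is trivial (fixed fields, no associative matching), which is exactly the "trivial decoding" advantage listed above.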
Summary on ILP architecture
 
Superscalar
Binary compatible
Multi-issue, OoO, branch speculation
VLIW
Not compatible, requires re-compilation of source code
More energy efficient
TTA (and Blocks CGRA)
A kind of VLIW
Solves / relaxes remaining VLIW inefficiencies when scaling up
RF #ports reduction
Connectivity reduction
Topics on ILP architectures
Introduction, Hazards (short recap)
Out-Of-Order (OoO) execution:
Branch prediction
Multiple issue
How much ILP is there?
Measuring ILP
Exceeding ILP limits
What can the compiler do?
Material 
Ch 3
 (H&P or Dubois, second part)
How parallel should we make a
processor?
Depends on 
how much parallelism is
there in your application?
Measuring available ILP: How?
 
Using existing compiler
Using trace analysis
Track all the real data dependencies (RaWs) of instructions from issue
window
register dependences
memory dependences
Check for correct branch prediction
if prediction correct continue
if wrong, flush schedule and start in next cycle
Trace analysis

Program:
for i := 0..2
   A[i] := i;
S := X+3;

Compiled code:
      set  r1,0
      set  r2,3
      set  r3,&A
Loop: st   r1,0(r3)
      add  r1,r1,1
      add  r3,r3,4
      brne r1,r2,Loop
      add  r1,r5,3

Trace (16 instructions): the three set's, the three executed iterations of the loop body, and the final add r1,r5,3.

How parallel can this code be executed?
Trace analysis

Max ILP = Speedup = Lserial / Lparallel = 16 / 6 = 2.7

Is this the maximum?
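The trace-analysis idea can be sketched as an ASAP scheduler that honors only true (RaW) dependences, under simplifying assumptions: perfect branch prediction, unit latency, infinite issue width, and full register renaming (WAR/WAW ignored). The function name and the `(written_reg, read_regs)` tuple encoding are illustrative, not from the slides:

```python
def asap_ilp(trace):
    """ASAP-schedule a trace on RaW dependences only (unit latency,
    infinite resources): each instruction issues one cycle after the
    latest writer of any register it reads. Returns (depth, ILP)."""
    ready_at = {}                    # register -> cycle its value is ready
    depth = 0
    for dst, srcs in trace:          # instruction = (written reg, read regs)
        cycle = 1 + max((ready_at.get(r, 0) for r in srcs), default=0)
        if dst is not None:
            ready_at[dst] = cycle    # renaming assumed: WAR/WAW hazards ignored
        depth = max(depth, cycle)
    return depth, len(trace) / depth
```

A toy diamond-shaped trace illustrates the idea: two independent consumers of r1 run in the same cycle, so four instructions fit in three levels (ILP 4/3).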
Ideal Processor
 
Assumptions for ideal/perfect processor:
 
1. 
Register renaming
 
– infinite number of virtual registers => all register WAW & WAR
hazards avoided
 
2. 
Branch and Jump prediction
 
– Perfect => all program instructions available for
execution
 
3. 
Memory-address alias analysis
 
– addresses are known. A store can be moved before a
load provided addresses not equal
 
Also:
unlimited number of instructions issued per cycle (unlimited resources), and
unlimited instruction window
perfect caches
1-cycle latency for all instructions (including FP * and /)

Programs were compiled using the MIPS compiler with maximum optimization level
Upper Limit to ILP: 
Ideal Processor
IPC chart (gcc, espresso, li, fpppp, doducd, tomcatv) — Integer: 18 - 60; FP: 75 - 150.
Window Size and Branch Impact
Change from infinite instr. window to 
examine 2000
, and 
issue at most 64 
instructions
per cycle
IPC chart per program and branch predictor (Perfect, Tournament, BHT(512), Profile, No prediction) — FP: 15 - 45; Integer: 6 - 12.
Impact of Limited Renaming Registers
Assume: 2000 instr. window, 64 instr. issue, 8K 2-level predictor
(slightly better than tournament predictor)
IPC chart per program and number of renaming registers (Infinite, 256, 128, 64, 32) — Integer: 5 - 15; FP: 11 - 45.
Memory Address Alias Impact
Assume:  2000 instr. window, 64 instr. issue, 8K 2-level predictor, 256 renaming
registers
IPC chart per program and memory-disambiguation quality (Perfect, Global/stack perfect, Inspection, None) — FP: 4 - 45 (Fortran, no heap); Integer: 4 - 9.
Window Size Impact
Assumptions: Perfect disambiguation, 1K Selective predictor, 16 entry return stack, 64
renaming registers, 
issue as many as in window
IPC chart per program and window/issue size (Infinite, 256, 128, 64, 32, 16, 8, 4) — Integer: 6 - 12; FP: 8 - 45.
How to Exceed ILP Limits of this Study?
 
Solve WAR and WAW hazards through memory:
eliminated WAW and WAR hazards through register renaming, but not yet for
memory operands
 
Avoid unnecessary dependences
E.g., compiler did not unroll loops so we keep the iteration index variable
dependence; e.g. in i=i+1
 
Overcoming the data flow limit: 
value prediction = 
predicting values and
speculating on prediction
E.g. predict next load address and speculate by reordering loads and stores.
Topics on ILP architectures
Introduction, Hazards (short recap)
Out-Of-Order (OoO) execution:
Branch prediction
Multiple issue
How much ILP is there?
Measuring ILP
Exceeding ILP limits
What can the compiler do?
Material 
Ch 3
 (H&P or Dubois, second part)
What can the compiler do?
 
Loop transformations
Code scheduling
 
Special course 5LIM0
on Parallelization and
Compilers
LLVM based
An oldie on compilers
Basic compiler techniques
Dependencies limit ILP (Instruction-Level Parallelism)
We cannot always find sufficient independent operations to fill all the delay slots
May result in pipeline stalls
Scheduling
 to avoid stalls (= reorder instructions)
(Source-)code transformations 
to create more exploitable parallelism
Loop Unrolling
Loop Merging (Fusion)
see online slide-set about loop transformations !!
Dependencies Limit ILP: Example

C loop:
for (i=1; i<=1000; i++)
   x[i] = x[i] + s;

MIPS assembly code:          ; R1 = &x[1], R2 = &x[1000]+8, F2 = s
Loop: L.D   F0,0(R1)    ; F0 = x[i]
      ADD.D F4,F0,F2    ; F4 = x[i]+s
      S.D   0(R1),F4    ; x[i] = F4
      ADDI  R1,R1,8     ; R1 = &x[i+1]
      BNE   R1,R2,Loop  ; branch if R1 != &x[1000]+8
Schedule this on an example processor
 
FP operations are mostly 
multicycle
The pipeline must be 
stalled
 if an instruction uses the result of a not yet
finished multicycle operation
We'll assume the following latencies:

Producing instruction   Consuming instruction   Latency (clock cycles)
FP ALU op               FP ALU op               3
FP ALU op               Store double            2
Load double             FP ALU op               1
Load double             Store double            0
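The latency table above can be turned into a small issue-cycle calculator — a sketch of a single-issue, in-order pipeline that stalls a consumer until its producer's latency has elapsed (branch delays and structural hazards are deliberately ignored; `schedule_cycles` and the tuple encoding are illustrative names):

```python
# Latency table from the slides: (producer class, consumer class) -> extra cycles
LATENCY = {
    ("Load double", "FP ALU op"): 1,
    ("Load double", "Store double"): 0,
    ("FP ALU op", "FP ALU op"): 3,
    ("FP ALU op", "Store double"): 2,
}

def schedule_cycles(instrs):
    """Cycle in which the last instruction issues, single-issue in-order:
    one instruction per cycle plus stalls whenever a consumer would issue
    before its producer's latency has elapsed. instrs = (class, dst, srcs)."""
    ready = {}                        # reg -> (producer class, issue cycle)
    cycle = 0
    for cls, dst, srcs in instrs:
        cycle += 1                    # next issue slot
        for r in srcs:
            if r in ready:
                pcls, pcycle = ready[r]
                lat = LATENCY.get((pcls, cls), 0)
                cycle = max(cycle, pcycle + lat + 1)   # stall if result not ready
        if dst:
            ready[dst] = (cls, cycle)
    return cycle
```

For the body L.D / ADD.D / S.D this yields issue cycles 1, 3 and 6: one stall after the load and two after the FP add, matching the table.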
Where to Insert Stalls?

How would this loop be executed on the MIPS FP pipeline?

Loop: L.D   F0,0(R1)
      ADD.D F4,F0,F2
      S.D   F4,0(R1)
      ADDI  R1,R1,8
      BNE   R1,R2,Loop

What are the true (flow) dependences?
3 intra-iteration dependences !!
3 inter-iteration dependences !!
9/20/2024
ECA  H.Corporaal
42
Where to Insert Stalls

How would this loop be executed on the MIPS FP pipeline? 10 cycles per iteration:

Loop: L.D   F0,0(R1)
      stall
      ADD.D F4,F0,F2
      stall
      stall
      S.D   0(R1),F4
      ADDI  R1,R1,8
      stall
      BNE   R1,R2,Loop
      stall
Code Scheduling to Avoid Stalls

Can we reorder the instructions to avoid stalls?
Execution time reduced from 10 to 6 cycles per iteration:

Loop: L.D   F0,0(R1)
      ADDI  R1,R1,8
      ADD.D F4,F0,F2
      stall
      BNE   R1,R2,Loop
      S.D   -8(R1),F4    ; watch out! offset adjusted: R1 was already incremented

But only 3 instructions perform useful work, the rest is loop overhead !!
How to avoid this ???
Loop Unrolling: increasing ILP

At source level:
for (i=1; i<=1000; i++)
   x[i] = x[i] + s;

becomes

for (i=1; i<=1000; i=i+4) {
   x[i]   = x[i] + s;
   x[i+1] = x[i+1]+s;
   x[i+2] = x[i+2]+s;
   x[i+3] = x[i+3]+s;
}

14 cycles/4 iterations => 3.5 cycles/iteration

Any drawbacks?
loop unrolling increases code size
more registers needed
 
Code after scheduling:

Loop: L.D    F0,0(R1)
      L.D    F6,8(R1)
      L.D    F10,16(R1)
      L.D    F14,24(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      S.D    0(R1),F4
      S.D    8(R1),F8
      ADDI   R1,R1,32
      S.D    -16(R1),F12
      BNE    R1,R2,Loop
      S.D    -8(R1),F16
Strip Mining
 
Split the iteration space in chunks (strips) of k iterations each
Strip can be completely 
unrolled
 (as in previous slide)
Strip can be 
vectorized
 (on SIMD or GPU architecture)
What if the number of loop iterations is unknown?
Number of iterations = n; goal: make k copies of the loop body
Solution: generate a pair of loops:
The first executes n mod k times
The second executes n / k times
"Strip mining"
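The strip-mining recipe — a remainder loop of n mod k iterations followed by an unrolled main loop of n/k trips — in a minimal sketch with k = 4 (the function name is illustrative):

```python
K = 4  # unroll factor

def add_s_strip_mined(x, s):
    """Strip-mined x[i] += s: first a remainder loop of n mod K
    iterations, then a main loop whose body is unrolled K times."""
    n = len(x)
    rem = n % K
    for i in range(rem):          # first loop: n mod K iterations, not unrolled
        x[i] += s
    for i in range(rem, n, K):    # second loop: n // K trips, body unrolled
        x[i] += s
        x[i + 1] += s
        x[i + 2] += s
        x[i + 3] += s
    return x
```

Because the remainder is peeled off first, the main loop's trip count is always a multiple of K, so the unrolled body never reads past the end of the array.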
Hardware support for compile-time scheduling
Predication
see earlier slides on conditional move and
 
if-conversion
Speculative loads
Deferred exceptions
Deferred Exceptions

Source:
if (A==0)
   A = B;
else
   A = A+4;

Straightforward code:
    ld   r1,0(r3) # load A
    bnez r1,L1    # test A
    ld   r1,0(r2) # then part; load B
    j    L2
L1: addi r1,r1,4  # else part; inc A
L2: st   r1,0(r3) # store A

How to optimize when the then-part (A = B;) is usually selected? => Load B before the branch:
    ld   r1,0(r3) # load A
    ld   r9,0(r2) # speculative load B
    beqz r1,L3    # test A
    addi r9,r1,4  # else part
L3: st   r9,0(r3) # store A

What if this load generates a page fault?
What if this load generates an "index-out-of-bounds" exception?
HW supporting Speculative Loads

2 new instructions:
Speculative load (sld): does not generate exceptions
Speculation check instruction (speck): checks for an exception; the exception occurs when this instruction is executed.

    ld    r1,0(r3) # load A
    sld   r9,0(r2) # speculative load of B
    bnez  r1,L1    # test A
    speck 0(r2)    # perform exception check
    j     L2
L1: addi  r9,r1,4  # else part
L2: st    r9,0(r3) # store A
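The deferred-exception semantics of sld/speck can be modeled in a few lines. Everything here is a hypothetical sketch: `MEM`, the `Fault` class and the poison-value encoding are illustrative stand-ins for hardware NaT bits:

```python
class Fault(Exception):
    """Stand-in for a load exception (page fault, bounds violation, ...)."""

MEM = {0: 0, 8: 99}          # toy memory; address 4 is deliberately unmapped

def ld(addr):
    """Ordinary load: faults immediately on a bad address."""
    if addr not in MEM:
        raise Fault(f"fault at address {addr}")
    return MEM[addr]

def sld(addr):
    """Speculative load: never traps; a fault is deferred and travels
    with the value (like a hardware poison / NaT bit)."""
    try:
        return ld(addr), None
    except Fault as e:
        return None, e

def speck(deferred):
    """Speculation check: the deferred fault, if any, is raised only here,
    i.e. only on the path that actually uses the speculative value."""
    if deferred is not None:
        raise deferred
```

A speculative load hoisted above the branch thus cannot crash the wrong path: the fault fires only if control reaches the speck on the path that needs the loaded value.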
What did you learn?
O-O-O execution mechanisms
Multi-issue
VLIW vs Superscalar
Advanced branch prediction techniques
HW and SW speculation techniques
TTAs
Measuring ILP in programs
Compiler tricks
And.... many illustrative examples