Embedded Computer Architecture - Instruction Level Parallel Architectures Overview

Embedded Computer Architecture
5SAI0
Instruction Level Parallel
Architectures
Part III (final part)
 
Henk Corporaal
www.ics.ele.tue.nl/~heco/courses/ECA
TUEindhoven
2021-2022
Topics on ILP architectures
 
Introduction, Hazards (short recap)
Out-Of-Order (OoO) execution 
(H&P sect 3.4 - 3.6)
Branch prediction 
(H&P sect 3.3+3.9)
Multiple issue 
(H&P sect 3.7 – 3.8)
Superscalar
VLIW & 
EPIC
TTA 
(not in H&P)
How much ILP is there?
What can the compiler do? 
(H&P sect 3.2)
Material 
Ch 3
 (H&P or Dubois, second part)
9/20/2024
ECA  H.Corporaal
2
Recap: Single Issue RISC vs Superscalar
Recap: Single Issue RISC vs VLIW
Recap: VLIW:
general concept
A VLIW architecture with 7 FUs: three integer FUs, two load/store units and two FP FUs, fed by an integer and a floating-point register file, with instruction and data memory.
VLIW alternative: Intel EPIC Architecture IA-64
 
Explicit Parallel Instruction Computer (EPIC)
IA-64 architecture -> Itanium, first realization 2001
 
Idea: instead of packing a fixed number of operations into a large instruction, keep the original instructions but add control bits indicating whether the current instruction can be executed in parallel with the previous one.
 
Many Registers:
128  64-bit integer registers (stacked, rotating)
128  82-bit floating-point registers (rotating)
64   1-bit boolean (predicate) registers
8    64-bit branch target address registers
system control registers
See http://en.wikipedia.org/wiki/Itanium
Itanium
organization
EPIC Architecture: IA-64
 
Instructions are grouped in 128-bit bundles (of 3 instructions each):
3 * 41-bit instructions
5 template bits, indicating instruction types and stop location
Each 41-bit instruction:
starts with a 4-bit opcode, and
ends with a 6-bit guard (boolean) register-id
Supports speculative loads (follows later)
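The bundle layout (3 x 41-bit slots + 5 template bits = 128 bits) can be checked with a small unpacking sketch. The bit ordering below (template in the low bits, slots above it) matches the actual IA-64 encoding, but treat it as an illustration of the slide's field sizes rather than a full decoder:

```python
def unpack_bundle(bundle: int):
    """Split a 128-bit bundle into its 5-bit template and three 41-bit slots."""
    assert 0 <= bundle < (1 << 128)
    template = bundle & 0x1F                 # 5 template bits: type + stop location
    mask41 = (1 << 41) - 1
    slots = [(bundle >> (5 + 41 * i)) & mask41 for i in range(3)]
    return template, slots

def decode_slot(instr: int):
    """Per the slide: a 41-bit instruction starts with a 4-bit opcode
    and ends with a 6-bit guard (predicate) register id."""
    opcode = (instr >> 37) & 0xF             # top 4 bits
    guard = instr & 0x3F                     # low 6 bits
    return opcode, guard
```

For example, a bundle built as `(7 << 87) | (3 << 46) | (1 << 5) | 0b10101` unpacks to template `0b10101` and slots `[1, 3, 7]`.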
Itanium 2:
McKinley
C
hip layout:
EPIC Architecture advantages

EPIC allows for more binary compatibility than a plain VLIW:
Function-unit assignment is performed at run-time
=> the number of FUs is architecturally invisible
Lock (stall) when FU results are not available
=> FU pipeline latencies are architecturally invisible

Example (the consumer simply waits until r2 is ready):
Mul r2,r5,r6   // 2-cycle multiply
Add r1,r2,r3   // stalls until r2 is ready

Intel stopped Itanium in 2017: why?
See the course and other websites for more info
(look at related material)
Architecture options

Bind operations (where) vs. schedule operations (when):

                   Schedule: static    Schedule: dynamic
Bind: static       VLIW                TRIPS
Bind: dynamic      EPIC                Superscalar

Other design choices, like:
Visible pipeline latencies (yes/no)
Visible FU-RF data transports (yes/no)
Visible local memory (called Scratchpad) vs. Cache
... and many more ...
Architectural invisibility gives backwards binary compatibility !!
What did we talk about?

ILP = Instruction Level Parallelism = the ability to perform multiple operations (or instructions), from a single instruction stream, in parallel

VLIW = Very Long Instruction Word architecture; example instruction format (5-issue):
operation 1 | operation 2 | operation 3 | operation 4 | operation 5
VLIW
evaluation
CPU datapath with function units FU-1 ... FU-5: instruction memory, fetch and decode units, register file and data memory. The control problem: with N function units the bypassing network scales as O(N^2) and the register file as O(N)-O(N^2).
VLIW evaluation
 
Strong points of VLIW:
Scalable (add more FUs)
Flexible (an FU can be almost anything; e.g. multimedia support)
 
Weak points:
With N FUs:
Bypassing complexity:     O(N^2)
Register file complexity: O(N)
Register file size:       O(N^2)
Register file design restricts FU flexibility
 
Solution: 
.................................................. ?
Solution
Mirror the programming paradigm:
TTA: Transport Triggered Architecture
Transport
Triggered
Architecture
General organization 
of a TTA
TTA structure; datapath details
 
Note: TTAs have a lot in common with the Blocks CGRA from Lab 1
TTA hardware characteristics
 
Modular: building blocks easy to reuse
Very flexible and scalable
easy inclusion of Special Function Units (SFUs)
Very low complexity
> 50% reduction on # register ports
reduced bypass complexity (no associative matching)
up to 80 % reduction in bypass connectivity
trivial decoding
reduced register pressure
easy register file partitioning (a single port is enough!)
TTA software characteristics

Add r3,r1,r2 becomes three moves:
r1 -> add.o1;
r2 -> add.o2;
add.r -> r3

That does not look like an improvement !?!
Program TTAs

How to do data operations?
1. Transport of operands to FU
   Operand move(s)
   Trigger move
2. Transport of results from FU
   Result move(s)

How to do control flow?
1. Jump:   #jump-address -> pc
2. Branch: #displacement -> pcd
3. Call:   pc -> r; #call-address -> pcd
 
Example: Add r3,r1,r2 becomes

r1 -> Oint      // operand move to integer unit
r2 -> Tadd      // trigger move to integer unit
......          // addition operation in progress
Rint -> r3      // result move from integer unit
Scheduling example

VLIW:
add r1,r1,r2
sub r4,r1,95

TTA:
r1 -> add.o1, r2 -> add.o2
add.r -> sub.o1, 95 -> sub.o2
sub.r -> r4

(Datapath: integer RF, immediate unit, two integer ALUs, load/store unit)
TTA Instruction format

A TTA instruction consists of multiple move fields (move 1 ... move 4).

General MOVE field:
g:   guard specifier
i:   immediate specifier
src: source
dst: destination

How to use immediates? Small (6 bits): g | 1 | imm | dst. Long (32 bits): g | 0 | Ir-1 | dst, with the immediate in a separate register field.
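A sketch of packing and unpacking one MOVE field. The slide names the four sub-fields but not their widths, so the widths below are assumptions chosen only for illustration:

```python
# Hypothetical field widths -- the slide specifies the fields (g, i, src, dst)
# but not their sizes, so these are illustrative only.
G_BITS, I_BITS, SRC_BITS, DST_BITS = 3, 1, 6, 6

def encode_move(g, i, src, dst):
    """Pack one MOVE field: guard specifier, immediate flag, source, destination."""
    for val, bits in ((g, G_BITS), (i, I_BITS), (src, SRC_BITS), (dst, DST_BITS)):
        assert 0 <= val < (1 << bits)
    return (((g << I_BITS | i) << SRC_BITS | src) << DST_BITS) | dst

def decode_move(word):
    """Inverse of encode_move: recover (g, i, src, dst)."""
    dst = word & ((1 << DST_BITS) - 1); word >>= DST_BITS
    src = word & ((1 << SRC_BITS) - 1); word >>= SRC_BITS
    i = word & ((1 << I_BITS) - 1);     word >>= I_BITS
    return word, i, src, dst
```

Decoding is trivial (fixed fields, no associative matching), which is exactly the "trivial decoding" advantage listed above.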
Summary on ILP architecture
 
Superscalar
Binary compatible
Multi-issue, OoO, branch speculation
VLIW
Not compatible, requires re-compilation of source code
More energy efficient
TTA (and Blocks CGRA)
A kind of VLIW
Solves / relaxes remaining VLIW inefficiencies when scaling up
RF #ports reduction
Connectivity reduction
Topics on ILP architectures
Introduction, Hazards (short recap)
Out-Of-Order (OoO) execution:
Branch prediction
Multiple issue
How much ILP is there?
Measuring ILP
Exceeding ILP limits
What can the compiler do?
Material 
Ch 3
 (H&P or Dubois, second part)
How parallel should we make a
processor?
Depends on 
how much parallelism is
there in your application?
Measuring available ILP: How?
 
Using existing compiler
Using trace analysis
Track all the real data dependencies (RaWs) of instructions from issue
window
register dependences
memory dependences
Check for correct branch prediction
if prediction correct continue
if wrong, flush schedule and start in next cycle
Trace analysis

Program:
for i := 0..2
   A[i] := i;
S := X+3;

Compiled code:
      set  r1,0
      set  r2,3
      set  r3,&A
Loop: st   r1,0(r3)
      add  r1,r1,1
      add  r3,r3,4
      brne r1,r2,Loop
      add  r1,r5,3

Trace (16 instructions): the three set's, the three executed iterations of the loop body, and the final add r1,r5,3.

How parallel can this code be executed?
Trace analysis

Max ILP = Speedup = Lserial / Lparallel = 16 / 6 = 2.7

Is this the maximum?
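The trace-analysis idea can be sketched as an ASAP scheduler that honors only true (RaW) dependences, under simplifying assumptions: perfect branch prediction, unit latency, infinite issue width, and full register renaming (WAR/WAW ignored). The function name and the `(written_reg, read_regs)` tuple encoding are illustrative, not from the slides:

```python
def asap_ilp(trace):
    """ASAP-schedule a trace on RaW dependences only (unit latency,
    infinite resources): each instruction issues one cycle after the
    latest writer of any register it reads. Returns (depth, ILP)."""
    ready_at = {}                    # register -> cycle its value is ready
    depth = 0
    for dst, srcs in trace:          # instruction = (written reg, read regs)
        cycle = 1 + max((ready_at.get(r, 0) for r in srcs), default=0)
        if dst is not None:
            ready_at[dst] = cycle    # renaming assumed: WAR/WAW hazards ignored
        depth = max(depth, cycle)
    return depth, len(trace) / depth
```

A toy diamond-shaped trace illustrates the idea: two independent consumers of r1 run in the same cycle, so four instructions fit in three levels (ILP 4/3).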
Ideal Processor
 
Assumptions for ideal/perfect processor:
 
1. 
Register renaming
 
– infinite number of virtual registers => all register WAW & WAR
hazards avoided
 
2. 
Branch and Jump prediction
 
– Perfect => all program instructions available for
execution
 
3. 
Memory-address alias analysis
 
– addresses are known. A store can be moved before a
load provided addresses not equal
 
Also:
unlimited number of instructions issued per cycle (unlimited resources), and
unlimited instruction window
perfect caches
1-cycle latency for all instructions (including FP * and /)

Programs were compiled using the MIPS compiler with maximum optimization level
Upper Limit to ILP: 
Ideal Processor
IPC chart (gcc, espresso, li, fpppp, doducd, tomcatv) — Integer: 18 - 60; FP: 75 - 150.
Window Size and Branch Impact
Change from infinite instr. window to 
examine 2000
, and 
issue at most 64 
instructions
per cycle
IPC chart per program and branch predictor (Perfect, Tournament, BHT(512), Profile, No prediction) — FP: 15 - 45; Integer: 6 - 12.
Impact of Limited Renaming Registers
Assume: 2000 instr. window, 64 instr. issue, 8K 2-level predictor
(slightly better than tournament predictor)
IPC chart per program and number of renaming registers (Infinite, 256, 128, 64, 32) — Integer: 5 - 15; FP: 11 - 45.
Memory Address Alias Impact
Assume:  2000 instr. window, 64 instr. issue, 8K 2-level predictor, 256 renaming
registers
IPC chart per program and memory-disambiguation quality (Perfect, Global/stack perfect, Inspection, None) — FP: 4 - 45 (Fortran, no heap); Integer: 4 - 9.
Window Size Impact
Assumptions: Perfect disambiguation, 1K Selective predictor, 16 entry return stack, 64
renaming registers, 
issue as many as in window
IPC chart per program and window/issue size (Infinite, 256, 128, 64, 32, 16, 8, 4) — Integer: 6 - 12; FP: 8 - 45.
How to Exceed ILP Limits of this Study?
 
Solve WAR and WAW hazards through memory:
eliminated WAW and WAR hazards through register renaming, but not yet for
memory operands
 
Avoid unnecessary dependences
E.g., compiler did not unroll loops so we keep the iteration index variable
dependence; e.g. in i=i+1
 
Overcoming the data flow limit: 
value prediction = 
predicting values and
speculating on prediction
E.g. predict next load address and speculate by reordering loads and stores.
Topics on ILP architectures
Introduction, Hazards (short recap)
Out-Of-Order (OoO) execution:
Branch prediction
Multiple issue
How much ILP is there?
Measuring ILP
Exceeding ILP limits
What can the compiler do?
Material 
Ch 3
 (H&P or Dubois, second part)
What can the compiler do?
 
Loop transformations
Code scheduling
 
Special course 5LIM0
on Parallelization and
Compilers
LLVM based
An oldie on compilers
Basic compiler techniques
Dependencies limit ILP (Instruction-Level Parallelism)
We cannot always find sufficient independent operations to fill all the delay slots
May result in pipeline stalls
Scheduling
 to avoid stalls (= reorder instructions)
(Source-)code transformations 
to create more exploitable parallelism
Loop Unrolling
Loop Merging (Fusion)
see online slide-set about loop transformations !!
Dependencies Limit ILP: Example

C loop:
for (i=1; i<=1000; i++)
   x[i] = x[i] + s;

MIPS assembly code:          ; R1 = &x[1], R2 = &x[1000]+8, F2 = s
Loop: L.D   F0,0(R1)    ; F0 = x[i]
      ADD.D F4,F0,F2    ; F4 = x[i]+s
      S.D   0(R1),F4    ; x[i] = F4
      ADDI  R1,R1,8     ; R1 = &x[i+1]
      BNE   R1,R2,Loop  ; branch if R1 != &x[1000]+8
Schedule this on an example processor
 
FP operations are mostly 
multicycle
The pipeline must be 
stalled
 if an instruction uses the result of a not yet
finished multicycle operation
We'll assume the following latencies:

Producing instruction   Consuming instruction   Latency (clock cycles)
FP ALU op               FP ALU op               3
FP ALU op               Store double            2
Load double             FP ALU op               1
Load double             Store double            0
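The latency table above can be turned into a small issue-cycle calculator — a sketch of a single-issue, in-order pipeline that stalls a consumer until its producer's latency has elapsed (branch delays and structural hazards are deliberately ignored; `schedule_cycles` and the tuple encoding are illustrative names):

```python
# Latency table from the slides: (producer class, consumer class) -> extra cycles
LATENCY = {
    ("Load double", "FP ALU op"): 1,
    ("Load double", "Store double"): 0,
    ("FP ALU op", "FP ALU op"): 3,
    ("FP ALU op", "Store double"): 2,
}

def schedule_cycles(instrs):
    """Cycle in which the last instruction issues, single-issue in-order:
    one instruction per cycle plus stalls whenever a consumer would issue
    before its producer's latency has elapsed. instrs = (class, dst, srcs)."""
    ready = {}                        # reg -> (producer class, issue cycle)
    cycle = 0
    for cls, dst, srcs in instrs:
        cycle += 1                    # next issue slot
        for r in srcs:
            if r in ready:
                pcls, pcycle = ready[r]
                lat = LATENCY.get((pcls, cls), 0)
                cycle = max(cycle, pcycle + lat + 1)   # stall if result not ready
        if dst:
            ready[dst] = (cls, cycle)
    return cycle
```

For the body L.D / ADD.D / S.D this yields issue cycles 1, 3 and 6: one stall after the load and two after the FP add, matching the table.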
Where to Insert Stalls?

How would this loop be executed on the MIPS FP pipeline?

Loop: L.D   F0,0(R1)
      ADD.D F4,F0,F2
      S.D   F4,0(R1)
      ADDI  R1,R1,8
      BNE   R1,R2,Loop

What are the true (flow) dependences?
3 intra-iteration dependences !!
3 inter-iteration dependences !!
9/20/2024
ECA  H.Corporaal
42
Where to Insert Stalls

How would this loop be executed on the MIPS FP pipeline? 10 cycles per iteration:

Loop: L.D   F0,0(R1)
      stall
      ADD.D F4,F0,F2
      stall
      stall
      S.D   0(R1),F4
      ADDI  R1,R1,8
      stall
      BNE   R1,R2,Loop
      stall
Code Scheduling to Avoid Stalls

Can we reorder the instructions to avoid stalls?
Execution time reduced from 10 to 6 cycles per iteration:

Loop: L.D   F0,0(R1)
      ADDI  R1,R1,8
      ADD.D F4,F0,F2
      stall
      BNE   R1,R2,Loop
      S.D   -8(R1),F4    ; watch out! offset adjusted: R1 was already incremented

But only 3 instructions perform useful work, the rest is loop overhead !!
How to avoid this ???
Loop Unrolling: increasing ILP

At source level:
for (i=1; i<=1000; i++)
   x[i] = x[i] + s;

becomes

for (i=1; i<=1000; i=i+4) {
   x[i]   = x[i] + s;
   x[i+1] = x[i+1]+s;
   x[i+2] = x[i+2]+s;
   x[i+3] = x[i+3]+s;
}

14 cycles/4 iterations => 3.5 cycles/iteration

Any drawbacks?
loop unrolling increases code size
more registers needed
 
Code after scheduling:

Loop: L.D    F0,0(R1)
      L.D    F6,8(R1)
      L.D    F10,16(R1)
      L.D    F14,24(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      S.D    0(R1),F4
      S.D    8(R1),F8
      ADDI   R1,R1,32
      S.D    -16(R1),F12
      BNE    R1,R2,Loop
      S.D    -8(R1),F16
Strip Mining
 
Split the iteration space in chunks (strips) of k iterations each
Strip can be completely 
unrolled
 (as in previous slide)
Strip can be 
vectorized
 (on SIMD or GPU architecture)
What if the number of loop iterations is unknown?
Number of iterations = n; goal: make k copies of the loop body
Solution: generate a pair of loops:
The first executes n mod k times
The second executes n / k times
"Strip mining"
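The strip-mining recipe — a remainder loop of n mod k iterations followed by an unrolled main loop of n/k trips — in a minimal sketch with k = 4 (the function name is illustrative):

```python
K = 4  # unroll factor

def add_s_strip_mined(x, s):
    """Strip-mined x[i] += s: first a remainder loop of n mod K
    iterations, then a main loop whose body is unrolled K times."""
    n = len(x)
    rem = n % K
    for i in range(rem):          # first loop: n mod K iterations, not unrolled
        x[i] += s
    for i in range(rem, n, K):    # second loop: n // K trips, body unrolled
        x[i] += s
        x[i + 1] += s
        x[i + 2] += s
        x[i + 3] += s
    return x
```

Because the remainder is peeled off first, the main loop's trip count is always a multiple of K, so the unrolled body never reads past the end of the array.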
Hardware support for compile-time scheduling
Predication
see earlier slides on conditional move and
 
if-conversion
Speculative loads
Deferred exceptions
Deferred Exceptions

Source:
if (A==0)
   A = B;
else
   A = A+4;

Straightforward code:
    ld   r1,0(r3) # load A
    bnez r1,L1    # test A
    ld   r1,0(r2) # then part; load B
    j    L2
L1: addi r1,r1,4  # else part; inc A
L2: st   r1,0(r3) # store A

How to optimize when the then-part (A = B;) is usually selected? => Load B before the branch:
    ld   r1,0(r3) # load A
    ld   r9,0(r2) # speculative load B
    beqz r1,L3    # test A
    addi r9,r1,4  # else part
L3: st   r9,0(r3) # store A

What if this load generates a page fault?
What if this load generates an "index-out-of-bounds" exception?
HW supporting Speculative Loads

2 new instructions:
Speculative load (sld): does not generate exceptions
Speculation check instruction (speck): checks for an exception; the exception occurs when this instruction is executed.

    ld    r1,0(r3) # load A
    sld   r9,0(r2) # speculative load of B
    bnez  r1,L1    # test A
    speck 0(r2)    # perform exception check
    j     L2
L1: addi  r9,r1,4  # else part
L2: st    r9,0(r3) # store A
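The deferred-exception semantics of sld/speck can be modeled in a few lines. Everything here is a hypothetical sketch: `MEM`, the `Fault` class and the poison-value encoding are illustrative stand-ins for hardware NaT bits:

```python
class Fault(Exception):
    """Stand-in for a load exception (page fault, bounds violation, ...)."""

MEM = {0: 0, 8: 99}          # toy memory; address 4 is deliberately unmapped

def ld(addr):
    """Ordinary load: faults immediately on a bad address."""
    if addr not in MEM:
        raise Fault(f"fault at address {addr}")
    return MEM[addr]

def sld(addr):
    """Speculative load: never traps; a fault is deferred and travels
    with the value (like a hardware poison / NaT bit)."""
    try:
        return ld(addr), None
    except Fault as e:
        return None, e

def speck(deferred):
    """Speculation check: the deferred fault, if any, is raised only here,
    i.e. only on the path that actually uses the speculative value."""
    if deferred is not None:
        raise deferred
```

A speculative load hoisted above the branch thus cannot crash the wrong path: the fault fires only if control reaches the speck on the path that needs the loaded value.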
What did you learn?
O-O-O execution mechanisms
Multi-issue
VLIW vs Superscalar
Advanced branch prediction techniques
HW and SW speculation techniques
TTAs
Measuring ILP in programs
Compiler tricks
And.... many illustrative examples