Understanding Computer Abstraction and Performance Metrics

 
COSC 3406: Computer Organization
Lecture 3: Computer Abstraction
Kalpdrum Passi, Fall 2016 (www.cs.laurentian.ca/kpassi/cosc3406.html)
 
Chapter 1 Computer Abstractions and Technology
 
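Instruction Count and CPI

$$\text{Clock Cycles} = \text{Instruction Count} \times \text{Cycles per Instruction}$$
$$\text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}$$

• Instruction count for a program
  Determined by program, ISA, and compiler
• Average cycles per instruction (CPI)
  Determined by CPU hardware
  If different instructions have different CPI
    - Average CPI affected by instruction mix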
 
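CPI Example

• Computer A: Cycle Time = 250 ps, CPI = 2.0
• Computer B: Cycle Time = 500 ps, CPI = 1.2
• Same ISA
• Which is faster, and by how much?

$$\text{CPU Time}_A = \text{Instruction Count} \times \text{CPI}_A \times \text{Cycle Time}_A = I \times 2.0 \times 250\,\text{ps} = I \times 500\,\text{ps}$$
$$\text{CPU Time}_B = \text{Instruction Count} \times \text{CPI}_B \times \text{Cycle Time}_B = I \times 1.2 \times 500\,\text{ps} = I \times 600\,\text{ps}$$

A is faster…

$$\frac{\text{CPU Time}_B}{\text{CPU Time}_A} = \frac{I \times 600\,\text{ps}}{I \times 500\,\text{ps}} = 1.2$$

…by this much: A is 1.2 times as fast as B. Because both computers implement the same ISA, the instruction count I is the same and cancels.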
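The calculation above can be sketched in C. This is a minimal illustration, not part of the original lecture: the function names are hypothetical, and the instruction count of 10^9 used in the check is an arbitrary assumption, since only the ratio matters when both computers run the same ISA.

```c
#include <assert.h>

/* CPU time = instruction count x CPI x clock cycle time.
 * With the cycle time in picoseconds (ps), the result is in ps. */
double cpu_time_ps(double instruction_count, double cpi, double cycle_time_ps)
{
    assert(instruction_count > 0 && cpi > 0 && cycle_time_ps > 0);
    return instruction_count * cpi * cycle_time_ps;
}

/* Equivalent form: CPU time = instruction count x CPI / clock rate.
 * With the clock rate in Hz (cycles per second), the result is in seconds. */
double cpu_time_s(double instruction_count, double cpi, double clock_rate_hz)
{
    assert(instruction_count > 0 && cpi > 0 && clock_rate_hz > 0);
    return instruction_count * cpi / clock_rate_hz;
}

/* "X is n times as fast as Y" means n = time_Y / time_X. */
double times_as_fast(double time_x, double time_y)
{
    assert(time_x > 0 && time_y > 0);
    return time_y / time_x;
}
```

With 10^9 instructions, cpu_time_ps gives 5×10^11 ps for A and 6×10^11 ps for B, and times_as_fast returns 1.2; any other instruction count gives the same ratio. A 250 ps cycle corresponds to a 4 GHz clock, so cpu_time_s(1e9, 2.0, 4e9) gives the same 0.5 s for A.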
 
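CPI in More Detail

• If different instruction classes take different numbers of cycles:

$$\text{Clock Cycles} = \sum_{i=1}^{n} \left(\text{CPI}_i \times \text{Instruction Count}_i\right)$$

• Weighted average CPI:

$$\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left(\text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}}\right)$$

  where Instruction Count_i / Instruction Count is the relative frequency of class i.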
 
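CPI Example

• Alternative compiled code sequences using instructions in classes A, B, C

Class   CPI for class   IC in sequence 1   IC in sequence 2
A       1               2                  4
B       2               1                  1
C       3               2                  1

Sequence 1: IC = 5
  Clock Cycles = 2×1 + 1×2 + 2×3 = 10
  Avg. CPI = 10/5 = 2.0

Sequence 2: IC = 6
  Clock Cycles = 4×1 + 1×2 + 1×3 = 9
  Avg. CPI = 9/6 = 1.5

Sequence 2 executes more instructions but needs fewer clock cycles, so it is faster.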
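A minimal C sketch of the weighted-CPI calculation, using the per-class CPIs (A = 1, B = 2, C = 3) and the instruction counts of sequences 1 and 2 from this example. The function names are hypothetical, and counts are stored as double only to keep the arithmetic simple.

```c
#include <assert.h>
#include <stddef.h>

/* Total clock cycles = sum over classes i of CPI_i x IC_i. */
double total_cycles(const double cpi[], const double ic[], size_t classes)
{
    double cycles = 0.0;
    for (size_t i = 0; i < classes; i++)
        cycles += cpi[i] * ic[i];
    return cycles;
}

/* Total instruction count = sum of the per-class counts. */
double total_instructions(const double ic[], size_t classes)
{
    double total = 0.0;
    for (size_t i = 0; i < classes; i++)
        total += ic[i];
    return total;
}

/* Weighted average CPI: each class's CPI times its relative frequency
 * IC_i / IC. Mathematically equal to total_cycles / total_instructions. */
double average_cpi(const double cpi[], const double ic[], size_t classes)
{
    double total = total_instructions(ic, classes);
    assert(total > 0);
    double avg = 0.0;
    for (size_t i = 0; i < classes; i++)
        avg += cpi[i] * (ic[i] / total);
    return avg;
}
```

For sequence 1 this yields 10 cycles over 5 instructions (CPI 2.0); for sequence 2, 9 cycles over 6 instructions (CPI 1.5).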
 
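Performance Summary (The BIG Picture)

$$\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$

• Performance depends on
  Algorithm: affects IC, possibly CPI
  Programming language: affects IC, CPI
  Compiler: affects IC, CPI
  Instruction set architecture: affects IC, CPI, T_c (clock cycle time)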
 
§1.7 The Power Wall

Power Trends

• Power is limited by how much heat we can remove (cool).
• In the post-PC era, the really valuable resource is energy.
• The dominant technology for integrated circuits (ICs) is CMOS (complementary metal oxide semiconductor).
 
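Power Trends

• The primary source of energy consumption is so-called dynamic energy, that is, energy consumed when transistors switch states from 0 to 1 and vice versa.
• Dynamic energy depends on the capacitive loading of each transistor and the voltage applied:

$$\text{Energy} \propto \text{Capacitive load} \times \text{Voltage}^2$$

• This is the energy of a pulse during the logic transition 0 → 1 → 0 or 1 → 0 → 1.
• The power required per transistor is the product of the energy of a transition and the frequency of transitions:

$$\text{Power} = \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency}$$

• Over successive technology generations, clock rates grew by a factor of 1000 while power grew by only a factor of 30, because supply voltage dropped from 5 V to 1 V.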
 
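Reducing Power

• Suppose a new CPU has
  85% of the capacitive load of the old CPU
  15% voltage and 15% frequency reduction

$$\frac{P_{\text{new}}}{P_{\text{old}}} = \frac{C_{\text{old}} \times 0.85 \times (V_{\text{old}} \times 0.85)^2 \times F_{\text{old}} \times 0.85}{C_{\text{old}} \times V_{\text{old}}^2 \times F_{\text{old}}} = 0.85^4 = 0.52$$

• The power wall
  We can’t reduce voltage further
  We can’t remove more heat
• How else can we improve performance?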
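A C sketch of the power-reduction arithmetic. It assumes the dynamic power model P = C × V² × F and normalizes the old CPU's C, V and F to 1, since only the ratio P_new / P_old is needed; the function names are hypothetical.

```c
#include <assert.h>

/* Dynamic power model: P = C x V^2 x F. Units only need to be
 * consistent, because the sketch compares ratios. */
double dynamic_power(double capacitive_load, double voltage, double frequency)
{
    assert(capacitive_load >= 0 && voltage >= 0 && frequency >= 0);
    return capacitive_load * voltage * voltage * frequency;
}

/* P_new / P_old when C, V and F are each multiplied by a scale factor.
 * The old CPU is normalized to C = V = F = 1. */
double power_ratio(double c_scale, double v_scale, double f_scale)
{
    double p_old = dynamic_power(1.0, 1.0, 1.0);
    double p_new = dynamic_power(c_scale, v_scale, f_scale);
    return p_new / p_old;
}
```

power_ratio(0.85, 0.85, 0.85) returns about 0.522, that is, 0.85^4. Voltage supplies two of the four factors of 0.85, which is why lowering voltage was the most effective lever until it could not be lowered further.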
 
§1.8 The Sea Change: The Switch to Multiprocessors

Uniprocessor Performance

• Constrained by power, instruction-level parallelism, and memory latency
 
Multiprocessors

• Multicore microprocessors
  More than one processor per chip
• Requires explicitly parallel programming
  Compare with instruction-level parallelism
    - Hardware executes multiple instructions at once
    - Hidden from the programmer
  Hard to do
    - Programming for performance
    - Load balancing
    - Optimizing communication and synchronization
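Load balancing, in its simplest static form, means giving each core an equal share of the work. A minimal C sketch of a block partition that splits n independent work items across p workers so that chunk sizes differ by at most one; the helper name is hypothetical, and the threads that would process each range are omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Static block partition: worker w (0 <= w < p) gets items [*begin, *end).
 * The first n % p workers each take one extra item, so chunk sizes
 * differ by at most one. */
void block_range(size_t n, size_t p, size_t w, size_t *begin, size_t *end)
{
    assert(p > 0 && w < p && begin != NULL && end != NULL);
    size_t base = n / p, extra = n % p;
    *begin = w * base + (w < extra ? w : extra);
    *end = *begin + base + (w < extra ? 1 : 0);
}
```

For 10 items on 3 workers this yields the ranges [0, 4), [4, 7) and [7, 10). Equal item counts balance the load only when items cost about the same; uneven per-item work calls for dynamic schemes, such as handing out chunks from a shared queue.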
 
SPEC CPU Benchmark

• Programs used to measure performance
  Supposedly typical of actual workload
• System Performance Evaluation Cooperative (SPEC), today the Standard Performance Evaluation Corporation
  Develops benchmarks for CPU, I/O, Web, …
• SPEC CPU2006
  Elapsed time to execute a selection of programs
    - Negligible I/O, so focuses on CPU performance
  Dividing the execution time of a reference processor by the execution time of the evaluated computer normalizes the execution time measurements; this normalization yields a measure called the SPECratio.
  The SPECratio is inversely proportional to execution time, so a bigger SPECratio means higher performance.
 
CINT2006 for Intel Core i7 920

• A CINT2006 (integer) or CFP2006 (floating-point) summary measurement is obtained by taking the geometric mean of the SPECratios:

$$\text{Geometric mean} = \sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}$$
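A C sketch of the SPEC summary calculation: each SPECratio is the reference time divided by the measured time, and the summary is their geometric mean. The function names and the inputs in the check are hypothetical; the n-th root is taken by bisection only so that the sketch needs no math library (a production tool would more likely use exp and log).

```c
#include <assert.h>
#include <stddef.h>

/* SPECratio = reference execution time / measured execution time. */
double spec_ratio(double reference_time, double measured_time)
{
    assert(reference_time > 0 && measured_time > 0);
    return reference_time / measured_time;
}

/* x^n for a non-negative integer n by repeated multiplication. */
static double pow_int(double x, size_t n)
{
    double p = 1.0;
    for (size_t k = 0; k < n; k++)
        p *= x;
    return p;
}

/* Geometric mean = n-th root of the product of the ratios, found by
 * bisection on [0, max(product, 1)], where x^n is increasing. */
double geometric_mean(const double ratios[], size_t n)
{
    assert(n > 0);
    double product = 1.0;
    for (size_t i = 0; i < n; i++) {
        assert(ratios[i] > 0);
        product *= ratios[i];
    }
    double lo = 0.0, hi = product > 1.0 ? product : 1.0;
    for (int iter = 0; iter < 200; iter++) {
        double mid = 0.5 * (lo + hi);
        if (pow_int(mid, n) < product)
            lo = mid;
        else
            hi = mid;
    }
    return 0.5 * (lo + hi);
}
```

For ratios {2, 8} the geometric mean is 4, and for {1, 10, 100} it is 10. A useful property here is that the ratio of two computers' geometric means does not depend on which reference computer was chosen.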
 
SPEC Power Benchmark

• Power consumption of servers at different workload levels, divided into 10% increments, over a period of time.
• SPECpower started with another SPEC benchmark for Java business applications (SPECjbb2005).
• It exercises the processors, caches, and main memory as well as the Java virtual machine, compiler, garbage collector, and pieces of the operating system.
• Performance is measured in throughput, and the units are business operations per second.

$$\text{Overall ssj\_ops per Watt} = \frac{\sum_{i=0}^{10} \text{ssj\_ops}_i}{\sum_{i=0}^{10} \text{power}_i}$$

where ssj_ops_i is the performance at each 10% increment and power_i is the power consumed at each performance level.
Power: Watts (Joules/sec)
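A C sketch of the overall ssj_ops per watt formula, assuming eleven measurement levels (active idle at 0%, then 10% through 100%). The function name is hypothetical, and the data in the check are made-up round numbers, not SPEC results.

```c
#include <assert.h>
#include <stddef.h>

#define SPECPOWER_LEVELS 11 /* 0% (active idle), 10%, ..., 100% load */

/* Overall ssj_ops per watt = (sum of ssj_ops_i) / (sum of power_i),
 * summed over all load levels. */
double overall_ssj_ops_per_watt(const double ssj_ops[SPECPOWER_LEVELS],
                                const double power_watts[SPECPOWER_LEVELS])
{
    double ops_sum = 0.0, power_sum = 0.0;
    for (size_t i = 0; i < SPECPOWER_LEVELS; i++) {
        assert(ssj_ops[i] >= 0 && power_watts[i] > 0);
        ops_sum += ssj_ops[i];
        power_sum += power_watts[i];
    }
    return ops_sum / power_sum;
}
```

Summing before dividing means idle contributes zero operations but still adds to the power total, which is why high idle power lowers the overall score.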
 
SPECpower_ssj2008 for Xeon X5650
 
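§1.10 Fallacies and Pitfalls

Pitfall: Amdahl’s Law

• Improving an aspect of a computer and expecting a proportional improvement in overall performance

$$T_{\text{improved}} = \frac{T_{\text{affected}}}{\text{improvement factor}} + T_{\text{unaffected}}$$

• Example: multiply accounts for 80 s out of a 100 s execution time
  How much improvement in multiply performance is needed to get 5× overall?
  A 5× overall speedup requires T_improved = 100 s / 5 = 20 s:

$$20 = \frac{80}{n} + 20$$

  Can’t be done! The 80/n term would have to be zero, so no finite improvement factor n works.
• Corollary: make the common case fast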
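A C sketch of Amdahl’s Law applied to the multiply example (80 s affected, 20 s unaffected); the function names are hypothetical.

```c
#include <assert.h>

/* Amdahl's Law: T_improved = T_affected / improvement_factor + T_unaffected. */
double amdahl_time(double t_affected, double t_unaffected, double improvement_factor)
{
    assert(t_affected >= 0 && t_unaffected >= 0 && improvement_factor > 0);
    return t_affected / improvement_factor + t_unaffected;
}

/* Overall speedup = old execution time / improved execution time. */
double amdahl_speedup(double t_affected, double t_unaffected, double improvement_factor)
{
    double t_old = t_affected + t_unaffected;
    return t_old / amdahl_time(t_affected, t_unaffected, improvement_factor);
}
```

A 4× faster multiply gives 80/4 + 20 = 40 s, an overall speedup of 2.5. Even a 10^9× faster multiply leaves slightly more than 20 s, so the overall speedup approaches but never reaches 5.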
 
Fallacy: Low Power at Idle

• Look back at the i7 power benchmark
  At 100% load: 258 W
  At 50% load: 170 W (66%)
  At 10% load: 121 W (47%)
• Google data center
  Mostly operates at 10% to 50% load
  At 100% load less than 1% of the time
• Consider designing processors to make power proportional to load
• If future servers used, say, 10% of peak power at 10% workload, we could reduce the electricity bill of datacenters
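A C sketch comparing the measured figures on this slide with a hypothetical perfectly power-proportional server, whose power is assumed to be peak power × load; the function names are illustrative only.

```c
#include <assert.h>

/* Power at some load as a fraction of peak (100% load) power. */
double fraction_of_peak(double power_watts, double peak_watts)
{
    assert(peak_watts > 0 && power_watts >= 0);
    return power_watts / peak_watts;
}

/* Hypothetical ideal: power scales linearly with load (0.0 to 1.0). */
double proportional_power(double peak_watts, double load)
{
    assert(peak_watts > 0 && load >= 0.0 && load <= 1.0);
    return peak_watts * load;
}
```

fraction_of_peak(170, 258) is about 0.66 and fraction_of_peak(121, 258) about 0.47, matching the percentages above. At 10% load the measured server draws 121 W, while the proportional model gives 25.8 W.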
 
Fallacy: Designing for performance and designing for energy efficiency are unrelated

• Energy is power integrated over time; for constant power, Energy = Power × Time
• Hardware or software optimizations that take less time save energy overall, even if the optimization takes a bit more energy when it is used.
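A C sketch of the energy argument with hypothetical numbers: a baseline run averages 100 W for 10 s, and an optimized run draws more power (110 W) but finishes in 8 s. The function name is illustrative.

```c
#include <assert.h>

/* Energy (joules) = average power (watts) x time (seconds). */
double energy_joules(double avg_power_watts, double seconds)
{
    assert(avg_power_watts >= 0 && seconds >= 0);
    return avg_power_watts * seconds;
}
```

The baseline uses 100 × 10 = 1000 J and the optimized run 110 × 8 = 880 J, so the faster version saves energy despite its higher power.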
 
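Pitfall: MIPS as a Performance Metric

• MIPS: Millions of Instructions Per Second
  Doesn’t account for
    - Differences in ISAs between computers
    - Differences in complexity between instructions

$$\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^6} = \frac{\text{Instruction count}}{\frac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}} \times 10^6} = \frac{\text{Clock rate}}{\text{CPI} \times 10^6}$$

• CPI varies between programs on a given CPU, and so does MIPS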
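A C sketch of the pitfall using two hypothetical 4 GHz computers running the same program compiled for different ISAs: computer X needs 2×10^9 simple instructions at CPI 1.0, while computer Y needs 0.5×10^9 more complex instructions at CPI 2.0. All numbers and function names are made up for illustration.

```c
#include <assert.h>

/* Execution time (s) = instruction count x CPI / clock rate (Hz). */
double exec_time_s(double instruction_count, double cpi, double clock_rate_hz)
{
    assert(instruction_count > 0 && cpi > 0 && clock_rate_hz > 0);
    return instruction_count * cpi / clock_rate_hz;
}

/* MIPS = instruction count / (execution time x 10^6). */
double mips(double instruction_count, double seconds)
{
    assert(instruction_count > 0 && seconds > 0);
    return instruction_count / (seconds * 1e6);
}
```

X runs for 0.5 s at 4000 MIPS and Y for 0.25 s at 2000 MIPS: the computer with the lower MIPS rating finishes first. Both MIPS values also equal clock rate / (CPI × 10^6).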
 
§1.9 Concluding Remarks

• Cost/performance is improving
  Due to underlying technology development
• Hierarchical layers of abstraction
  In both hardware and software
• Instruction set architecture
  The hardware/software interface
• Execution time: the best performance measure

$$\text{Execution Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$

• Individually, the factors do not determine performance: only their product is a reliable measure of performance.
 
Concluding Remarks

• Two of the key ideas are
  exploiting parallelism in the program, via multiple processors, and
  exploiting locality of accesses to a memory hierarchy, typically via caches.
• Power is a limiting factor
  Use parallelism to improve performance
• Computer designs are measured by cost and performance, as well as energy, dependability, cost of ownership, and scalability.
 
Chapter 2 Instructions: Language of the Computer
 
§2.1 Introduction

Instruction Set

• The repertoire of instructions of a computer
• Different computers have different instruction sets
  But with many aspects in common
• Early computers had very simple instruction sets
  Simplified implementation
• Many modern computers also have simple instruction sets
 
The ARMv8 Instruction Set

• A subset, called LEGv8, used as the example throughout the book
• Commercialized by ARM Holdings (www.arm.com)
• Large share of embedded core market
  Applications in consumer electronics, network/storage equipment, cameras, printers
• Typical of many modern ISAs
  See ARM Reference Data tear-out card
 
§2.2 Operations of the Computer Hardware

Arithmetic Operations

• Add and subtract, three operands
  Two sources and one destination
  ADD a, b, c  // a gets b + c
• All arithmetic operations have this form
• Each LEGv8 arithmetic instruction performs only one operation and must always have exactly three variables
• Design Principle 1: Simplicity favours regularity
  Regularity makes implementation simpler
  Simplicity enables higher performance at lower cost
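Because each LEGv8 arithmetic instruction performs exactly one operation, adding four variables takes three instructions. The C sketch below (a hypothetical helper, not from the lecture) writes that sequence one operation per statement, with the corresponding three-operand instruction in each comment; variable names stand in for registers, and uint64_t is an assumption used to mimic a 64-bit register.

```c
#include <assert.h>
#include <stdint.h>

/* a = b + c + d + e, written as one operation per statement.
 * uint64_t stands in for a 64-bit register, so overflow wraps. */
uint64_t sum_of_four(uint64_t b, uint64_t c, uint64_t d, uint64_t e)
{
    uint64_t a;
    a = b + c; /* ADD a, b, c : a gets b + c         */
    a = a + d; /* ADD a, a, d : a gets b + c + d     */
    a = a + e; /* ADD a, a, e : a gets b + c + d + e */
    assert(a == b + c + d + e); /* same value as the compound expression */
    return a;
}
```

The compound C expression hides the same three steps; the regular three-operand form is what keeps the hardware simple.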
 
Arithmetic Example

• Compiling two C assignment statements into LEGv8:
  a = b + c;
  d = a - e;
• Compiled LEGv8 code:
  ADD a, b, c
  SUB d, a, e
• Compiling a complex C assignment into LEGv8:
  f = (g + h) - (i + j);
• Compiled LEGv8 code:
  ADD t0, g, h   // temp t0 = g + h
  ADD t1, i, j   // temp t1 = i + j
  SUB f, t0, t1  // f = t0 - t1
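To see the compiled sequence execute, here is a hypothetical toy model of LEGv8 register-register ADD and SUB written in C. It is a sketch, not a real simulator: it models 32 plain 64-bit registers and ignores XZR, flags, and instruction encoding. The register assignments in the check (g, h, i, j in X1 to X4, t0 and t1 in X9 and X10, f in X19) are arbitrary choices for illustration, not the book's register allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_REGS 32

enum opcode { OP_ADD, OP_SUB };

/* One three-operand instruction: rd = rn (op) rm. */
struct instr {
    enum opcode op;
    unsigned rd, rn, rm; /* destination, first source, second source */
};

/* Execute the instructions in order, updating the register file. */
void run_program(const struct instr *prog, size_t count, uint64_t regs[NUM_REGS])
{
    for (size_t i = 0; i < count; i++) {
        const struct instr *in = &prog[i];
        assert(in->rd < NUM_REGS && in->rn < NUM_REGS && in->rm < NUM_REGS);
        uint64_t x = regs[in->rn];
        uint64_t y = regs[in->rm];
        regs[in->rd] = (in->op == OP_ADD) ? x + y : x - y;
    }
}
```

With g = 10, h = 5, i = 3 and j = 2, running {OP_ADD, 9, 1, 2}, {OP_ADD, 10, 3, 4}, {OP_SUB, 19, 9, 10} leaves 15 in t0, 5 in t1 and 10 in f. The two-statement example runs the same way as an ADD followed by a SUB; note that operand order matters for SUB, since rd = rn - rm.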
Uploaded on Aug 01, 2024

