Understanding Computer Abstraction and Performance Metrics
Computer abstraction, instruction count, CPI, and performance metrics such as clock cycles and CPU time are crucial concepts in computer organization. Through examples and detailed explanations, this lecture explores how the instruction set architecture, compiler, and algorithm affect performance. The presentation also discusses the power wall and its implications in the evolving computing landscape.
COSC 3406: Computer Organization, Lecture 3: Computer Abstraction. Kalpdrum Passi, Fall 2016 (www.cs.laurentian.ca/kpassi/cosc3406.html)
Chapter 1: Computer Abstractions and Technology
Instruction Count and CPI
Clock Cycles = Instruction Count × Cycles per Instruction (CPI)
CPU Time = Instruction Count × CPI × Clock Cycle Time = (Instruction Count × CPI) / Clock Rate
Instruction count for a program is determined by the program, the ISA, and the compiler.
Average cycles per instruction is determined by the CPU hardware.
If different instructions have different CPIs, the average CPI is affected by the instruction mix.
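As a quick illustration, the CPU time relation can be evaluated in a few lines of C; the instruction count, CPI, and clock rate below are placeholder values, not figures from the slides.

#include <stdio.h>

/* CPU Time = Instruction Count x CPI x Clock Cycle Time.
 * Illustrative values only; not taken from the slides. */
int main(void) {
    double instruction_count = 1.0e9;   /* instructions executed by the program */
    double cpi = 2.0;                   /* average clock cycles per instruction */
    double clock_rate = 4.0e9;          /* 4 GHz, so clock cycle time = 1 / clock_rate */

    double clock_cycles = instruction_count * cpi;
    double cpu_time = clock_cycles / clock_rate;   /* seconds */

    printf("Clock cycles: %.3e\n", clock_cycles);
    printf("CPU time: %.3f s\n", cpu_time);
    return 0;
}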
CPI Example
Computer A: Cycle Time = 250 ps, CPI = 2.0
Computer B: Cycle Time = 500 ps, CPI = 1.2
Same ISA. Which is faster, and by how much?
CPU Time_A = Instruction Count × CPI_A × Cycle Time_A = I × 2.0 × 250 ps = I × 500 ps   (A is faster)
CPU Time_B = Instruction Count × CPI_B × Cycle Time_B = I × 1.2 × 500 ps = I × 600 ps
CPU Time_B / CPU Time_A = (I × 600 ps) / (I × 500 ps) = 1.2   (A is faster by this much)
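The A-versus-B comparison can be reproduced with a short C sketch; the CPIs and cycle times are those given on the slide, and the common instruction count I cancels out of the ratio.

#include <stdio.h>

/* CPU time per instruction = CPI x cycle time; the shared instruction
 * count cancels when the two machines are compared. */
int main(void) {
    double cpi_a = 2.0, cycle_a_ps = 250.0;   /* Computer A, from the slide */
    double cpi_b = 1.2, cycle_b_ps = 500.0;   /* Computer B, from the slide */

    double time_a = cpi_a * cycle_a_ps;       /* ps per instruction */
    double time_b = cpi_b * cycle_b_ps;

    printf("A: %.0f ps/instruction, B: %.0f ps/instruction\n", time_a, time_b);
    printf("A is faster by a factor of %.1f\n", time_b / time_a);
    return 0;
}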
CPI in More Detail
If different instruction classes take different numbers of cycles:
Clock Cycles = Σ (i = 1 to n) CPI_i × Instruction Count_i
Weighted average CPI:
CPI = Clock Cycles / Instruction Count = Σ (i = 1 to n) CPI_i × (Instruction Count_i / Instruction Count)
where Instruction Count_i / Instruction Count is the relative frequency of class i.
CPI Example
Alternative compiled code sequences using instructions in classes A, B, and C:

Class   CPI for class   IC in sequence 1   IC in sequence 2
A       1               2                  4
B       2               1                  1
C       3               2                  1

Sequence 1: IC = 5; Clock Cycles = 2×1 + 1×2 + 2×3 = 10; Avg. CPI = 10/5 = 2.0
Sequence 2: IC = 6; Clock Cycles = 4×1 + 1×2 + 1×3 = 9; Avg. CPI = 9/6 = 1.5
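A small C sketch that reproduces the calculation above; the arrays simply encode the class table from the slide.

#include <stdio.h>

/* CPI per class and instruction counts, taken from the slide's table. */
int main(void) {
    const double class_cpi[] = {1.0, 2.0, 3.0};      /* classes A, B, C */
    const double seq1_ic[]   = {2.0, 1.0, 2.0};
    const double seq2_ic[]   = {4.0, 1.0, 1.0};
    const double *seqs[] = {seq1_ic, seq2_ic};

    for (int s = 0; s < 2; s++) {
        double cycles = 0.0, ic = 0.0;
        for (int i = 0; i < 3; i++) {
            cycles += class_cpi[i] * seqs[s][i];     /* sum of CPI_i x IC_i */
            ic     += seqs[s][i];
        }
        printf("Sequence %d: IC = %.0f, cycles = %.0f, avg CPI = %.2f\n",
               s + 1, ic, cycles, cycles / ic);
    }
    return 0;
}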
Performance Summary
The BIG Picture:
CPU Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)
Performance depends on:
Algorithm: affects IC, possibly CPI
Programming language: affects IC, CPI
Compiler: affects IC, CPI
Instruction set architecture: affects IC, CPI, Tc
1.7 The Power Wall
Power Trends
Power provides a limit to what we can cool. In the post-PC era the really valuable resource is energy.
The dominant technology for integrated circuits is CMOS (complementary metal oxide semiconductor).
Power Trends
The primary source of energy consumption is so-called dynamic energy, that is, energy consumed when transistors switch state from 0 to 1 and vice versa. The dynamic energy depends on the capacitive loading of each transistor and the voltage applied:
Energy ∝ Capacitive load × Voltage²
This is the energy of a pulse during the logic transition 0→1→0 or 1→0→1. The power required per transistor is the product of the energy of a transition and the frequency of transitions:
Power ∝ Capacitive load × Voltage² × Frequency
Over roughly 30 years, clock frequency grew by a factor of about 1000 while supply voltage dropped from 5 V to 1 V.
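A hedged C sketch of the dynamic-power relation; the constant of proportionality is dropped, and the halved parameters are placeholders chosen only to show that voltage has a quadratic effect.

#include <stdio.h>

/* Dynamic power is proportional to C * V^2 * f; the constant of
 * proportionality is ignored, so only ratios are meaningful here. */
static double relative_power(double cap, double voltage, double freq) {
    return cap * voltage * voltage * freq;
}

int main(void) {
    /* Placeholder design changes: halving the voltage matters far more
     * than halving capacitance or frequency alone. */
    double base = relative_power(1.0, 1.0, 1.0);
    printf("Half capacitance: %.2fx\n", relative_power(0.5, 1.0, 1.0) / base);
    printf("Half frequency:   %.2fx\n", relative_power(1.0, 1.0, 0.5) / base);
    printf("Half voltage:     %.2fx\n", relative_power(1.0, 0.5, 1.0) / base);
    return 0;
}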
Reducing Power
Suppose a new CPU has:
85% of the capacitive load of the old CPU
15% voltage reduction and 15% frequency reduction
P_new / P_old = (C_old × 0.85) × (V_old × 0.85)² × (F_old × 0.85) / (C_old × V_old² × F_old) = 0.85⁴ ≈ 0.52
The power wall:
We can't reduce voltage further
We can't remove more heat
How else can we improve performance?
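The 0.85⁴ ratio above can be checked with a couple of lines of C (the 0.85 factors come straight from the slide).

#include <stdio.h>

/* P_new / P_old = (0.85 C) * (0.85 V)^2 * (0.85 f) / (C * V^2 * f) = 0.85^4 */
int main(void) {
    double ratio = 0.85 * 0.85 * 0.85 * 0.85;
    printf("P_new / P_old = %.2f\n", ratio);   /* prints 0.52 */
    return 0;
}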
1.8 The Sea Change: The Switch to Multiprocessors
Uniprocessor Performance
Constrained by power, instruction-level parallelism, and memory latency.
Multiprocessors
Multicore microprocessors: more than one processor per chip
Requires explicitly parallel programming
Compare with instruction-level parallelism:
- Hardware executes multiple instructions at once
- Hidden from the programmer
Parallel programming is hard to do:
- Programming for performance
- Load balancing
- Optimizing communication and synchronization
SPEC CPU Benchmark
Programs used to measure performance, supposedly typical of actual workloads
System Performance Evaluation Cooperative (SPEC) develops benchmarks for CPU, I/O, Web, …
SPEC CPU2006: elapsed time to execute a selection of programs
- Negligible I/O, so it focuses on CPU performance
Dividing the execution time on a reference processor by the execution time on the evaluated computer normalizes the measurements; this normalization yields a measure called the SPECratio. The SPECratio is the inverse of execution time, so a larger SPECratio means better performance.
CINT2006 for Intel Core i7 920
A CINT2006 (integer) or CFP2006 (floating-point) summary measurement is obtained by taking the geometric mean of the SPECratios:
Geometric mean = (Π (i = 1 to n) Execution time ratio_i)^(1/n)
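A minimal C sketch of the geometric mean; the four ratios are invented placeholders, not the published Core i7 920 SPECratios. Compile with -lm for the math library.

#include <stdio.h>
#include <math.h>

/* Geometric mean = (product of n ratios)^(1/n).
 * The ratios below are placeholders, not real SPEC results. */
int main(void) {
    double specratio[] = {11.5, 29.0, 17.0, 23.5};
    int n = sizeof(specratio) / sizeof(specratio[0]);

    double log_sum = 0.0;
    for (int i = 0; i < n; i++)
        log_sum += log(specratio[i]);   /* summing logs avoids overflowing the product */

    printf("Geometric mean = %.2f\n", exp(log_sum / n));
    return 0;
}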
SPEC Power Benchmark
Measures power consumption of servers at different workload levels, divided into 10% increments, over a period of time.
SPECpower started with another SPEC benchmark for Java business applications (SPECJBB2005). It exercises the processors, caches, and main memory as well as the Java virtual machine, compiler, garbage collector, and pieces of the operating system.
Performance is measured in throughput, and the units are business operations per second (ssj_ops).
Overall ssj_ops per Watt = (Σ (i = 0 to 10) ssj_ops_i) / (Σ (i = 0 to 10) power_i)
where ssj_ops_i is the performance at each 10% increment and power_i is the power consumed at that performance level.
Power: Watts (Joules/sec)
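A C sketch of the overall ssj_ops-per-Watt calculation over the eleven load levels (0% to 100% in 10% steps); all the throughput and power numbers are illustrative placeholders, not measured SPECpower results.

#include <stdio.h>

/* Overall ssj_ops per Watt = (sum of ssj_ops_i) / (sum of power_i),
 * summed over the 11 load levels 0%, 10%, ..., 100%.
 * All values below are placeholders, not real measurements. */
int main(void) {
    double ssj_ops[11];
    double power_w[11];

    for (int i = 0; i <= 10; i++) {
        ssj_ops[i] = 30000.0 * i;        /* throughput roughly tracks load */
        power_w[i] = 120.0 + 14.0 * i;   /* power does not fall to zero at idle */
    }

    double ops_sum = 0.0, power_sum = 0.0;
    for (int i = 0; i <= 10; i++) {
        ops_sum   += ssj_ops[i];
        power_sum += power_w[i];
    }
    printf("Overall ssj_ops per Watt = %.1f\n", ops_sum / power_sum);
    return 0;
}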
1.10 Fallacies and Pitfalls
Pitfall: Amdahl's Law
Improving one aspect of a computer and expecting a proportional improvement in overall performance.
T_improved = T_affected / improvement factor + T_unaffected
Example: multiply accounts for 80 s of a 100 s program. How much improvement in multiply performance is needed to get a 5× overall speedup?
A 5× speedup requires a total time of 100/5 = 20 s, so 20 = 80/n + 20, which has no solution. It can't be done!
Corollary: make the common case fast.
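The multiply example can be checked with a short C sketch of Amdahl's Law; the 80 s / 20 s split comes from the slide, while the loop over improvement factors is only for illustration.

#include <stdio.h>

/* T_improved = T_affected / improvement_factor + T_unaffected */
static double improved_time(double t_affected, double t_unaffected, double factor) {
    return t_affected / factor + t_unaffected;
}

int main(void) {
    double t_affected = 80.0, t_unaffected = 20.0;   /* seconds, from the slide */
    double original = t_affected + t_unaffected;

    for (double factor = 2.0; factor <= 1024.0; factor *= 2.0) {
        double t = improved_time(t_affected, t_unaffected, factor);
        printf("multiply %4.0fx faster -> total %.1f s (%.2fx overall)\n",
               factor, t, original / t);
    }
    /* The overall speedup approaches but never reaches 100/20 = 5x. */
    return 0;
}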
Fallacy: Low Power at Idle
Look back at the i7 power benchmark:
At 100% load: 258 W
At 50% load: 170 W (66% of peak)
At 10% load: 121 W (47% of peak)
Google data centers:
Mostly operate at 10%–50% load
Run at 100% load less than 1% of the time
Consider designing processors to make power proportional to load. If future servers used, say, 10% of peak power at 10% workload, we could reduce the electricity bill of data centers.
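A short C sketch that compares the i7 figures above with a hypothetical load-proportional design (the 258 W, 170 W, and 121 W values come from the slide).

#include <stdio.h>

/* Compare measured power at each load level with what a perfectly
 * load-proportional design would draw. Measured figures are from the slide. */
int main(void) {
    double load[]    = {1.00, 0.50, 0.10};
    double power_w[] = {258.0, 170.0, 121.0};
    double peak = power_w[0];

    for (int i = 0; i < 3; i++) {
        double proportional = peak * load[i];
        printf("%3.0f%% load: measured %.0f W (%.0f%% of peak), proportional would be %.0f W\n",
               load[i] * 100.0, power_w[i], 100.0 * power_w[i] / peak, proportional);
    }
    return 0;
}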
Fallacy: Designing for performance and designing for energy efficiency are unrelated goals.
Since energy is power integrated over time, hardware or software optimizations that take less time save energy overall, even if the optimization uses a bit more energy when it is applied.
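A tiny C sketch of that argument; the wattages and runtimes are invented solely to illustrate that energy is power times time.

#include <stdio.h>

/* Energy (joules) = power (watts) * time (seconds).
 * Both scenarios below are hypothetical. */
int main(void) {
    double baseline_energy  = 50.0 * 10.0;   /* 50 W for 10 s = 500 J */
    double optimized_energy = 60.0 * 7.0;    /* 60 W for 7 s  = 420 J */

    printf("Baseline:  %.0f J\n", baseline_energy);
    printf("Optimized: %.0f J\n", optimized_energy);
    /* The optimization draws more power but finishes sooner, so it uses less energy. */
    return 0;
}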
Pitfall: MIPS as a Performance Metric
MIPS: Millions of Instructions Per Second
Doesn't account for:
- Differences in ISAs between computers
- Differences in complexity between instructions
MIPS = Instruction count / (Execution time × 10⁶)
     = Instruction count / ((Instruction count × CPI / Clock rate) × 10⁶)
     = Clock rate / (CPI × 10⁶)
CPI varies between programs on a given CPU, and so does MIPS.
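A brief C sketch showing why MIPS can mislead; the two machines, their instruction counts, and their CPIs are hypothetical.

#include <stdio.h>

/* MIPS = clock_rate / (CPI * 10^6); it ignores how much work each instruction does. */
static double mips(double clock_rate_hz, double cpi) {
    return clock_rate_hz / (cpi * 1.0e6);
}

int main(void) {
    /* Hypothetical: machine X needs more, simpler instructions than machine Y
     * for the same program, so its higher MIPS does not mean it is faster. */
    double ic_x = 8.0e9, cpi_x = 1.0, clock_x = 4.0e9;
    double ic_y = 4.0e9, cpi_y = 1.5, clock_y = 4.0e9;

    double time_x = ic_x * cpi_x / clock_x;   /* CPU time in seconds */
    double time_y = ic_y * cpi_y / clock_y;

    printf("X: %.0f MIPS, CPU time %.2f s\n", mips(clock_x, cpi_x), time_x);
    printf("Y: %.0f MIPS, CPU time %.2f s\n", mips(clock_y, cpi_y), time_y);
    return 0;
}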
1.9 Concluding Remarks
Cost/performance is improving, due to underlying technology development
Hierarchical layers of abstraction, in both hardware and software
Instruction set architecture: the hardware/software interface
Execution time: the best performance measure
Execution Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)
Individually the factors do not determine performance; only their product is a reliable measure of performance.
Concluding Remarks
Two of the key ideas are exploiting parallelism in the program, via multiple processors, and exploiting locality of accesses to a memory hierarchy, typically via caches.
Power is a limiting factor; use parallelism to improve performance.
Computer designs are measured by cost and performance, as well as energy, dependability, cost of ownership, and scalability.
Chapter 2 Instructions: Language of the Computer
2.1 Introduction
Instruction Set
The repertoire of instructions of a computer
Different computers have different instruction sets, but with many aspects in common
Early computers had very simple instruction sets, which simplified implementation
Many modern computers also have simple instruction sets
The ARMv8 Instruction Set
A subset, called LEGv8, is used as the example throughout the book
Commercialized by ARM Holdings (www.arm.com)
Large share of the embedded core market
Applications in consumer electronics, network/storage equipment, cameras, printers, …
Typical of many modern ISAs
See the ARM Reference Data tear-out card
2.2 Operations of the Computer Hardware
Arithmetic Operations
Add and subtract have three operands: two sources and one destination
ADD a, b, c // a gets b + c
All arithmetic operations have this form
Each LEGv8 arithmetic instruction performs only one operation and must always have exactly three variables
Design Principle 1: Simplicity favours regularity
Regularity makes implementation simpler
Simplicity enables higher performance at lower cost
Arithmetic Example
Compiling two C assignment statements into LEGv8:
a = b + c;
d = a - e;
Compiled LEGv8 code:
ADD a, b, c
SUB d, a, e
Compiling a complex C assignment into LEGv8:
f = (g + h) - (i + j);
Compiled LEGv8 code:
ADD t0, g, h // temp t0 = g + h
ADD t1, i, j // temp t1 = i + j
SUB f, t0, t1 // f = t0 - t1