Understanding von Neumann Architecture in Parallel & Distributed Systems
This lecture explores the von Neumann architecture and its components: main memory, the CPU, registers, and data transfer between them. It discusses the von Neumann bottleneck and the modifications made to improve CPU performance, such as caching, virtual memory, and instruction-level parallelism.
Parallel & Distributed Systems Lecture 02 Web site: https://uqu.edu.sa/aamosman/10066 Email: aamosman@uqu.edu.sa
The von Neumann architecture Consists of: 1. Main memory 2. The Central Processing Unit (CPU), also called the processor or core 3. An interconnection between the memory and the CPU.
CPU The CPU is divided into: the Control Unit (CU), which determines which instruction in a program should be executed next, and the Arithmetic and Logic Unit (ALU), which executes the instructions.
Main Memory A collection of locations for storing instructions and data. Each location has a unique address that is used to access it.
Registers Registers are very fast storage locations that reside inside the CPU. Some registers have special tasks; one example is the program counter, which stores the address of the next instruction to be executed.
Data Transfer Data is transferred between main memory and the CPU through a bus (a collection of parallel wires). Originally, the CPU and main memory operated at roughly the same speed.
Bottleneck problem By 2010, CPUs were capable of executing instructions more than one hundred times faster than they could fetch items from main memory. To address this von Neumann bottleneck (and improve effective CPU performance), many modifications have been made to the basic von Neumann architecture.
MODIFICATIONS TO THE VON NEUMANN MODEL Caching is one of the most widely used methods of addressing the von Neumann bottleneck. Caches make it possible for the CPU to quickly access instructions and data that are in main memory.
Cache levels L1: very fast, very small (inside the CPU) L2: slower and larger than L1 L3: the slowest and largest cache. A cache hit occurs when the data the CPU needs is found in the cache; a cache miss occurs when it is not, and the data must be fetched from a slower level or from main memory.
Virtual memory If we run a very large program or a program that accesses very large data sets, all of the instructions and data may not fit into main memory. Virtual memory was developed so that main memory can function as a cache for secondary storage.
Virtual memory keeps in main memory only the active parts of the many running programs; the parts that are idle are kept in a block of secondary storage called swap space. Like CPU caches, virtual memory operates on blocks of data and instructions. These blocks are commonly called pages, and since secondary storage access can be hundreds of thousands of times slower than main memory access, pages are relatively large (page size ranges from 4 to 16 KB).
Instruction-level parallelism (ILP) There are two main approaches to ILP: 1. Pipelining: functional units are arranged in stages 2. Multiple issue: multiple instructions can be simultaneously initiated. Both approaches are used in virtually all modern CPUs.
Pipelining The principle of pipelining is similar to a factory assembly line. Image from http://robohub.org/wp-content/uploads/2014/04
Pipeline example To add the floating point numbers 9.87 × 10^4 and 6.54 × 10^3, we can use the following steps: fetch the operands, compare exponents, shift one operand, add, normalize the result, round the result, and store the result. If each of these operations takes one nanosecond (10^-9 seconds), the addition operation will take seven nanoseconds.
So if we execute the code for (i = 0; i < 1000; i++) z[i] = x[i] + y[i]; the for loop will take something like 7000 nanoseconds.
As an alternative, suppose we divide our floating point adder into seven separate pieces of hardware, or functional units. The first unit will fetch two operands, the second will compare exponents, and so on. Also suppose that the output of one functional unit is the input to the next. Then a single floating point addition will still take seven nanoseconds. However, when we execute the for loop, we can fetch x[1] and y[1] while we're comparing the exponents of x[0] and y[0]. More generally, it's possible for us to simultaneously execute seven different stages in seven different additions.
Pipelined Addition. Numbers in the Table Are Subscripts of Operands/Results
Pipeline performance One floating point addition still takes 7 nanoseconds. But 1000 floating point additions now take 1006 nanoseconds (how?)
Multiple issue Pipelines improve performance by taking individual pieces of hardware, or functional units, and connecting them in sequence. Multiple issue processors instead replicate functional units and try to simultaneously execute different instructions in a program.
For example, if we have two complete floating point adders, we can approximately halve the time it takes to execute the loop for (i = 0; i < 1000; i++) z[i] = x[i] + y[i]; While the first adder is computing z[0], the second can compute z[1]; while the first is computing z[2], the second can compute z[3]; and so on. Note: in order to make use of multiple issue, the system must find instructions that can be executed simultaneously.
References An Introduction to Parallel Programming, Peter Pacheco, Elsevier, 2011. Parallel Programming (in Arabic), Abdelrahman Osman, 2016, https://www.researchgate.net/publication/307955739_Parallel_Programming