Trends in Implicit Parallelism and Microprocessor Architectures


Explore the implications of implicit parallelism in microprocessor architectures, addressing performance bottlenecks in processor, memory system, and datapath components. Prof. Vijay More delves into optimizing resource utilization, diverse architectural executions, and the impact on current computer systems. Discover the scope and varied applications of parallelism in modern technology.





Presentation Transcript


  1. Parallel Programming Platforms. Implicit Parallelism: Trends in Microprocessor Architectures; Limitations of Memory System Performance. Prof. Vijay More, MET, IOE, BKC, Nashik

  2. Scope of Parallelism. Conventional architectures comprise a processor, a memory system, and a datapath. Each of these components presents a considerable performance bottleneck. Parallelism addresses each of these components in significant ways.

  3. Different applications exploit different aspects of parallelism, since their requirements differ. For example:
  - data-intensive applications require high aggregate throughput,
  - server applications require high aggregate network bandwidth,
  - scientific applications typically require high processing and memory system performance.
  It is important to understand each of these performance bottlenecks.

  4. Implicit Parallelism: Trends in Microprocessor Architectures. Microprocessor clock speeds have recorded impressive gains over the past two decades (two to three orders of magnitude). Higher levels of device integration have made available a large number of transistors.

  5. How best to utilize these resources? Current processors use them to provide multiple functional units and execute multiple instructions in the same cycle. The precise manner in which these instructions are selected and executed provides remarkable diversity in architectures. When the parallelism is implicit, the compiler bears increased overhead in extracting it from the code.

  6. Current Generation Computer Systems
  1 BlueGene/L (IBM / LLNL)
  2 Purple (IBM / LLNL)
  3 Q (HP / LANL)
  4 Red Storm (Cray / SNL)
  5 X-1 (Cray / ORNL)
  6 Earth Simulator (NEC / Japan)
  7 Linux Clusters (Various Developers / Various Sites)
  8 System X (Apple / Virginia Tech)
  9 Columbia (SGI / NASA)
  10 MareNostrum (IBM / Spain)

  7. Limitations of Memory System Performance. The memory system, and not processor speed, is often the bottleneck for many applications. Memory system performance is largely captured by two parameters: latency and bandwidth. Latency is the time from the issue of a memory request to the time the data is available at the processor. Bandwidth is the rate at which data can be transferred between the processor and the memory system.

  8. Memory System Performance: Bandwidth and Latency. Consider the example of a fire hose. If the water comes out of the hose two seconds after the hydrant is turned on, the latency of the system is two seconds. Once the water starts flowing, if the hydrant delivers water at the rate of 5 gallons/second, the bandwidth of the system is 5 gallons/second. If you want an immediate response from the hydrant, it is important to reduce latency. If you want to fight big fires, you want high bandwidth.

  9. Memory Latency: An Example. Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns (without caches). Assume that the processor has two multiply-add units and is capable of executing four instructions in each cycle of 1 ns. The following observations follow: the peak processor rating is 4 GFLOPS. Since the memory latency is 100 cycles and the block size is one word, every time a memory request is made the processor must wait 100 cycles before it can process the data.

  10. Memory Latency: An Example. On the above architecture, consider the problem of computing the dot product of two vectors. A dot-product computation performs one multiply-add on a single pair of vector elements, i.e., each floating-point operation requires one data fetch. It follows that the peak speed of this computation is limited to one floating-point operation every 100 ns, or a speed of 10 MFLOPS, a very small fraction of the peak processor rating!
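
  As a concrete illustration (a sketch, not from the slides), the latency-bound loop looks like this in C++; with no cache, each iteration stalls roughly 100 ns fetching its operands, capping the rate near 10 MFLOPS on the architecture above:

  #include <cstddef>

  // Dot product: one multiply-add per pair of elements, so one data
  // fetch per floating-point operation.
  double dot(const double* a, const double* b, std::size_t n) {
      double sum = 0.0;
      for (std::size_t i = 0; i < n; ++i)
          sum += a[i] * b[i];
      return sum;
  }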

  11. FLOPS: floating-point operations per second. Floating-point operations are heavily used by science and engineering computational applications. FLOPS is one measure of system performance, and is written with a prefix:
  mega 10^6 megaflops MFLOPS
  giga 10^9 gigaflops GFLOPS
  tera 10^12 teraflops TFLOPS
  peta 10^15 petaflops PFLOPS

  12. MFLOPS (megaflops): millions of floating-point operations per second. The MFLOPS rating is computed by dividing the number of operations by the execution time. Example: multiplying two n x n matrices takes 2n^3 - n operations (n^2 inner products, each with n multiplications and n - 1 additions). If a 100 x 100 matrix multiplication is computed in 0.35 seconds, then the MFLOPS rate = (2(100)^3 - 100) / 0.35 = 5,714,000 operations per second = 5.714 MFLOPS.
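
  A quick check of this arithmetic (a sketch, not part of the slides):

  #include <iostream>

  // MFLOPS rating of an n x n matrix multiplication, using the slide's
  // operation count 2n^3 - n and 0.35 s execution time.
  int main() {
      double n = 100.0, seconds = 0.35;
      double ops = 2.0 * n * n * n - n;                 // 1,999,900 operations
      std::cout << ops / seconds / 1e6 << " MFLOPS\n";  // ~5.714
  }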

  13. Improving Effective Memory Latency Using Caches. Caches are small, fast memory elements placed between the processor and DRAM. Their important property is low-latency, high-bandwidth storage. If a piece of data is used repeatedly, the cache reduces the effective latency of the memory system. The fraction of data references satisfied by the cache is called the cache hit ratio of the computation on the system.

  14. Impact of Caches: Example. Consider the architecture from the previous example. In this case, we introduce a cache of size 32 KB with a latency of 1 ns, or one cycle. We use this setup to multiply two matrices A and B of dimensions 32 x 32. We have carefully chosen these numbers so that the cache is large enough to store matrices A and B, as well as the result matrix C.

  15. Impact of Caches: Example (continued). Observations: Fetching the two matrices into the cache corresponds to fetching 2K words, which takes approximately 200 µs (2000 x 100 ns). Multiplying two n x n matrices takes 2n^3 operations. For our problem, this corresponds to 64K operations, which can be performed in 16K cycles (or 16 µs) at four instructions per cycle.

  16. The total time for the computation is approximately the time for load/store operations plus the time for the computation itself, = (200 + 16) µs = 216 µs. This corresponds to a peak computation rate of 64K FLOP / 216 µs, or 303 MFLOPS.
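
  The same estimate redone without the slide's rounding (a sketch, not part of the slides):

  #include <iostream>

  // Cached 32 x 32 matrix multiply: fetch A and B (2K words at 100 ns each),
  // then run the 2n^3 operations at 4 per 1 ns cycle.
  int main() {
      double fetch_us   = 2048 * 100e-9 * 1e6;      // ~205 µs; the slide rounds to 200
      double ops        = 2.0 * 32 * 32 * 32;       // 64K operations
      double compute_us = ops / 4.0 * 1e-9 * 1e6;   // ~16 µs
      double mflops     = ops / ((fetch_us + compute_us) * 1e-6) / 1e6;
      std::cout << mflops << " MFLOPS\n";           // ~296; the slide's rounding gives 303
  }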

  17. Theoretical Peak MFLOPS: the number of numerical operations the machine can perform in one second without performing any other task. From the time required to compute one operation, work out how many can be performed in one second. Example: if 8 cycles are needed to perform one numeric multiplication and the cycle time is 20 ns, one multiplication requires 8 x 20 ns = 160 ns. Then (1 multiplication / 160 ns) x (1,000,000,000 ns / 1 sec) = 6.25 x 10^6 multiplications per second = 6.25 MFLOPS.
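
  The same peak calculation as a one-line check (a sketch, not part of the slides):

  #include <iostream>

  // 8 cycles per multiplication at a 20 ns cycle time.
  int main() {
      double op_ns = 8.0 * 20.0;                      // 160 ns per multiplication
      std::cout << 1e9 / op_ns / 1e6 << " MFLOPS\n";  // 6.25
  }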

  18. Impact of Caches. In our example, we had O(n^2) data accesses and O(n^3) computation, which is a desirable situation for caches.

  19. The memory hierarchy (figure): capacity and access time increase moving down the levels, while cost per bit increases moving up.
  Level 0: Registers in CPU
  Level 1: Cache (SRAM)
  Level 2: Main memory (DRAM)
  Level 3: Disk storage (solid state, magnetic)
  Level 4: Backup storage (magnetic tape, optical disk)

  20. Inclusion property and data transfer between adjacent levels of the memory hierarchy (figure):
  1: The CPU registers access the cache (M1) by word (4 or 8 bytes) from a 32-byte cache block, e.g. block a.
  2: The cache accesses main memory (M2) by block (32 bytes) from a memory page of 32 blocks, e.g. block b from page B.
  3: Main memory accesses disk storage (M3) by page (1 KByte) from a file segment, e.g. page A from segment F.
  M4 is the magnetic tape unit (backup storage).

  21. Inclusion, Coherence, and Locality Properties. Information stored in a memory hierarchy satisfies three properties: inclusion, coherence, and locality. The cache memory at the innermost level M1 communicates directly with the registers of the CPU. All information words are stored at the outermost level Mn.

  22. The set inclusion relationship is: M1 ⊆ M2 ⊆ M3 ⊆ ... ⊆ Mn. This shows that all information items are originally stored in the outermost level Mn. At the time of processing, a subset of Mn is copied into Mn-1; similarly, a subset of Mn-1 is copied into Mn-2, and so on.

  23. If an information word is found in Mi, then copies of the same word can also be found in all upper levels Mi+1, Mi+2, ..., Mn, whereas a word found in Mi+1 may not be found in Mi. Thus M1 ⊆ ... ⊆ Mi-1 ⊆ Mi ⊆ Mi+1 ⊆ ... ⊆ Mn.

  24. A word miss in Mi implies that it is also missing from all lower levels Mi-1, Mi-2, ..., M1. The highest level is the backup storage, where everything can be found. Information transfer between the CPU and the cache is in terms of words (8, 16 bytes). The cache (M1) is divided into cache blocks, called cache lines. Each block is 32 bytes (e.g. blocks a and b in the figure).

  25. Main memory (M2) is divided into pages (4 KBytes each). Each page has 128 blocks. Pages are the unit of transfer between main memory and disk storage. (Figure: per-word transfers of 4-8 bytes between CPU registers and cache; 32-byte block transfers between cache and main memory; 4 KByte page transfers between main memory and disk; 512 KByte file transfers.) Scattered pages are organized into segments in disk memory. The size of a segment varies depending on user needs.

  26. Coherence Property: copies of the same information item at successive levels must be consistent. If a word is modified at the cache, it must be updated at all higher levels, and the hierarchy should be maintained in this way. Frequently used information is kept available at the lower levels to minimize access time.

  27. Coherence is maintained in two ways: i. write-through (WT), which demands an immediate update in Mi+1 when a word in Mi is updated; ii. write-back (WB), which delays the update in Mi+1 when the word in Mi is modified (typically until the block is replaced).
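
  A minimal sketch (assumed names and structure, not from the slides) contrasting the two policies for one level Mi backed by the next level Mi+1:

  #include <cstdint>
  #include <unordered_map>

  struct NextLevel { std::unordered_map<std::uint64_t, int> mem; };  // Mi+1

  struct WriteThroughCache {                     // Mi with WT policy
      std::unordered_map<std::uint64_t, int> lines;
      NextLevel* next;
      void write(std::uint64_t addr, int value) {
          lines[addr] = value;
          next->mem[addr] = value;               // update Mi+1 immediately
      }
  };

  struct WriteBackCache {                        // Mi with WB policy
      std::unordered_map<std::uint64_t, int> lines;
      std::unordered_map<std::uint64_t, bool> dirty;
      NextLevel* next;
      void write(std::uint64_t addr, int value) {
          lines[addr] = value;
          dirty[addr] = true;                    // defer the update to Mi+1
      }
      void evict(std::uint64_t addr) {           // write back only on eviction
          if (dirty[addr]) next->mem[addr] = lines[addr];
          lines.erase(addr);
          dirty.erase(addr);
      }
  };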

  28. Locality of Reference Property: a particular portion of the memory address space is accessed frequently by a program during its execution in any time window, e.g. an innermost loop. The locality property has three dimensions: 1. Temporal locality, 2. Spatial locality, 3. Sequential locality.

  29. 1. Temporal Locality: items referenced recently are likely to be referenced again in the near future, e.g. by loops, stacks, temporary variables, or subroutines. Once a loop is entered or a subroutine is called, a small code segment is referenced many times repeatedly. Temporal locality is thus clustered in recently used areas.

  30. 2. Spatial Locality: the tendency of a process to access items whose addresses are near one another, e.g. tables and arrays, whose accesses cluster in particular areas. Program segments containing subroutines and macros are likewise kept together in the same neighbourhood of the memory space.

  31. 3. Sequential Locality: instructions in a program are executed in sequential order unless out-of-order instructions are encountered. The ratio of in-order to out-of-order execution is generally 5 to 1.

  32. Hit Ratio. If the desired information is found in Mi, it is a hit; if it is not found in Mi, it is a miss. The hit ratio hi at Mi is the probability that the information item will be found in Mi; the miss ratio is (1 - hi). The hit ratios of successive levels are a function of memory capacity, program behaviour, etc. Assume h0 = 0 and hn = 1.

  33. Access Frequency. The access frequency of Mi is fi = (1-h1)(1-h2)...(1-hi-1) hi, the probability of an access being served by Mi, i.e. i-1 misses at the lower levels followed by a hit at Mi.

  34. Note that Σ (i = 1 to n) fi = 1 and f1 = h1. Access frequency decreases rapidly due to the locality property, i.e. f1 >> f2 >> ... >> fn, which indicates that the innermost memory levels are accessed far more frequently than the outer levels.
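
  A small sketch (example hit ratios borrowed from the slide-39 exercise) computing the fi from the hi and confirming that they sum to 1:

  #include <cstddef>
  #include <iostream>
  #include <vector>

  // f_i = (1-h1)(1-h2)...(1-h_{i-1}) * h_i, with h_n = 1.
  int main() {
      std::vector<double> h = {0.98, 0.99, 1.0};   // h1, h2, h3 (h_n = 1)
      double missSoFar = 1.0, total = 0.0;
      for (std::size_t i = 0; i < h.size(); ++i) {
          double f = missSoFar * h[i];
          std::cout << "f" << i + 1 << " = " << f << "\n";
          total += f;
          missSoFar *= 1.0 - h[i];
      }
      std::cout << "sum = " << total << "\n";      // 1
  }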

  35. Effective Access Time. Misses at the cache are called block misses; misses at main memory are called page faults. The time penalty for a page fault is much larger than for a block miss: a cache miss is 2 to 4 times as costly as a cache hit, but a page fault is 1000 to 10000 times as costly as a page hit.

  36. The effective access time is Teff = Σ (i = 1 to n) fi.ti = h1.t1 + (1-h1)h2.t2 + (1-h1)(1-h2)h3.t3 + ... + (1-h1)(1-h2)...(1-hn-1).tn. The effective access time still depends on program behaviour and memory hierarchy design.

  37. Hierarchical Optimization. The total cost is Ctotal = Σ (i = 1 to n) ci.si. Since the unit costs satisfy c1 > c2 > ... > cn, we select capacities s1 < s2 < ... < sn. The memory hierarchy should then yield a Teff (effective access time) close to the t1 of M1.

  38. The effective access time Teff = Σ (i = 1 to n) fi.ti should be minimized, bringing it close to t1, while Ctotal must remain less than C0, the ceiling on total cost. There are tradeoffs at all levels, and the design amounts to a constrained optimization (linear programming) problem (LPP).

  39. Example: design a three-level memory hierarchy for the given parameters.

  Memory level  | Access time  | Capacity        | Cost per KByte
  Cache         | t1 = 25 ns   | s1 = 512 KBytes | c1 = $0.12
  Main memory   | t2 = unknown | s2 = 32 MBytes  | c2 = $0.02
  Disk          | t3 = 4 ms    | s3 = unknown    | c3 = $0.00002

  The goal is to achieve an effective access time t = 850 ns, with cache hit ratio h1 = 0.98 and main-memory hit ratio h2 = 0.99. The cost ceiling C0 is $1500.

  40. Effective access time: t = h1.t1 + (1-h1)h2.t2 + (1-h1)(1-h2)h3.t3
  850 x 10^-9 = 0.98 x 25 x 10^-9 + (1-0.98) x 0.99 x t2 + (1-0.98)(1-0.99) x 1 x 4 x 10^-3
  gives t2 = 1250 ns.
  Cost: 1500 = 0.12 x 512 + 0.02 x 32 x 10^3 + 0.00002 x s3
  gives s3 = 40 GBytes maximum.
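
  As a quick check (a sketch, not part of the slides), the slide-39 parameters can be plugged into the Teff and cost formulas. The slide rounds intermediate terms to reach t2 = 1250 ns; the unrounded value is slightly higher:

  #include <iostream>

  // Solve the slide-39/40 design: t2 from the Teff equation, s3 from the cost ceiling.
  int main() {
      double h1 = 0.98, h2 = 0.99, t1 = 25e-9, t3 = 4e-3, t = 850e-9;
      double t2 = (t - h1 * t1 - (1 - h1) * (1 - h2) * t3) / ((1 - h1) * h2);
      std::cout << "t2 = " << t2 * 1e9 << " ns\n";           // ~1288 ns unrounded

      double c0 = 1500.0;                                     // cost ceiling in dollars
      double s3 = (c0 - 0.12 * 512 - 0.02 * 32e3) / 0.00002;  // capacity in KBytes
      std::cout << "s3 = " << s3 / 1e6 << " GBytes\n";        // ~40 GBytes
  }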


  42. Memory Design Implications. Temporal locality motivates the least recently used (LRU) replacement algorithm, sketched below. Spatial locality determines the size of the unit of data transfer between adjacent memory levels. Sequential locality affects grain packing for optimal scheduling; prefetching is heavily used.
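
  A minimal sketch of LRU replacement for a single cache level (illustrative names and structure, not from the slides):

  #include <cstddef>
  #include <cstdint>
  #include <list>
  #include <unordered_map>

  // LRU cache of block addresses: a hit moves the block to the front of
  // the recency list; a miss inserts it, evicting the least recently
  // used block when the cache is full.
  class LRUCache {
      std::size_t capacity;
      std::list<std::uint64_t> order;   // front = most recently used
      std::unordered_map<std::uint64_t, std::list<std::uint64_t>::iterator> pos;
  public:
      explicit LRUCache(std::size_t cap) : capacity(cap) {}
      bool access(std::uint64_t block) {
          auto it = pos.find(block);
          if (it != pos.end()) {                        // hit: refresh recency
              order.splice(order.begin(), order, it->second);
              return true;
          }
          if (order.size() == capacity) {               // miss: evict LRU block
              pos.erase(order.back());
              order.pop_back();
          }
          order.push_front(block);
          pos[block] = order.begin();
          return false;
      }
  };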

  43. Impact of Memory Bandwidth. Memory bandwidth is determined by the bandwidth of the memory bus as well as of the memory units. Memory bandwidth can be improved by increasing the size of memory blocks. Consider the same setup as before, except that the block size is now 4 words instead of 1 word. We repeat the dot-product computation in this scenario: assuming that the vectors are laid out linearly in memory, eight FLOPs (four multiply-adds) can be performed in 200 cycles.

  44. This is because a single memory access fetches four consecutive words of the vector, so two accesses can fetch four elements of each of the two vectors. This corresponds to one FLOP every 25 ns, for a peak speed of 40 MFLOPS. It is important to note that increasing the block size does not change the latency of the system. Physically, this scenario can be viewed as a wide data bus (4 words, or 128 bits) connected to multiple memory banks. In practice, such wide buses are expensive to construct. In a more practical system, consecutive words are sent on the memory bus on subsequent bus cycles after the first word is retrieved.

  45. Data access patterns are one factor that can be exploited to improve performance. With spatial locality of reference, consecutive data words in memory are used by successive instructions. If we take a data-layout-centric view, computations must be reordered to enhance spatial locality of reference.

  46. Consider the following code fragment:

  for (i = 0; i < 1000; i++)
      column_sum[i] = 0.0;
  for (i = 0; i < 1000; i++)
      for (j = 0; j < 1000; j++)
          column_sum[i] += b[j][i];

  The code fragment sums the columns of the matrix b into the vector column_sum. For quick reference on data layout, note that a 2D array is stored row by row:

  int a[2][3];
  for (int i = 0; i < 2; i++) {
      for (int j = 0; j < 3; j++)
          cout << "\t" << &a[i][j];
      cout << endl;
  }
  -----------------------------------------------------------
  0x7fff9987c700  0x7fff9987c704  0x7fff9987c708
  0x7fff9987c70c  0x7fff9987c710  0x7fff9987c714

  47. We can fix the above code as follows:

  for (i = 0; i < 1000; i++)
      column_sum[i] = 0.0;
  for (j = 0; j < 1000; j++)
      for (i = 0; i < 1000; i++)
          column_sum[i] += b[j][i];

  In this case, the matrix is traversed in row order, and performance can be expected to be significantly better.
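
  As a runnable sketch (the harness and names are illustrative, not part of the slides), the two loop orders can be timed against each other:

  #include <algorithm>
  #include <chrono>
  #include <iostream>
  #include <vector>

  // Time the column-order (slide 46) vs row-order (slide 47) traversals.
  constexpr int N = 1000;

  int main() {
      std::vector<std::vector<double>> b(N, std::vector<double>(N, 1.0));
      std::vector<double> column_sum(N);
      auto run = [&](bool row_order) {
          std::fill(column_sum.begin(), column_sum.end(), 0.0);
          auto start = std::chrono::steady_clock::now();
          if (row_order)
              for (int j = 0; j < N; j++)          // j outer: rows traversed in order
                  for (int i = 0; i < N; i++)
                      column_sum[i] += b[j][i];
          else
              for (int i = 0; i < N; i++)          // i outer: strides down columns
                  for (int j = 0; j < N; j++)
                      column_sum[i] += b[j][i];
          return std::chrono::duration<double, std::milli>(
              std::chrono::steady_clock::now() - start).count();
      };
      std::cout << "column-order: " << run(false) << " ms\n";
      std::cout << "row-order:    " << run(true) << " ms\n";
  }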
