Insights into Matrix Multiplication Performance Optimization

 
Fat by Thin Matrix Multiplication
 
Thomas Hines
Tennessee Tech University
 
Fat and Thin Matrices
 
Fat – Few rows and many columns
Thin – Many rows and few columns
Fat by Thin – k >> m, n
C is very small compared to A and B
 
Diagram: C = A * B, with a fat A, a thin B, and a small C
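
As a rough illustration of this size imbalance, the following sketch computes the memory footprints for the fat-by-thin endpoint of the sweep used later (m = n = 64, k = 16 777 216); the exact sizes are taken from the experimental setup, everything else is illustrative.

#include <stdio.h>

/* Memory-footprint comparison for a fat-by-thin DGEMM with
 * m = n = 64 and k = 16 777 216, in double precision. */
int main(void) {
    long m = 64, n = 64, k = 16777216;
    double gib = 1024.0 * 1024.0 * 1024.0;
    printf("A (m x k): %.2f GiB\n", m * k * sizeof(double) / gib);     /* ~8 GiB  */
    printf("B (k x n): %.2f GiB\n", k * n * sizeof(double) / gib);     /* ~8 GiB  */
    printf("C (m x n): %.2f KiB\n", m * n * sizeof(double) / 1024.0);  /* 32 KiB  */
    return 0;
}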
 
Experimental Setup
 
Ivybridge node and Broadwell node
Ivybridge E5-2680v2 x 2 with 20 total cores
Broadwell E5-2680v4 x 2 with 28 total cores
Library versions
Intel MKL 2019.0.3
BLIS 0.5.1
OpenBLAS 0.3.6
Call DGEMM
Best of 10 runs
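
A minimal sketch of the measurement loop, assuming the CBLAS interface (cblas_dgemm) that all three libraries expose; the timer choice, matrix initialization, and output format are illustrative, not taken from the slides.

#include <cblas.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Best-of-10 timing of one DGEMM problem size.  Matrices are left
 * uninitialized purely for brevity; a real benchmark should fill
 * A and B with data first. */
static double bench_dgemm(int m, int k, int n) {
    double *A = malloc((size_t)m * k * sizeof *A);
    double *B = malloc((size_t)k * n * sizeof *B);
    double *C = malloc((size_t)m * n * sizeof *C);
    double best = 1e300;
    for (int run = 0; run < 10; run++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, 1.0, A, k, B, n, 0.0, C, n);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        if (s < best) best = s;
    }
    free(A); free(B); free(C);
    return best;
}

int main(void) {
    int m = 64, n = 64, k = 16777216;   /* fat-by-thin endpoint of the sweep */
    double s = bench_dgemm(m, k, n);
    printf("best: %.3f s, %.1f GFLOP/s\n", s, 2.0 * m * n * k / s / 1e9);
    return 0;
}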
 
Experimental Setup

Sweep from fat by thin to square
Start: (m, k, n) = (64, 16 777 216, 64)
End: (m, k, n) = (4096, 4096, 4096)
Kept the total number of multiplications constant: m · k · n = 4096³
Rough performance predictor
Time to load A and B from RAM + time to multiply
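
One way to generate such a sweep (a sketch; the slides give only the two endpoints, so the intermediate sizes and the doubling schedule here are illustrative). The rough predictor mentioned above can be read as roughly 8·(m·k + k·n) bytes / RAM bandwidth + 2·m·k·n flops / peak flop rate.

#include <stdio.h>

/* Sweep from fat-by-thin (64, 16 777 216, 64) toward square
 * (4096, 4096, 4096), doubling m = n at each step and shrinking k
 * so that m*k*n stays at 4096^3. */
int main(void) {
    long long total = 4096LL * 4096LL * 4096LL;
    for (long long m = 64; m <= 4096; m *= 2) {
        long long n = m;
        long long k = total / (m * n);  /* exact: all sizes are powers of two */
        printf("m=%lld k=%lld n=%lld\n", m, k, n);
    }
    return 0;
}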
 
Ivybridge 20 threads sweep
 
Graph of Ivybridge sweep
 
Broadwell 28 threads sweep
 
Graph of Broadwell sweep
 
Multi-threaded vs. Single-threaded

The next step is to look at single-threaded performance

But what if a global resource is the bottleneck?
RAM bandwidth

Run a single-threaded DGEMM on each core at the same time
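
A sketch of that experiment: OpenMP launches one DGEMM per core, each on its own private copies of A, B, and C. The use of OpenMP, and the environment variables that pin each library to one thread, are assumptions about how it could be run, not details from the slides.

#include <cblas.h>
#include <omp.h>
#include <stdlib.h>

/* Each OpenMP thread runs its own single-threaded DGEMM on private
 * matrices, all at the same time, so shared resources (in particular
 * RAM bandwidth) are loaded much as in the multi-threaded case.
 * The BLAS library itself must be limited to one thread per call,
 * e.g. MKL_NUM_THREADS=1, BLIS_NUM_THREADS=1, or OPENBLAS_NUM_THREADS=1. */
void independent_dgemms(int m, int k, int n, int ncores) {
    #pragma omp parallel num_threads(ncores)
    {
        double *A = malloc((size_t)m * k * sizeof *A);
        double *B = malloc((size_t)k * n * sizeof *B);
        double *C = malloc((size_t)m * n * sizeof *C);
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, 1.0, A, k, B, n, 0.0, C, n);
        free(A); free(B); free(C);
    }
}

Timing this region with all cores active, and comparing against a single core in isolation, shows whether a shared resource such as RAM bandwidth is the real bottleneck.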
 
Ivybridge 20 independent threads
 
Broadwell 28 independent threads
 
Rough “Fix”
 
Single threaded performance is high in the fat by thin region
Turn the multiplication into independent smaller multiplications
Split A and B in the k dimension
Adds extra work
Extra memory for temporary C’s
Sum reduction of the temporary C’s
However, C is small
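
A sketch of this rough fix, assuming OpenMP and the CBLAS interface: each thread multiplies its own k-slice of A and B into a private temporary C, and the temporaries are summed at the end. Function name and partitioning details are illustrative.

#include <cblas.h>
#include <stdlib.h>
#include <string.h>

/* C (m x n) = A (m x k) * B (k x n), row-major, split along the k
 * dimension into nparts independent smaller multiplications, each
 * writing to its own temporary C.  Each cblas_dgemm call should run
 * single-threaded (library limited to 1 thread per call). */
void dgemm_split_k(int m, int k, int n, const double *A, const double *B,
                   double *C, int nparts) {
    double *Ctmp = calloc((size_t)nparts * m * n, sizeof *Ctmp);
    #pragma omp parallel for num_threads(nparts)
    for (int p = 0; p < nparts; p++) {
        int k0 = (int)((long long)k * p / nparts);
        int k1 = (int)((long long)k * (p + 1) / nparts);
        /* multiply the slice A[:, k0:k1] by B[k0:k1, :] into this
           part's private temporary C */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k1 - k0, 1.0,
                    A + k0, k,
                    B + (size_t)k0 * n, n,
                    0.0, Ctmp + (size_t)p * m * n, n);
    }
    /* sum reduction of the temporary C's; cheap because C is small */
    memset(C, 0, (size_t)m * n * sizeof *C);
    for (int p = 0; p < nparts; p++)
        for (size_t i = 0; i < (size_t)m * n; i++)
            C[i] += Ctmp[(size_t)p * m * n + i];
    free(Ctmp);
}

For the fat-by-thin sizes, the nparts extra m x n buffers and the final reduction are tiny compared with the cost of streaming A and B from RAM.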
 
Ivybridge default vs duplicate C
 
Broadwell default vs duplicate C
 
BLIS Loop Multi-threading

Ivybridge with 20 threads parallelizes the loops as: 1, 1, 10, 2, 1
The third loop partitions in the m dimension
If m = 64, then the effective MC is 6.4 (rounds to 8), a thin sliver of A
The preferred MC size is 96
MC will be smaller than the preferred size when m < 960
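
For experimentation, the per-loop thread factorization can be overridden; this is a sketch assuming bli_thread_set_ways(), the runtime call described in BLIS's multithreading documentation (the same factors can also be set via the BLIS_JC_NT / BLIS_PC_NT / BLIS_IC_NT / BLIS_JR_NT / BLIS_IR_NT environment variables). The alternative factorization shown is purely illustrative.

#include <blis.h>

/* Override BLIS's per-loop thread factorization at runtime.
 * Arguments are (jc, pc, ic, jr, ir). */
int main(void) {
    /* Default-style Ivybridge factorization: 10-way in the m (3rd-loop)
       dimension, 2-way in the jr loop. */
    bli_thread_set_ways(1, 1, 10, 2, 1);
    /* ... call dgemm and time it ... */

    /* One possible alternative that partitions the small m dimension
       less (not the duplicate-C scheme itself, which the standard loops
       cannot express because the k loop carries a reduction). */
    bli_thread_set_ways(1, 1, 2, 10, 1);
    /* ... call dgemm and time it ... */
    return 0;
}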
 
Duplicate-C BLIS Loop Multi-threading

The loop threading is effectively 1, 20, 1, 1, 1
Not identical (no synchronization)
The third loop is not parallelized
Effective MC is roughly 10x larger for small m
 
Potential Modifications
 
Simple
Switch point to swap to alternative loop threading.
 
Complicated
Determine loop threading based on m, k, and n.
 
In both cases the 2nd loop needs to be parallelizable
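
A sketch of the simple variant: a dispatch wrapper that swaps to the duplicate-C split below a switch point on m. The wrapper name is hypothetical, dgemm_split_k refers to the earlier sketch, and the threshold follows the m < 960 figure above but would need tuning in practice.

#include <cblas.h>

void dgemm_split_k(int m, int k, int n, const double *A, const double *B,
                   double *C, int nparts);   /* from the earlier sketch */

/* Hypothetical dispatch: below a switch point on m, use the duplicate-C
 * split along k; otherwise use the library's default threaded DGEMM. */
void dgemm_fat_by_thin(int m, int k, int n, const double *A, const double *B,
                       double *C, int nthreads) {
    const int m_switch = 960;   /* roughly where per-thread MC drops below 96 */
    if (m < m_switch && k >= nthreads)
        dgemm_split_k(m, k, n, A, B, C, nthreads);
    else
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, 1.0, A, k, B, n, 0.0, C, n);
}

The complicated variant would replace the single threshold with a choice of loop threading computed from m, k, and n.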
