Dynamic Load Balancing in Block-Sparse Tensor Contractions


This presentation covers load-balancing algorithms for block-sparse tensor contractions: the challenges of dynamic load balancing with Global Arrays (GA), Nxtval performance experiments, an Inspector/Executor design, and a Dynamic Buckets implementation for optimizing load distribution. It emphasizes why load balance matters in high-performance computing applications such as NWChem's Coupled Cluster methods, and the computational challenges posed by complex molecular simulations.





Presentation Transcript


  1. Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions
     David Ozog*, Jeff R. Hammond†, James Dinan†, Pavan Balaji†, Sameer Shende*, Allen Malony*
     *University of Oregon, †Argonne National Laboratory
     2013 International Conference on Parallel Processing (ICPP), October 2, 2013

  2. Outline
     1. NWChem, Coupled Cluster, Tensor Contraction Engine
     2. Load Balance Challenges
     3. Dynamic Load Balancing with Global Arrays (GA)
     4. Nxtval Performance Experiments
     5. Inspector/Executor Design
     6. Performance Modeling (DGEMM and TCE Sort)
     7. Largest Processing Time (LPT) Algorithm
     8. Dynamic Buckets Design and Implementation
     9. Results
     10. Conclusions
     11. Future Work

  3. NWChem and Coupled Cluster
     NWChem: wide range of methods, accuracies, and supported supercomputer architectures. Well-known for its support of many quantum mechanical methods on massively parallel systems. Built on top of Global Arrays (GA) / ARMCI.
     Coupled Cluster (CC): ab initio, i.e., highly accurate; solves an approximate Schrödinger equation.
     Accuracy hierarchy: CCSD < CCSD(T) < CCSDT < CCSDT(Q) < CCSDTQ
     Respective computational costs: O(n^6), O(n^7), O(n^8), O(n^9), O(n^10)
     Respective storage costs: O(n^4), O(n^4), O(n^6), O(n^6), O(n^8)
     *Photos from nwchem-sw.org

  4. NWChem and Coupled Cluster (continued)
     Same content as the previous slide, with a Global Arrays diagram: per-process distributed memory spaces presented as a single global address space.
     *Diagram from GA tutorial (ACTS 2009)

  5. DGEMM Tasks - Load Imbalance
     In CCSX (X = D, T, Q), one tensor contraction contains between one hundred and one million DGEMMs.
     MFLOPs per task depend on:
     - Number of atoms
     - Spin and spatial symmetry
     - Accuracy of the chosen basis
     - The tile size

  6. Computational Challenges
     Example systems: water clusters, benzene, macro-molecules, QM/MM (ranging from highly symmetric to asymmetric).
     Load balance is crucially important for performance, and obtaining optimal load balance is an NP-Hard problem.
     *Photos from nwchem-sw.org

  7. GA Dynamic Load Balancing Template

  15. GA Dynamic Load Balancing Template
     Works best when:
     - On a single node (in SysV shared memory)
     - Time spent in FOO(a) is huge
     - On high-speed interconnects
     - The number of simultaneous calls is reasonably small (less than 1,000)
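The template slides above are code figures that did not survive transcription. The canonical GA pattern they describe is a shared counter (nxtval) that each process fetches-and-increments to claim its next task id. A minimal sketch in Python, simulating the counter with a lock-protected class (the names Nxtval and worker, and the squaring stand-in for FOO(a), are illustrative, not NWChem's API):

```python
from threading import Thread, Lock

class Nxtval:
    """Shared task counter mimicking GA's nxtval(): an atomic fetch-and-add."""
    def __init__(self):
        self._n = 0
        self._lock = Lock()

    def __call__(self):
        with self._lock:
            n = self._n
            self._n += 1
            return n

def worker(nxtval, tasks, results):
    """GA-style loop: fetch the next global task id, work, repeat until done."""
    t = nxtval()
    while t < len(tasks):
        results[t] = tasks[t] ** 2   # stand-in for FOO(a)
        t = nxtval()

tasks = list(range(100))
results = [None] * len(tasks)
counter = Nxtval()
threads = [Thread(target=worker, args=(counter, tasks, results)) for _ in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
assert results == [t * t for t in tasks]  # every task ran exactly once
```

Because every claim goes through one counter, the scheme balances load automatically but serializes on the counter's host, which is exactly the contention the Nxtval experiments below measure.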

  16. Nxtval - Performance Experiments
     TAU profiling: 14 water molecules, aug-cc-pVDZ, 123 nodes, 8 ppn. Nxtval consumes a large percentage of the execution time.
     Flooding micro-benchmark: the proportion of time spent in Nxtval increases with more participating processes. When the arrival rate exceeds the processing rate, the process hosting the counter must employ buffering and flow control.

  17. Nxtval Performance Experiments - Strong Scaling
     10 water molecules (aDZ) and 14 water molecules (aDZ), 8 processes per node. The percentage of overall execution time spent in Nxtval increases with scaling.

  18. Inspector/Executor Design
     1. Inspector: calculate memory requirements; remove null tasks; collate the task list.
     2. Task Cost Estimator, two options: use performance models (deduced off-line), or load gettimeofday() measurements from previous iteration(s).
     3. Static Partitioner: partition into N groups, where N is the number of MPI processes; minimize load imbalance according to the cost estimates; write task-list information for each process/contraction to volatile memory.
     4. Executor: launch all tasks.
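The four phases above can be sketched end to end. This is a hedged illustration, not NWChem's implementation: the Task record, its fields, and the cost values are invented for the example, and the partitioner uses the greedy least-loaded rule that the LPT slides describe:

```python
from dataclasses import dataclass

@dataclass
class Task:
    tile_id: int
    cost: float      # estimated by a performance model or prior timings
    is_null: bool    # e.g. a zero block by spin/spatial symmetry

def inspect(tasks):
    """Inspector: drop null tasks and collate the remainder into one list."""
    return [t for t in tasks if not t.is_null]

def partition(tasks, nprocs):
    """Static partitioner: assign each task to the least-loaded rank so far."""
    plan = [[] for _ in range(nprocs)]
    load = [0.0] * nprocs
    for t in sorted(tasks, key=lambda t: t.cost, reverse=True):
        r = load.index(min(load))
        plan[r].append(t)
        load[r] += t.cost
    return plan, load

def execute(plan, rank):
    """Executor: each rank runs only its pre-assigned tasks, no shared counter."""
    return [t.tile_id for t in plan[rank]]

tasks = [Task(i, cost=(i % 7) + 1.0, is_null=(i % 5 == 0)) for i in range(20)]
plan, load = partition(inspect(tasks), nprocs=4)
print(max(load) - min(load))  # residual load imbalance of the static plan
```

The key trade against the Nxtval template: contention on a shared counter disappears, but the plan is only as good as the cost estimates, which is why the next slides build performance models.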

  19. Performance Modeling - DGEMM
     DGEMM computes C = αAB + βC, where A(m,k), B(k,n), and C(m,n) are 2D matrices and α, β are scalar coefficients.
     Our performance model accounts for:
     - (mn) dot products of length k
     - The corresponding (mn) store operations into C
     - m loads of size k from A
     - n loads of size k from B
     The coefficients a, b, c, and d are found by solving a nonlinear least squares problem (in Matlab).

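If each of the four counted quantities contributes linearly, the fit reduces to ordinary least squares (the paper solves a nonlinear variant in Matlab; the specific model form T ≈ a·mnk + b·mn + c·mk + d·nk below, and the synthetic coefficients, are illustrative assumptions). A self-contained sketch using the normal equations and a small Gaussian-elimination solver:

```python
def dgemm_features(m, n, k):
    # mn dot products of length k; mn stores; m and n loads of size k.
    return [m * n * k, m * n, m * k, n * k]

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def fit(samples, times):
    """Ordinary least squares via the normal equations X^T X a = X^T t."""
    X = [dgemm_features(*s) for s in samples]
    XtX = [[sum(row[p] * row[q] for row in X) for q in range(4)] for p in range(4)]
    Xtt = [sum(X[i][p] * times[i] for i in range(len(X))) for p in range(4)]
    return solve(XtX, Xtt)

# Recover known coefficients from synthetic, noise-free timings.
true = [2e-9, 8e-9, 5e-9, 5e-9]
samples = [(m, n, k) for m in (8, 16, 32, 64) for n in (8, 24, 48) for k in (8, 40, 96)]
times = [sum(c * f for c, f in zip(true, dgemm_features(*s))) for s in samples]
coeffs = fit(samples, times)
print(coeffs)
```

With noisy measured timings, as in the paper, the same machinery yields a least-squares estimate rather than an exact recovery.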

  21. Performance Modeling - TCE Sort
     TCE sorts are actually matrix permutations, so a 3rd-order polynomial fit in the tile size (in bytes) suffices. The data always fits in the L2 cache on this architecture. The measurements are somewhat noisy, but that's acceptable.

  22. Largest Processing Time (LPT) Algorithm
     A polynomial-time algorithm applied to an NP-Hard problem, proven to be a 4/3-approximation by Ronald Graham.*
     1. Sort tasks by cost in descending order.
     2. Assign each task to the least-loaded process so far.
     *SIAM Journal on Applied Mathematics, Vol. 17, No. 2 (Mar. 1969), pp. 416-429.


  24. LPT - Binary Min-Heap
     1. Initialize a heap with N nodes (N = number of processes), each having zero cost.
     2. Perform an IncreaseMin() operation for each new cost from the sorted list of tasks.
     IncreaseMin() is quite efficient because UpdateRoot() often completes in O(1) time. This is far more efficient than the naïve approach of iterating through an array to find the minimum. Execution time for this phase is negligible.
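The two steps above can be sketched with Python's standard heapq module; the pop-then-push pair plays the role of IncreaseMin() (the paper's heap replaces the root in place and sifts down, which is the same asymptotic cost). The task costs are an invented example:

```python
import heapq

def lpt(costs, nprocs):
    """Largest Processing Time: sort tasks by cost descending, then assign
    each to the currently least-loaded process, tracked by a binary min-heap."""
    heap = [(0.0, rank) for rank in range(nprocs)]   # (load, rank), all zero
    assignment = [[] for _ in range(nprocs)]
    for task, cost in sorted(enumerate(costs), key=lambda x: -x[1]):
        load, rank = heapq.heappop(heap)             # least-loaded process
        assignment[rank].append(task)
        heapq.heappush(heap, (load + cost, rank))    # grow the min, restore heap
    return assignment, max(load for load, _ in heap)

costs = [7, 7, 6, 6, 5, 5, 4, 4, 4]   # nine tasks, total cost 48
assignment, makespan = lpt(costs, 3)
print(makespan)  # → 16
```

Here LPT finds a perfect split (48/3 = 16 per process); in general Graham's bound guarantees the makespan is within 4/3 of optimal.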

  25. LPT - Load Balance
     (a) Original with Nxtval, measured; (b) Inspector/Executor with Nxtval, measured; (c) LPT, 1st iteration; (d) LPT, subsequent iterations.

  26. Dynamic Buckets Design

  27. Dynamic Buckets Implementation
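The design and implementation slides above are figures that did not survive transcription. As a hedged sketch of the idea named in the conclusions, combining static cost estimates with dynamic scheduling: tasks are grouped into buckets of roughly equal predicted cost, and processes then claim whole buckets through an Nxtval-style shared counter. The bucketing heuristic, bucket count, and costs below are illustrative assumptions, not the paper's exact algorithm:

```python
from threading import Thread, Lock

def make_buckets(costs, nbuckets):
    """Group tasks (sorted by predicted cost) into roughly equal-cost buckets."""
    order = sorted(range(len(costs)), key=lambda t: -costs[t])
    target = sum(costs) / nbuckets
    buckets, cur, acc = [], [], 0.0
    for t in order:
        cur.append(t)
        acc += costs[t]
        if acc >= target and len(buckets) < nbuckets - 1:
            buckets.append(cur)
            cur, acc = [], 0.0
    buckets.append(cur)
    return buckets

class Counter:
    """Stand-in for GA's nxtval fetch-and-add, now counting buckets, not tasks."""
    def __init__(self):
        self.n, self.lock = 0, Lock()
    def __call__(self):
        with self.lock:
            n, self.n = self.n, self.n + 1
            return n

def worker(counter, buckets, done):
    b = counter()
    while b < len(buckets):
        done.extend(buckets[b])   # run every task in the claimed bucket
        b = counter()

costs = [(i % 9) + 1.0 for i in range(60)]
buckets = make_buckets(costs, nbuckets=16)   # more buckets than workers
done, counter = [], Counter()
threads = [Thread(target=worker, args=(counter, buckets, done)) for _ in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
assert sorted(done) == list(range(60))
```

Claiming coarse buckets instead of individual tasks cuts counter traffic by roughly the bucket size while keeping enough slack for the dynamic counter to absorb cost-model error.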

  28. Dynamic Buckets - Load Balance
     (a) LPT, predicted; (b) LPT, measured; (c) Dynamic Buckets, predicted; (d) Dynamic Buckets, measured.

  29. I/E Results Nitrogen - CCSDT Benzene - CCSD

  30. 10-H2O Cluster Results (DB) CCSD_t2_7_3 CCSD_t2_7

  31. Conclusions
     1. Nxtval can be expensive at large scales.
     2. Static partitioning can fix the problem, but has weaknesses: it requires a performance model, and noise degrades the results.
     3. Dynamic Buckets is a viable alternative and requires few changes to GA applications.
     4. Solving load-balance issues differs from problem to problem; work needs to be done to pinpoint why, and what to do about it.

  32. Future Work (Research)
     1. Cyclops Tensor Framework (CTF)
     2. DAG scheduling of tensor contractions
     3. What happens with accelerators (MIC/GPU)? Performance modeling, and balancing load across both CPU and device.
     4. Comparison with hierarchical distributed load balancing, work stealing, etc.
     5. Hypergraph partitioning / data locality
