Swiss-Tx Class Supercomputers: Vision for System and Resource Management

Explore the visionary approach by Josef Nemecek from ETH Zurich & Supercomputing Systems AG for managing the Swiss-Tx class of supercomputers. Delve into the evolution of supercomputer lifecycle management, the COSMOS operating system, integrated management goals, and the design architecture for efficient resource allocation and performance estimation.


Presentation Transcript


  1. Vision for System and Resource Management of the Swiss-Tx class of Supercomputers. Josef Nemecek, ETH Zürich & Supercomputing Systems AG

  2. Agenda: The Supercomputer Lifecycle then and now; The Swiss-T1 Management SW: COSMOS (Commodity Supercomputer Management Operating System); The goals of COSMOS; The concept of COSMOS; Implementation of COSMOS; Software Integration with existing Parts; Roadmap of COSMOS. 09.03.2000, SOS Workshop 2000 (New Orleans, LA)

  3. Supercomputers Then and Now: Development by vendor; Hardware was hand-made; Software was tailored for hardware; Customers just had to order out of the vendor's catalogue. Diagram labels: $$$; Need, Order, Test, Manage.

  4. Supercomputers Then and Now: System looks like a puzzle; Commodity parts, multiple vendors; Zoo of interacting software components; Individual system management; Millions of lines of code (scripts, daemons). Diagram labels: $$$ & t; Architecture, Needs, Thought, Design, Simulation, Manage, Topology, Specification.

  5. COSMOS Goals: Integrated management for whole lifecycle; Design the supercomputer on-line; Simulate the supercomputer performance on-line; Build the designed and simulated supercomputer; Manage the built supercomputer; Complete run-time system management; Fault-tolerance on all (or most) system levels; Remote manageability of the whole supercomputer; Low run-time overhead for the system management.

  6. COSMOS Supercomputer Design: Architecture selection; SAN technology; Nodes technology; Topology selection; Every topology has its +/-; Resource usage; Cost of the supercomputer; Space, electrical power; Performance estimation.
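The point that every topology has its pluses and minuses can be illustrated with a rough sketch. The Python below is not part of COSMOS: the function names, the three example topologies, and the flat per-link cost are assumptions made only for illustration. It compares link count (a proxy for cost) with network diameter (a proxy for worst-case latency).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopologyEstimate:
    name: str
    links: int      # number of bidirectional SAN links
    diameter: int   # worst-case hop count between any two nodes


def ring(n: int) -> TopologyEstimate:
    """Ring of n nodes (n >= 3): one link per node, diameter n // 2."""
    return TopologyEstimate("ring", n, n // 2)


def torus_2d(side: int) -> TopologyEstimate:
    """side x side torus with wrap-around links (side >= 3)."""
    n = side * side
    # Each node owns one link in each of the two dimensions.
    return TopologyEstimate("2d-torus", 2 * n, 2 * (side // 2))


def fully_connected(n: int) -> TopologyEstimate:
    """Every pair of nodes is linked directly."""
    return TopologyEstimate("fully-connected", n * (n - 1) // 2, 1)


def link_cost(estimate: TopologyEstimate, cost_per_link: float) -> float:
    """Hypothetical cost model: a flat price per link, nothing else."""
    return estimate.links * cost_per_link
```

For 16 nodes, `ring(16)` needs 16 links with a diameter of 8, `torus_2d(4)` needs 32 links with a diameter of 4, and `fully_connected(16)` needs 120 links with a diameter of 1. A real design tool would also have to account for switches, space, and electrical power, which this sketch ignores.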


  10. COSMOS Goals: Single-system view of whole system; Allows one-point system management; Allows remote system management; High availability of the system management; Allows high over-all system up-times; Allows dynamic configuration changes; Modular software design; System-independent concept & design; Interfaces to existing management software modules.

  11. COSMOS Concept: Configuration (Control the system); Monitoring (Observe the system); Planning (When? Who? What?); Security (Stability & independence); Faults & Traps (Help the system); Accounting (Charge the usage). Complete, integrated system management; Remote management from everywhere; No administrative programming necessary.

  12. COSMOS Implementation: User-privilege-based management and monitoring (User Interface, System Management); State control and monitoring of the nodes, accounting (Node Management); SAN-dependent management and monitoring, accounting (SAN Management); Resource management: priorities, allocation, queues (Resource Management); Support of and co-operation with parallel environments such as MPI/FCI (Process Management); SNMP-based management of used LAN components (LAN Management); Vendor-dependent storage management software (Storage Management).
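The resource management layer is described only as "priorities, allocation, queues". A minimal sketch of how those three ideas fit together might look like the following. The class and method names are hypothetical and nothing here is taken from the COSMOS code; the strict head-of-queue policy is likewise an assumption.

```python
import heapq
from itertools import count


class ResourceManager:
    """Toy node allocator: one priority queue, strict head-of-queue order."""

    def __init__(self, total_nodes: int) -> None:
        self.free_nodes = total_nodes
        self._queue = []          # heap of (priority, arrival, name, nodes)
        self._arrival = count()   # tie-breaker: FIFO among equal priorities

    def submit(self, name: str, nodes: int, priority: int) -> None:
        # Lower number means higher priority.
        heapq.heappush(self._queue, (priority, next(self._arrival), name, nodes))

    def schedule(self) -> list:
        """Start jobs in priority order while the head job still fits."""
        started = []
        while self._queue and self._queue[0][3] <= self.free_nodes:
            _, _, name, nodes = heapq.heappop(self._queue)
            self.free_nodes -= nodes
            started.append(name)
        return started

    def release(self, nodes: int) -> None:
        """Return nodes to the free pool when a job finishes."""
        self.free_nodes += nodes
```

Because the head job blocks everything behind it, a large high-priority job is never starved by smaller ones, at the price of idle nodes. Backfilling smaller jobs would be a different design choice.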

  13. COSMOS Implementation (diagram). Labels: Node 0 to Node 3; COSMOS Agent; Management Center; COSMOS Center; Process 0 to Process 7.
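The diagram suggests a split between per-node agents and a management center. The sketch below models that split in plain Python with no networking. It is an assumption-laden illustration, not the COSMOS protocol: the report fields, the timeout rule, and the placement of two processes per node (read off the flattened diagram) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AgentReport:
    node: int            # node the agent runs on
    processes: list      # process ids currently hosted by that node
    timestamp: float     # time the report was sent, in seconds


@dataclass
class Center:
    timeout: float = 10.0                        # assumed liveness window
    reports: dict = field(default_factory=dict)  # node -> latest report

    def receive(self, report: AgentReport) -> None:
        """Keep only the most recent report per node."""
        self.reports[report.node] = report

    def failed_nodes(self, now: float) -> list:
        """Nodes whose last report is older than the timeout."""
        return sorted(
            node for node, r in self.reports.items()
            if now - r.timestamp > self.timeout
        )

    def process_map(self) -> dict:
        """Single-system view: process id -> node id."""
        return {
            pid: node
            for node, r in sorted(self.reports.items())
            for pid in r.processes
        }
```

A center that has received reports from Node 0 (processes 0 and 1) and Node 1 (processes 2 and 3) can answer "where does process 3 run?" from one place, which is the single-system view the goals slide asks for.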

  14. Gridware GRD/Codine: Powerful resource management; Integrates resource and batch management; Ticket-based job scheduling scheme; Well-defined interfaces; Some drawbacks at this moment; GRD/Codine is not topology-aware; GRD/Codine is a commercial product.
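The slide does not say how the ticket-based scheme works. As a generic illustration of the idea only, and explicitly not GRD/Codine's actual algorithm, the sketch below divides a number of job slots among users in proportion to the tickets they hold, using largest-remainder rounding. The function name and the input format are assumptions.

```python
def allocate_slots(tickets: dict, slots: int) -> dict:
    """Split `slots` among users in proportion to their ticket counts.

    Generic proportional-share sketch with largest-remainder rounding.
    """
    total = sum(tickets.values())
    if total == 0:
        return {name: 0 for name in tickets}

    # Exact fractional share per user, then the integer part of each share.
    exact = {name: slots * t / total for name, t in tickets.items()}
    alloc = {name: int(exact[name]) for name in tickets}

    # Hand the remaining slots to the largest fractional remainders;
    # ties are broken by name so the result is deterministic.
    leftover = slots - sum(alloc.values())
    by_remainder = sorted(tickets, key=lambda n: (alloc[n] - exact[n], n))
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc
```

With tickets `{"a": 50, "b": 30, "c": 20}` and 8 slots, the exact shares are 4.0, 2.4, and 1.6, so the call returns `{"a": 4, "b": 2, "c": 2}`. A topology-aware scheduler would additionally need to know which nodes those slots map to, which is the drawback the slide points out.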

  15. COSMOS Interaction with GRD/Codine (diagram). Labels: User Interface (twice); System Management; Node Management; Node Monitoring; GRD/Codine; SAN Management; Accounting; Resource Management (twice); Process Management; Process Monitoring; LAN Management; Storage Management.

  16. Roadmap of COSMOS Development. Prototype release plan for COSMOS: 1Q2000 Centralised process and SAN management; 2Q2000 Distributed system management framework; 3Q2000 Complete non-interactive management; 4Q2000 Complete interactive management. Interaction between COSMOS & GRD/Codine: Transfer of topology and configuration information; Exchange of monitoring information.

