Swiss-Tx Class Supercomputers: Vision for System and Resource Management

 
Vision for System and Resource Management
of the Swiss-Tx class of Supercomputers
 
Josef Nemecek
ETH Zürich & Supercomputing Systems AG
 
09.03.2000
 
SOS Workshop 2000 (New Orleans, LA)
 
 
Agenda
 
The Supercomputer Lifecycle then and now
The Swiss-T1 Management SW: COSMOS (Commodity Supercomputer Management Operating System)
The goals of COSMOS
The concept of COSMOS
Implementation of COSMOS
Software Integration with existing Parts
Roadmap of COSMOS
Supercomputers – Then and Now
 
Development by vendor
Hardware was hand-made
Software was tailored for hardware
Customers just had to order out of the vendor's catalogue
[Figure: lifecycle stages Need, Order, Test, Manage; cost: $$$]
Supercomputers – Then and Now
 
System looks like a puzzle
Commodity parts, multiple vendors
Zoo of interacting software components
Individual system management
Millions of lines of code (scripts, daemons)
[Figure: lifecycle stages Thought, Design, Simulation, Manage; further labels Needs, Architecture, Topology, Specification; cost: $$$ & t]

 
 
COSMOS – Goals
 
Integrated management for whole lifecycle
Design the supercomputer on-line
Simulate the supercomputer performance on-line
Build the designed and simulated supercomputer
Manage the built supercomputer
Complete run-time system management
Fault-tolerance on all (or most) system levels
Remote manageability of the whole supercomputer
Low run-time overhead for the system management
 
 
COSMOS – Supercomputer Design
 
Architecture selection
SAN technology
Nodes technology
Topology selection
Every topology has its +/–
Resource usage
Cost of the supercomputer
Space, electrical power
Performance estimation
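
The trade-off behind "every topology has its +/–" can be sketched with a small comparison. The three topologies, the node count, and the use of link count as a cost proxy are illustrative assumptions, not taken from the slides; the formulas are the standard ones for these graphs.

```python
# Hypothetical design-time comparison: link count (a rough cost proxy)
# versus diameter (worst-case hop count, a rough latency proxy).

def ring(n):
    """Ring of n >= 3 nodes: n links, diameter floor(n / 2)."""
    return {"links": n, "diameter": n // 2}

def torus_2d(side):
    """side x side 2D torus (side >= 3): 4 links per node, each shared by 2 nodes."""
    n = side * side
    return {"links": 2 * n, "diameter": 2 * (side // 2)}

def fully_connected(n):
    """Every node linked to every other node: n(n-1)/2 links, diameter 1."""
    return {"links": n * (n - 1) // 2, "diameter": 1}

if __name__ == "__main__":
    candidates = {
        "ring(16)": ring(16),
        "torus 4x4": torus_2d(4),
        "full(16)": fully_connected(16),
    }
    for name, metrics in candidates.items():
        print(f"{name:10s} links={metrics['links']:4d} diameter={metrics['diameter']}")
```

For 16 nodes the ring is cheapest in links but has the longest worst-case path, the fully connected network is the reverse, and the torus sits in between.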
 
 
COSMOS – Goals
 
Single-system view of whole system
Allows one-point system management
Allows remote system management
High availability of the system management
Allows high over-all system up-times
Allows dynamic configuration changes
Modular software design
System-independent concept & design
Interfaces to existing management software modules
COSMOS – Concept
Configuration: Control the system
Monitoring: Observe the system
Planning: When? Who? What?
Security: Stability & independence
Faults & Traps: Help the system
Accounting: Charge the usage
 
Complete, integrated system management
Remote management from everywhere
No administrative programming necessary
 
 
COSMOS – Implementation
System Management
Node Management: State control and monitoring of the nodes, accounting
SAN Management: SAN-dependent management and monitoring, accounting
Process Management: Support of and co-operation with parallel environments such as MPI/FCI
Resource Management: Priorities, allocation, queues
Storage Management: Vendor-dependent storage management software
LAN Management: SNMP-based management of the LAN components in use
User Interface: User-privilege-based management and monitoring
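
As an illustration of the resource-management keywords (priorities, allocation, queues), here is a minimal sketch of a priority queue that allocates free nodes to waiting jobs. The class `JobQueue`, its methods, and the convention that a lower number means a higher priority are hypothetical; the slides do not describe how COSMOS implements this.

```python
import heapq
import itertools

class JobQueue:
    """Hypothetical queue: jobs wait by priority and start when enough nodes are free."""

    def __init__(self, free_nodes):
        self.free_nodes = free_nodes
        self._heap = []
        self._order = itertools.count()  # tie-breaker: FIFO within one priority

    def submit(self, name, priority, nodes):
        # Lower priority number = more urgent (an assumption of this sketch).
        heapq.heappush(self._heap, (priority, next(self._order), name, nodes))

    def dispatch(self):
        """Start jobs in priority order for as long as the head job fits."""
        started = []
        while self._heap and self._heap[0][3] <= self.free_nodes:
            _, _, name, nodes = heapq.heappop(self._heap)
            self.free_nodes -= nodes
            started.append(name)
        return started

if __name__ == "__main__":
    queue = JobQueue(free_nodes=8)
    queue.submit("a", priority=2, nodes=4)
    queue.submit("b", priority=1, nodes=6)
    queue.submit("c", priority=1, nodes=2)
    print(queue.dispatch(), queue.free_nodes)
```

The sketch stops at the first job that does not fit instead of skipping ahead, so a large high-priority job is never starved by smaller ones; a real scheduler might choose differently.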
COSMOS – Implementation
[Figure: one COSMOS Center in the Management Center; one COSMOS Agent on each of Nodes 0 to 3; Processes 0 to 7 run on the nodes]
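
The center/agent split can be sketched as a central object that polls one agent per node. The names `Agent`, `Center`, `report`, `poll`, and `failed_nodes` are hypothetical, and in-process method calls stand in for whatever network protocol the real system uses, which the slides do not specify.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """Hypothetical per-node agent that reports local state on request."""
    node_id: int
    processes: list = field(default_factory=list)
    healthy: bool = True

    def report(self):
        return {
            "node": self.node_id,
            "healthy": self.healthy,
            "processes": list(self.processes),
        }

class Center:
    """Hypothetical management center holding one agent handle per node."""

    def __init__(self, agents):
        self.agents = {agent.node_id: agent for agent in agents}

    def poll(self):
        # Collect one status report from every agent.
        return [agent.report() for agent in self.agents.values()]

    def failed_nodes(self):
        return [r["node"] for r in self.poll() if not r["healthy"]]

if __name__ == "__main__":
    agents = [Agent(node_id=i) for i in range(4)]
    center = Center(agents)
    agents[2].healthy = False  # simulate a fault on node 2
    print(center.failed_nodes())
```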
 
 
Gridware GRD/Codine
 
Powerful resource management
Integrates resource and batch management
Ticket-based job scheduling scheme
Well-defined interfaces
Some drawbacks at this moment
GRD/Codine is not topology-aware
GRD/Codine is a commercial product
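
To make "topology-aware" concrete, the following sketch picks free nodes that sit close together on a ring, which a scheduler unaware of the topology would have no reason to prefer. The ring topology and the function `pick_compact` are assumptions for illustration only; this is neither GRD/Codine nor COSMOS code.

```python
def pick_compact(free, ring_size, k):
    """Return k free nodes spanning the shortest arc of a ring, or None if too few are free."""
    free = sorted(free)
    if k <= 0 or len(free) < k:
        return None
    best = None
    for i in range(len(free)):
        # k consecutive free nodes, walking clockwise from free[i] with wrap-around.
        window = [free[(i + j) % len(free)] for j in range(k)]
        span = (window[-1] - window[0]) % ring_size  # hops from first to last node
        if best is None or span < best[0]:
            best = (span, window)
    return best[1]

if __name__ == "__main__":
    # Nodes 2, 3 and 4 of an 8-node ring are busy.
    print(pick_compact({0, 1, 5, 6, 7}, ring_size=8, k=3))
```

A scheduler that ignores the topology might hand out nodes 0, 1 and 5, which span five hops on this ring; the compact choice spans two.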
 
 
COSMOS – Interaction with GRD/Codine
[Figure: COSMOS modules (System Management, Node Management, SAN Management, Process Management, Storage Management, LAN Management, User Interface) connected to GRD/Codine; labels on the GRD/Codine side: Node Monitoring, Process Monitoring, Resource Management, User Interface, Accounting]
 
 
Roadmap of COSMOS Development
 
Prototype release plan for COSMOS
1Q2000: Centralised process and SAN management
2Q2000: Distributed system management framework
3Q2000: Complete non-interactive management
4Q2000: Complete interactive management
Interaction between COSMOS & GRD/Codine
Transfer of topology and configuration information
Exchange of monitoring information