Setting Up Parallel Universe in Your Pool - A Guide

Learn about setting up a parallel universe in your pool using HTCondor, understand when to use it, and when not to use it based on job requirements. Explore examples, job life cycle, setup scripts, and more.

  • Parallel Universe
  • HTCondor
  • Job Execution
  • MPI
  • High Throughput Computing

Presentation Transcript


  1. PU! Setting up parallel universe in your pool and when (not!) to use it. HTCondor Week 2018, Madison, WI. Jason Patton (jpatton@cs.wisc.edu), Center for High Throughput Computing, Department of Computer Sciences, University of Wisconsin-Madison.

  2. Imagine some software that:
       • Requires more resources than a single execute machine can provide, or
       • Needs a list of machines prior to runtime, or
       • Assumes child processes will run (and exit) on all machines at the same time

     Examples:
       • MPI
       • Master-Worker frameworks (some, not all)
       • Server-Client testing (networking, database)

  3. What is parallel universe?
       • All slots for a job are claimed by the dedicated scheduler before the job runs
       • Each slot is given a node number ($(NODE))
       • Execution begins simultaneously
       • By default, all slots terminate when the executable on the "Node 0" slot exits
       • Slots share a single job ad and a spool directory on the submit machine (for condor_chirp)

  4. Use parallel universe when a job:
       • Cannot be made to fit on a single machine
       • Needs a list of machines prior to runtime
       • Needs simultaneous execution on slots

     Classic example: you have an MPI job that cannot fit on one machine, and you don't have an HPC cluster.

     Example helper script for Open MPI: openmpiscript

  5. Don't use parallel universe when submitting MPI jobs that could be made to fit on a single machine:
       • Break these up into multicore vanilla universe jobs
       • MPI works well on single machines (core binding, shared memory, single filesystem, etc.)
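The advice on slide 5 can be sketched as a small Bash function that prints a vanilla universe submit description for an MPI job confined to one machine. This is only an illustration: the wrapper run_mpi.sh, the program name my_mpi_program, and the resource values are hypothetical and do not come from the slides.

```bash
#!/usr/bin/env bash
# Sketch: print a vanilla universe submit description for an MPI job
# that fits on a single machine.
# Assumption: run_mpi.sh is a hypothetical wrapper that would run
# something like: mpirun -np "$1" ./my_mpi_program
vanilla_mpi_submit() {
    local cpus=$1 memory=$2
    # \$ keeps $(CLUSTER) literal so HTCondor, not Bash, expands it.
    cat <<EOF
universe = vanilla
executable = run_mpi.sh
arguments = ${cpus}
transfer_input_files = my_mpi_program
request_cpus = ${cpus}
request_memory = ${memory}
output = out.\$(CLUSTER)
error = err.\$(CLUSTER)
log = log.\$(CLUSTER)
queue
EOF
}

# Example: an 8-core job with 8 GB of memory (illustrative values).
vanilla_mpi_submit 8 8G
```

The output could be redirected to a file and passed to condor_submit. Requesting all cores in one slot keeps the MPI ranks on one machine, which is the point of the slide.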

  6. Example parallel universe job life cycle
       1. machine_count = 8
       2. Dedicated scheduler claims idle slots (slots become Claimed/Idle) until it has 8 slots that match job requirements
       3. Job execution begins on all slots simultaneously
       4. Processes on all slots terminate when the process on node 0 exits
       5. Slots return to Claimed/Idle state

  7. Example parallel universe job

     Submit file:

       universe = parallel
       executable = setup.sh
       arguments = $(NODE)
       transfer_input_files = master.sh,worker.sh
       output = out.$(CLUSTER).$(NODE)
       error = err.$(CLUSTER).$(NODE)
       log = log.$(CLUSTER)
       request_cpus = 1
       request_memory = 1G
       machine_count = 8
       queue

     setup.sh:

       #!/usr/bin/env bash
       node=$1
       # check if on node 0
       if (( $node == 0 )); then
           # run master program
           ./master.sh
       else
           # run worker program
           ./worker.sh
       fi

     Slide callout: "queue 2?"
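To make the $(CLUSTER) and $(NODE) macros on slide 7 concrete, the following sketch lists the per-node output and error file names a job like this would be expected to produce, assuming nodes are numbered from 0 to machine_count - 1. The cluster ID 1234 is a made-up value for illustration.

```bash
#!/usr/bin/env bash
# Sketch: list the out.<cluster>.<node> and err.<cluster>.<node> names
# implied by the submit file on slide 7.
# Assumption: node numbers run from 0 to machine_count - 1.
node_output_files() {
    local cluster=$1 machine_count=$2 node
    for (( node = 0; node < machine_count; node++ )); do
        echo "out.${cluster}.${node} err.${cluster}.${node}"
    done
}

# Hypothetical cluster ID 1234 with machine_count = 8.
node_output_files 1234 8
```

Node 0 is the one running master.sh in the slide's setup.sh, so its files are usually the first place to look when a job exits early.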

  8. Example parallel universe job life cycle: Job Submitted

       $ condor_status
       Name            State      Activity
       slot1@execute1  Claimed    Busy
       slot2@execute1  Claimed    Busy
       slot3@execute1  Unclaimed  Idle
       slot4@execute1  Claimed    Busy
       slot1@execute2  Unclaimed  Idle
       slot2@execute2  Unclaimed  Idle
       slot3@execute2  Claimed    Busy
       slot4@execute2  Unclaimed  Idle
       slot1@execute3  Unclaimed  Idle
       slot2@execute3  Unclaimed  Idle

  9. Example parallel universe job life cycle: Job Submitted

       $ condor_status
       Name            State      Activity
       slot1@execute1  Claimed    Busy
       slot2@execute1  Claimed    Busy
       slot3@execute1  Unclaimed  Idle
       slot4@execute1  Claimed    Busy
       slot1@execute2  Unclaimed  Idle
       slot2@execute2  Unclaimed  Idle
       slot3@execute2  Claimed    Busy
       slot4@execute2  Unclaimed  Idle
       slot1@execute3  Unclaimed  Idle
       slot2@execute3  Unclaimed  Idle

  10. Example parallel universe job life cycle: Negotiation Cycle #1

       $ condor_status
       Name            State      Activity
       slot1@execute1  Claimed    Busy
       slot2@execute1  Claimed    Busy
       slot3@execute1  Claimed    Idle
       slot4@execute1  Claimed    Busy
       slot1@execute2  Claimed    Idle
       slot2@execute2  Claimed    Idle
       slot3@execute2  Claimed    Busy
       slot4@execute2  Claimed    Idle
       slot1@execute3  Claimed    Idle
       slot2@execute3  Claimed    Idle

  11. Example parallel universe job life cycle: Negotiation Cycle #2

       $ condor_status
       Name            State      Activity
       slot1@execute1  Claimed    Busy
       slot2@execute1  Claimed    Busy
       slot3@execute1  Claimed    Idle
       slot4@execute1  Claimed    Busy
       slot1@execute2  Claimed    Idle
       slot2@execute2  Claimed    Idle
       slot3@execute2  Claimed    Busy
       slot4@execute2  Claimed    Idle
       slot1@execute3  Claimed    Idle
       slot2@execute3  Claimed    Idle

  12. Example parallel universe job life cycle

       $ condor_status
       Name            State      Activity
       slot1@execute1  Claimed    Busy
       slot2@execute1  Claimed    Busy
       slot3@execute1  Claimed    Idle
       slot4@execute1  Unclaimed  Idle
       slot1@execute2  Claimed    Idle
       slot2@execute2  Claimed    Idle
       slot3@execute2  Claimed    Busy
       slot4@execute2  Claimed    Idle
       slot1@execute3  Claimed    Idle
       slot2@execute3  Claimed    Idle

  13. Example parallel universe job life cycle: Negotiation Cycle #3

       $ condor_status
       Name            State      Activity
       slot1@execute1  Claimed    Busy
       slot2@execute1  Claimed    Busy
       slot3@execute1  Claimed    Idle
       slot4@execute1  Claimed    Idle
       slot1@execute2  Claimed    Idle
       slot2@execute2  Claimed    Idle
       slot3@execute2  Claimed    Busy
       slot4@execute2  Claimed    Idle
       slot1@execute3  Claimed    Idle
       slot2@execute3  Claimed    Idle

  14. Example parallel universe job life cycle: Negotiation Cycle #4

       $ condor_status
       Name            State      Activity
       slot1@execute1  Claimed    Busy
       slot2@execute1  Claimed    Busy
       slot3@execute1  Claimed    Idle
       slot4@execute1  Claimed    Idle
       slot1@execute2  Claimed    Idle
       slot2@execute2  Claimed    Idle
       slot3@execute2  Claimed    Busy
       slot4@execute2  Claimed    Idle
       slot1@execute3  Claimed    Idle
       slot2@execute3  Claimed    Idle

  15. Example parallel universe job life cycle: Negotiation Cycle #5

       $ condor_status
       Name            State      Activity
       slot1@execute1  Claimed    Busy
       slot2@execute1  Claimed    Busy
       slot3@execute1  Claimed    Idle
       slot4@execute1  Claimed    Idle
       slot1@execute2  Claimed    Idle
       slot2@execute2  Claimed    Idle
       slot3@execute2  Claimed    Busy
       slot4@execute2  Claimed    Idle
       slot1@execute3  Claimed    Idle
       slot2@execute3  Claimed    Idle

  16. Example parallel universe job life cycle

       $ condor_status
       Name            State      Activity
       slot1@execute1  Unclaimed  Idle
       slot2@execute1  Claimed    Busy
       slot3@execute1  Claimed    Idle
       slot4@execute1  Claimed    Idle
       slot1@execute2  Claimed    Idle
       slot2@execute2  Claimed    Idle
       slot3@execute2  Claimed    Busy
       slot4@execute2  Claimed    Idle
       slot1@execute3  Claimed    Idle
       slot2@execute3  Claimed    Idle

  17. Example parallel universe job life cycle: Negotiation Cycle #6

       $ condor_status
       Name            State      Activity
       slot1@execute1  Claimed    Idle
       slot2@execute1  Claimed    Busy
       slot3@execute1  Claimed    Idle
       slot4@execute1  Claimed    Idle
       slot1@execute2  Claimed    Idle
       slot2@execute2  Claimed    Idle
       slot3@execute2  Claimed    Busy
       slot4@execute2  Claimed    Idle
       slot1@execute3  Claimed    Idle
       slot2@execute3  Claimed    Idle

  18. Example parallel universe job life cycle: Job Starts

       $ condor_status
       Name            State      Activity
       slot1@execute1  Claimed    Busy
       slot2@execute1  Claimed    Busy
       slot3@execute1  Claimed    Busy
       slot4@execute1  Claimed    Busy
       slot1@execute2  Claimed    Busy
       slot2@execute2  Claimed    Busy
       slot3@execute2  Claimed    Busy
       slot4@execute2  Claimed    Busy
       slot1@execute3  Claimed    Busy
       slot2@execute3  Claimed    Busy

  19. Example parallel universe job life cycle: Job Completes

       $ condor_status
       Name            State      Activity
       slot1@execute1  Claimed    Idle
       slot2@execute1  Claimed    Busy
       slot3@execute1  Claimed    Idle
       slot4@execute1  Claimed    Idle
       slot1@execute2  Claimed    Idle
       slot2@execute2  Claimed    Idle
       slot3@execute2  Claimed    Busy
       slot4@execute2  Claimed    Idle
       slot1@execute3  Claimed    Idle
       slot2@execute3  Claimed    Idle

  20. Example parallel universe job life cycle: 10 minutes later

       $ condor_status
       Name            State      Activity
       slot1@execute1  Unclaimed  Idle
       slot2@execute1  Claimed    Busy
       slot3@execute1  Unclaimed  Idle
       slot4@execute1  Unclaimed  Idle
       slot1@execute2  Unclaimed  Idle
       slot2@execute2  Unclaimed  Idle
       slot3@execute2  Claimed    Busy
       slot4@execute2  Unclaimed  Idle
       slot1@execute3  Unclaimed  Idle
       slot2@execute3  Unclaimed  Idle
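The life cycle slides track how many slots sit in Claimed/Idle while the dedicated scheduler gathers resources. A rough way to watch this is to tally slots by State/Activity. The sketch below does so with awk over rows of the form "Name State Activity"; the function name summarize_slots is made up for illustration, and the sample input reproduces the "Negotiation Cycle #1" slide.

```bash
#!/usr/bin/env bash
# Sketch: tally slots by State/Activity from "Name State Activity" rows.
# Rows whose first field does not look like slot@host (for example the
# header row) are ignored.
summarize_slots() {
    awk 'NF == 3 && $1 ~ /@/ { count[$2 "/" $3]++ }
         END { for (k in count) print k, count[k] }' | sort
}

# On a real pool, the input could come from something like:
#   condor_status -af Name State Activity
# Here the rows are copied from the "Negotiation Cycle #1" slide.
summarize_slots <<'EOF'
Name State Activity
slot1@execute1 Claimed Busy
slot2@execute1 Claimed Busy
slot3@execute1 Claimed Idle
slot4@execute1 Claimed Busy
slot1@execute2 Claimed Idle
slot2@execute2 Claimed Idle
slot3@execute2 Claimed Busy
slot4@execute2 Claimed Idle
slot1@execute3 Claimed Idle
slot2@execute3 Claimed Idle
EOF
```

A large, long-lived Claimed/Idle count is the throughput cost that later slides warn about.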

  21. Enabling parallel universe in your pool
       1. Choose a submit machine to host the dedicated scheduler
       2. Set DedicatedScheduler on participating execute machines
       3. Adjust other settings (START, RANK, PREEMPT, etc.) to taste
       4. Easy way: modify the example config, condor_config.local.dedicated.resource

  22. Example config (diagram: submit1.wisc.edu and execute1.wisc.edu)

       DedicatedScheduler = "DedicatedScheduler@submit1.wisc.edu"
       START = (Scheduler =?= $(DedicatedScheduler)) || ($(START))
       PREEMPT = Scheduler =!= $(DedicatedScheduler) && ($(PREEMPT))
       SUSPEND = Scheduler =!= $(DedicatedScheduler) && ($(SUSPEND))
       RANK = Scheduler =?= $(DedicatedScheduler)
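Since the settings on slide 22 differ between pools only in the submit host name, they can be templated. The sketch below prints the same five lines for a given host; the function name dedicated_resource_config is made up for illustration, and where the output should be saved depends on how your pool's configuration is laid out.

```bash
#!/usr/bin/env bash
# Sketch: print the execute-machine settings from slide 22 for a given
# submit host that runs the dedicated scheduler.
dedicated_resource_config() {
    local submit_host=$1
    # \$ keeps $(...) literal so HTCondor, not Bash, expands the macros.
    cat <<EOF
DedicatedScheduler = "DedicatedScheduler@${submit_host}"
START = (Scheduler =?= \$(DedicatedScheduler)) || (\$(START))
PREEMPT = Scheduler =!= \$(DedicatedScheduler) && (\$(PREEMPT))
SUSPEND = Scheduler =!= \$(DedicatedScheduler) && (\$(SUSPEND))
RANK = Scheduler =?= \$(DedicatedScheduler)
EOF
}

# Host name taken from the slide.
dedicated_resource_config submit1.wisc.edu
```

The START, PREEMPT, and SUSPEND lines build on whatever values were defined earlier in the configuration, so this fragment is meant to be read after the pool's existing policy.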

  23. Example config (diagram)

       Submit machines shown: submit1.wisc.edu, submit2.wisc.edu, submit3.wisc.edu
       Execute machines shown: execute1.wisc.edu, execute2.wisc.edu, highmem.wisc.edu, gpu.wisc.edu

       The following setting appears twice in the diagram, presumably once each on execute1.wisc.edu and execute2.wisc.edu:

       DedicatedScheduler = "DedicatedScheduler@submit1.wisc.edu"

  24. Don't enable parallel universe if you are particularly concerned about reduced throughput in your pool:
       • Slots sit Claimed/Idle while PU jobs are being scheduled and completed
       • The dedicated scheduler may not schedule dynamic slot claims efficiently
       • If you're not careful about where PU jobs can land, slow networks can hurt performance (see ParallelSchedulingGroup in the manual)
       • Preemption hurts total throughput if enabled

  25. Other config notes
       • Can adjust how long the dedicated scheduler holds on to Claimed/Idle slots: UNUSED_CLAIM_TIMEOUT, see example condor_config.local.dedicated.submit
       • PU jobs usually talk between slots, so check firewall settings
       • PU jobs may be sensitive to shared filesystems and user names
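As a small illustration of the UNUSED_CLAIM_TIMEOUT note, the sketch below converts a hold time in minutes into a config line. It assumes the value is expressed in seconds, which is my understanding of the setting and worth checking against the manual for your version. The 10 minute figure mirrors the "10 minutes later" slide, and the function name is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: build an UNUSED_CLAIM_TIMEOUT line for the submit machine.
# Assumption: the value is in seconds (verify in the HTCondor manual).
unused_claim_timeout_line() {
    local minutes=$1
    echo "UNUSED_CLAIM_TIMEOUT = $(( minutes * 60 ))"
}

# Hold Claimed/Idle slots for 10 minutes, as in the life cycle slides.
unused_claim_timeout_line 10
```

A shorter timeout returns idle claims to the pool sooner, while a longer one lets back-to-back PU jobs reuse slots without renegotiating. The current value on a machine could be inspected with condor_config_val UNUSED_CLAIM_TIMEOUT.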

  26. Parallel Universe Trivia
       • Can you submit PU jobs without your admin having configured your pool for them? No. (Yes, but they will sit idle while the dedicated scheduler searches for nonexistent dedicated resources.)
       • Should all MPI jobs use PU? No, only if they cannot fit on a single machine.
       • Can you submit Docker jobs using PU? Yes!

           universe = docker
           WantParallelScheduling = true

  27. Questions? Example location of the examples directory: /usr/share/doc/condor-8.7.8/examples
