Understanding HTCondor Grid Universe: Job Management Capabilities and Submit File Configuration

Explore the HTCondor Grid Universe, which offers the same job management capabilities for high throughput computing as a local HTCondor pool. Learn about job delegation, fault tolerance, and the back-end types supported, and understand how to configure submit files for the Grid Universe and how grid universe jobs differ from normal HTCondor jobs.



Presentation Transcript


  1. HTCondor's Grid Universe
     ISGC 2019
     Center for High Throughput Computing, Department of Computer Sciences, University of Wisconsin-Madison

  2. HTCondor Grid Universe Jobs
     - HTCondor for the grid
     - Same job management capabilities as a local HTCondor pool
     - Use other scheduling systems' resources
     - Sometimes referred to as Condor-G

  3. Job Management Interface
     - Local, persistent job queue
     - Job policy expressions (see the sketch below)
     - Job activity logs
     - Workflows (DAGMan)
     - Fault tolerance
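A minimal submit-file sketch of policy expressions and an activity log, not taken from the slides: log, periodic_release, and on_exit_remove are standard submit commands, while the executable name and the exact expressions are illustrative assumptions.

     # write an activity log for this job
     log = job.log
     # policy expression: automatically release a held job up to three times
     periodic_release = (NumSystemHolds < 3)
     # policy expression: keep (and re-run) the job unless it exited with code 0
     on_exit_remove = (ExitCode =?= 0)
     executable = my_program.exe
     queue 1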

  4. Grid Universe
     - All handled in your submit file
     - Supports a number of back-end types:
       Globus GRAM, CREAM, NorduGrid ARC, HTCondor, PBS, LSF, SGE, SLURM, UNICORE, SSH, Amazon EC2, Google Compute Engine, Microsoft Azure, BOINC
     - Future: Kubernetes???

  5. Grid is a Misnomer
     - The grid universe is used any time a job is submitted to a different job management/queuing system:
       - Grid services: GRAM, CREAM, ARC
       - A different HTCondor schedd
       - Local batch systems: PBS, LSF, SLURM, SGE
       - Cloud services: EC2, Google Compute Engine, Azure
       - Volunteer computing: BOINC

  6. Grid Universe Submit File
     http://htcondor.org/manual/v8.8/TheGridUniverse.html
     - Needs two lines:
       Universe = grid
       GridResource = <grid type> <server>
     - Specify the service type and server location
     - Example GridResource values:
       condor submit.foo.edu cm.foo.edu
       gt5 portal.foo.edu/jobmanager-pbs
       cream creamce.foo.edu/cream-pbs-glow
       nordugrid arc.foo.edu
       batch pbs

  7. Vanilla Universe Submit File
     executable = my_program.exe
     input = in.dat
     output = out.dat
     queue 1

  8. Grid Universe Submit File
     universe = grid
     grid_resource = batch slurm
     executable = my_program.exe
     input = in.dat
     output = out.dat
     queue 1
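Once the file is saved (the name my_job.sub below is an assumption), the job is submitted and watched with the same tools as any other HTCondor job:

     $ condor_submit my_job.sub
     $ condor_q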

  9. Differences from Normal HTCondor Jobs
     - Vanilla universe: resource acquisition
       Job is sent directly to a machine that can start running it immediately
     - Grid universe: job delegation
       Job is sent to an alternate scheduling system and may sit idle there while resources are available elsewhere

  10. Differences from Normal HTCondor Jobs
     - No matchmaking:
       - Specify the destination with GridResource
       - No Requirements or Rank
       - Resource requests often ignored
     - Run-time features unavailable: condor_ssh_to_job, condor_tail, condor_chirp

  11. Differences from Normal HTCondor Jobs
     - Information about job execution may be lacking:
       - Job exit code
       - Resource usage

  12. Grid Job Attributes
     - GridJobId: job ID from the remote server
     - GridJobStatus: job status from the remote server
     - LastRemoteStatusUpdate: time the job status was last checked
     - GridResourceUnavailableTime: time when the remote server went down
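These attributes live in the job ad, so they can be printed with condor_q's autoformat option; the job ID 42.0 here is a made-up example:

     $ condor_q 42.0 -af GridJobId GridJobStatus LastRemoteStatusUpdate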

  13. Two Common Uses for Grid Universe Jobs
     1. Pilot factories (e.g. glideinWMS) submit a grid universe job to send pilot jobs to a remote site (leveraging the common interface)
     2. HTCondor-CE routes incoming jobs to a site's local scheduler by submitting a grid universe job

  14. Gridmanager Daemon
     - Runs under the schedd
     - Similar to the shadow
     - Handles all management of grid jobs
     - A single instance manages all grid jobs for a user

  15. Grid ASCII Helper Protocol (GAHP)
     - Runs under the gridmanager
     - Encapsulates grid client libraries in a separate process
     - Simple ASCII protocol
     - An easy way to use client libraries when they can't be linked directly with the gridmanager
     http://htcondor.org/gahp/

  16-21. How It Works (diagram sequence)
     A pilot factory submits 600 grid jobs to its local schedd. The schedd starts a gridmanager, and the gridmanager starts a GAHP. The GAHP forwards the jobs to the CREAM service on the remote grid resource, CREAM submits them to the site's LSF batch system, and the user jobs run there.

  22. HTCondor-CE Use of Grid Universe Jobs (diagram)
     On the CE host: (1) a grid job arrives at the CE schedd, (2) the job router creates a routed copy of the job, (3) the schedd starts a gridmanager, and (4) the gridmanager submits the routed job to the site's local scheduler with qsub, sbatch, etc.
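As a rough illustration of step 2, a route in the older JOB_ROUTER_ENTRIES ClassAd syntax might turn incoming jobs into grid universe jobs bound for a local SLURM scheduler; the route name is invented and the exact syntax should be checked against the HTCondor-CE documentation:

     # condor_config fragment (illustrative sketch, not from the slides)
     # TargetUniverse 9 is the grid universe
     JOB_ROUTER_ENTRIES = \
       [ name = "Local_SLURM"; \
         TargetUniverse = 9; \
         GridResource = "batch slurm"; ]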

  23. Network Connectivity
     - Outbound connections only for most job types
     - GRAM requires incoming connections:
       Need 2 open ports per <user, X509 DN> pair

  24. Authentication
     - The GAHP acts as the user with the remote service
     - Destination service is local: UID-based authentication
       E.g. PBS, LSF, SLURM, HTCondor
     - Destination service is remote: X.509 proxy (possibly with VOMS attributes)
       Automatically forwards the refreshed proxy
       E.g. HTCondor-CE, GRAM, CREAM, ARC CE
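For the remote case, the proxy is named in the submit file with the standard x509userproxy command; the path below is a typical but assumed location:

     x509userproxy = /tmp/x509up_u1000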

  25. HELD Status
     - Jobs will be held when HTCondor needs help with an error
     - On release, HTCondor will retry
     - The reason for the hold is saved in the job ad and the user log
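A quick way to read the hold reason and retry, using standard tools (the job ID 42.0 is hypothetical):

     $ condor_q 42.0 -af HoldReason
     $ condor_release 42.0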

  26. Common Errors: Authentication
     - Hold reason may be misleading
     - User may not be authorized by the CE
     - Condor-G may not have access to all Certificate Authority files
     - User's proxy may have expired
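An expired proxy is the easiest of these to rule out; the usual proxy client tools report the remaining lifetime, assuming they are installed on the submit host:

     $ voms-proxy-info -all         # attributes and time left
     $ grid-proxy-info -timeleft    # seconds remaining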

  27. Common Errors: CE No Longer Knows About the Job
     - CE admin may forcibly remove job files
     - Condor-G is obsessive about not leaving orphaned jobs
     - May need to take extra steps to convince Condor-G that the remote job is gone
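One such extra step: if a removed job stays stuck because the remote side can no longer be contacted, condor_rm's -forcex option removes the local job object outright (the job ID is hypothetical):

     $ condor_rm 42.0
     $ condor_rm -forcex 42.0    # force removal if the job is stuck in the X state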

  28. Thank You!

  29. Throttles and Timeouts
     - Limits that prevent Condor-G or CEs from being overwhelmed by large numbers of jobs
     - Defaults are fairly conservative

  30. Throttles and Timeouts
     - GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 1000
       You can increase this to 10,000 or more
     - GRIDMANAGER_JOB_PROBE_INTERVAL
       Default is 60 seconds; can be decreased, but that is not recommended

  31. Throttles and Timeouts
     - GRIDMANAGER_MAX_PENDING_REQUESTS = 50
       Number of commands sent to a GAHP in parallel; can be increased to a couple hundred
     - GRIDMANAGER_GAHP_CALL_TIMEOUT = 300
       Time after which a GAHP command is considered failed; may need to be lengthened if pending requests is increased
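Taken together, a condor_config sketch for a busy submit host might look like the following; the values are illustrative assumptions, not recommendations from the slides:

     # condor_config on the Condor-G submit host
     GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 10000
     GRIDMANAGER_JOB_PROBE_INTERVAL = 60
     GRIDMANAGER_MAX_PENDING_REQUESTS = 200
     # lengthen the GAHP timeout to match the higher pending-request limit
     GRIDMANAGER_GAHP_CALL_TIMEOUT = 600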
