HTCondor Grid Universe: Job Management Capabilities and Submit File Configuration

HTCondor's Grid Universe

ISGC 2019
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin-Madison

HTCondor Grid Universe Jobs
HTCondor for the grid:
- Same job management capabilities as a local HTCondor pool
- Use other scheduling systems' resources
Sometimes referred to as "Condor-G"
 
Job Management Interface

- Local, persistent job queue
- Job policy expressions
- Job activity logs
- Workflows (DAGMan)
- Fault tolerance
 
Grid Universe

All handled in your submit file.
Supports a number of "back end" types:
- Globus GRAM
- CREAM
- NorduGrid ARC
- HTCondor
- PBS
- LSF
- SGE
- SLURM
- UNICORE
- SSH
- Amazon EC2
- Google Compute Engine
- Microsoft Azure
- BOINC
- Future: Kubernetes ???
 
Grid is a Misnomer

The grid universe is used any time a job is submitted to a different job management/queuing system:
- Grid services: GRAM, CREAM, ARC
- A different HTCondor schedd
- Local batch systems: PBS, LSF, SLURM, SGE
- Cloud services: EC2, Google Compute Engine, Azure
- Volunteer computing: BOINC
 
Grid Universe Submit File

http://htcondor.org/manual/v8.8/TheGridUniverse.html

Need a few lines in the submit file:
- universe = grid
- grid_resource = <grid type> <server> …
(These specify the service type and the server location.)

Examples of grid_resource values:
- condor submit.foo.edu cm.foo.edu
- gt5 portal.foo.edu/jobmanager-pbs
- cream creamce.foo.edu/cream-pbs-glow
- nordugrid arc.foo.edu
- batch pbs
 
Vanilla Universe Submit File

executable = my_program.exe
input = in.dat
output = out.dat
queue 1
 
Grid Universe Submit File

universe = grid
grid_resource = batch slurm
executable = my_program.exe
input = in.dat
output = out.dat
queue 1
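
The same pattern covers the other back ends. For instance, the first grid_resource example above delegates a job to a different HTCondor schedd; a minimal sketch, reusing the placeholder hosts submit.foo.edu (the remote schedd) and cm.foo.edu (its central manager) from the examples:

universe = grid
grid_resource = condor submit.foo.edu cm.foo.edu
executable = my_program.exe
input = in.dat
output = out.dat
queue 1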
 
Differences from Normal HTCondor Jobs

Vanilla universe:
- Resource acquisition
- Job sent directly to a machine that can start running it immediately

Grid universe:
- Job delegation
- Job sent to an alternate scheduling system
- Job may sit idle while resources are available elsewhere
 
Differences from Normal HTCondor Jobs

No matchmaking:
- Specify destination with GridResource
- No Requirements, Rank
- Resource requests often ignored

Run-time features unavailable:
- condor_ssh_to_job
- condor_tail
- condor_chirp
 
Differences from Normal HTCondor Jobs

Information about job execution may be lacking:
- Job exit code
- Resource usage
 
Grid Job Attributes

- GridJobId: job ID from the remote server
- GridJobStatus: job status from the remote server
- LastRemoteStatusUpdate: time the job status was last checked
- GridResourceUnavailableTime: time when the remote server went down
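
These are attributes in the job ClassAd, so they can be read with condor_q's autoformat option; for example (1234.0 is a placeholder job ID):

condor_q 1234.0 -af GridJobId GridJobStatus LastRemoteStatusUpdate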
 
Two common uses for Grid Universe Jobs

1. Pilot factories (e.g. glideinWMS) submit a grid universe job to send pilot jobs to a remote site (leveraging the common interface)
2. HTCondor-CE "routes" incoming jobs to a site's local scheduler by submitting a grid universe job
 
Gridmanager Daemon

- Runs under the schedd
- Similar to the shadow
- Handles all management of grid jobs
- A single instance manages all grid jobs for a user
 
Grid ASCII Helper Protocol (GAHP)

- Runs under the gridmanager
- Encapsulates grid client libraries in a separate process
- Simple ASCII protocol
- Easy to use client libraries when they can't be linked directly with the gridmanager

http://htcondor.org/gahp/
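
The protocol is line-oriented: the gridmanager writes one command per line to the GAHP's stdin and reads replies on its stdout, with "S" marking success and "E" an error; long-running operations return immediately and report their outcomes later via the RESULTS command. A schematic of the exchange style (illustrative only, not verbatim; see the spec at the URL above for the real grammar):

ASYNC_MODE_ON
S
RESULTS
S 0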
 
How It Works

[Diagram, built up across six slides: a Pilot Factory's schedd holds 600 grid jobs; the schedd starts a Gridmanager, which launches a GAHP; the GAHP submits the jobs to the Grid Resource's CREAM CE, which hands them to LSF; in the final frame a User Job runs at the resource.]
 
HTCondor-CE use of Grid Universe Jobs

[Diagram: on the CE host, (1) a grid job arrives at the CE schedd; (2) the Job Router creates a routed job; (3) the routed job starts a Gridmanager; (4) the Gridmanager submits to the site's batch system via qsub, sbatch, etc.]
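
The route itself is ordinary Job Router configuration on the CE. A minimal sketch in the ClassAd-based JOB_ROUTER_ENTRIES syntax HTCondor-CE used at the time (the route name is illustrative; HTCondor-CE's JOB_ROUTER_DEFAULTS fills in the rest of the routed job):

JOB_ROUTER_ENTRIES @=jre
[
  name = "Local_Slurm";
  GridResource = "batch slurm";
]
@jre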
 
Network Connectivity

- Outbound connections only for most job types
- GRAM requires incoming connections
  - Need 2 open ports per <user, X509 DN> pair
 
Authentication

The GAHP acts as the user with the remote service.

Destination service is local:
- UID-based authentication
- E.g. PBS, LSF, SLURM, HTCondor

Destination service is remote:
- X.509 proxy (possibly with VOMS attributes)
- Refreshed proxies are automatically forwarded
- E.g. HTCondor CE, GRAM, CREAM, ARC CE
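
In the remote case, the proxy is named in the submit file; a minimal sketch reusing the nordugrid example from earlier (the proxy path is a typical but illustrative location):

universe = grid
grid_resource = nordugrid arc.foo.edu
x509userproxy = /tmp/x509up_u1000
executable = my_program.exe
queue 1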
 
HELD Status

- Jobs will be held when HTCondor needs help with an error
- On release, HTCondor will retry
- The reason for the hold is saved in the job ad and the user log
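
A typical inspect-and-retry sequence (1234.0 is a placeholder job ID): condor_q -hold shows the saved hold reason, and condor_release puts the job back in the queue so HTCondor retries it.

condor_q -hold 1234.0
condor_release 1234.0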
 
Common Errors

Authentication:
- Hold reason may be misleading
- User may not be authorized by the CE
- Condor-G may not have access to all Certificate Authority files
- User's proxy may have expired
 
Common Errors

CE no longer knows about the job:
- CE admin may forcibly remove job files
- Condor-G is obsessive about not leaving orphaned jobs
- May need to take extra steps to convince Condor-G that the remote job is gone (see the sketch below)
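
If the remote job truly is gone, one such step is to force removal locally (a sketch; 1234.0 is a placeholder job ID). A plain condor_rm waits for Condor-G to confirm remote cleanup; the -forcex flag forces removal of a job stuck in the removed (X) state:

condor_rm 1234.0
condor_rm -forcex 1234.0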
 
Thank You!
 
Throttles and Timeouts

- Limits that prevent Condor-G or CEs from being overwhelmed by large numbers of jobs
- Defaults are fairly conservative
 
Throttles and Timeouts

GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 1000
- Can be increased to 10,000 or more

GRIDMANAGER_JOB_PROBE_INTERVAL
- Default is 60 seconds
- Can be decreased, but not recommended
 
Throttles and Timeouts

GRIDMANAGER_MAX_PENDING_REQUESTS = 50
- Number of commands sent to a GAHP in parallel
- Can be increased to a couple hundred

GRIDMANAGER_GAHP_CALL_TIMEOUT = 300
- Time in seconds after which a GAHP command is considered failed
- May need to be lengthened if the pending-requests limit is increased
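
Pulled together, a condor_config fragment that loosens these limits along the lines suggested above (the exact values are illustrative):

# Allow more submitted jobs per remote resource (default 1000)
GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 10000
# Status probe interval; decreasing below the 60-second default is not recommended
GRIDMANAGER_JOB_PROBE_INTERVAL = 60
# More GAHP commands in flight at once (default 50)...
GRIDMANAGER_MAX_PENDING_REQUESTS = 200
# ...so allow each command more time before it is considered failed (default 300)
GRIDMANAGER_GAHP_CALL_TIMEOUT = 600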