HTCondor Grid Universe: Job Management Capabilities and Submit File Configuration

HTCondor's Grid Universe

ISGC 2019
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin-Madison

HTCondor Grid Universe Jobs
HTCondor for the grid:
- Same job management capabilities as a local HTCondor pool
- Use other scheduling systems' resources
Sometimes referred to as "Condor-G"
 
Job Management Interface

- Local, persistent job queue
- Job policy expressions
- Job activity logs
- Workflows (DAGMan)
- Fault tolerance
 
Grid Universe

All handled in your submit file.
Supports a number of "back end" types:
- Globus GRAM
- CREAM
- NorduGrid ARC
- HTCondor
- PBS
- LSF
- SGE
- SLURM
- UNICORE
- SSH
- Amazon EC2
- Google Compute Engine
- Microsoft Azure
- BOINC
- Future: Kubernetes ???
 
Grid is a Misnomer

The grid universe is used any time a job is submitted to a different job management/queuing system:
- Grid services: GRAM, CREAM, ARC
- A different HTCondor schedd
- Local batch systems: PBS, LSF, SLURM, SGE
- Cloud services: EC2, Google Compute Engine, Azure
- Volunteer computing: BOINC
 
Grid Universe Submit File

http://htcondor.org/manual/v8.8/TheGridUniverse.html

Need a few lines in the submit file:
- universe = grid
- grid_resource = <grid type> <server> …
(These specify the service type and the server location.)

Examples of grid_resource values:
- condor submit.foo.edu cm.foo.edu
- gt5 portal.foo.edu/jobmanager-pbs
- cream creamce.foo.edu/cream-pbs-glow
- nordugrid arc.foo.edu
- batch pbs
 
Vanilla Universe Submit File

executable = my_program.exe
input = in.dat
output = out.dat
queue 1
 
Grid Universe Submit File

universe = grid
grid_resource = batch slurm
executable = my_program.exe
input = in.dat
output = out.dat
queue 1
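
The same pattern covers the other back ends. For instance, the first grid_resource example above delegates a job to a different HTCondor schedd; a minimal sketch, reusing the placeholder hosts submit.foo.edu (the remote schedd) and cm.foo.edu (its central manager) from the examples:

universe = grid
grid_resource = condor submit.foo.edu cm.foo.edu
executable = my_program.exe
input = in.dat
output = out.dat
queue 1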
 
Differences from Normal HTCondor Jobs

Vanilla universe:
- Resource acquisition
- Job sent directly to a machine that can start running it immediately

Grid universe:
- Job delegation
- Job sent to an alternate scheduling system
- Job may sit idle while resources are available elsewhere
 
Differences from Normal HTCondor Jobs

No matchmaking:
- Specify destination with GridResource
- No Requirements, Rank
- Resource requests often ignored

Run-time features unavailable:
- condor_ssh_to_job
- condor_tail
- condor_chirp
 
Differences from Normal HTCondor Jobs

Information about job execution may be lacking:
- Job exit code
- Resource usage
 
Grid Job Attributes

- GridJobId: job ID from the remote server
- GridJobStatus: job status from the remote server
- LastRemoteStatusUpdate: time the job status was last checked
- GridResourceUnavailableTime: time when the remote server went down
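
These are attributes in the job ClassAd, so they can be read with condor_q's autoformat option; for example (1234.0 is a placeholder job ID):

condor_q 1234.0 -af GridJobId GridJobStatus LastRemoteStatusUpdate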
 
Two common uses for Grid Universe Jobs

1. Pilot factories (e.g. glideinWMS) submit a grid universe job to send pilot jobs to a remote site (leveraging the common interface)
2. HTCondor-CE "routes" incoming jobs to a site's local scheduler by submitting a grid universe job
 
Gridmanager Daemon

- Runs under the schedd
- Similar to the shadow
- Handles all management of grid jobs
- A single instance manages all grid jobs for a user
 
Grid ASCII Helper Protocol (GAHP)

- Runs under the gridmanager
- Encapsulates grid client libraries in a separate process
- Simple ASCII protocol
- Easy to use client libraries when they can't be linked directly with the gridmanager

http://htcondor.org/gahp/
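
The protocol is line-oriented: the gridmanager writes one command per line to the GAHP's stdin and reads replies on its stdout, with "S" marking success and "E" an error; long-running operations return immediately and report their outcomes later via the RESULTS command. A schematic of the exchange style (illustrative only, not verbatim; see the spec at the URL above for the real grammar):

ASYNC_MODE_ON
S
RESULTS
S 0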
 
How It Works

[Diagram, built up across six slides: a Pilot Factory's schedd holds 600 grid jobs; the schedd starts a Gridmanager, which launches a GAHP; the GAHP submits the jobs to the Grid Resource's CREAM CE, which hands them to LSF; in the final frame a User Job runs at the resource.]
 
HTCondor-CE use of Grid Universe Jobs

[Diagram: on the CE host, (1) a grid job arrives at the CE schedd; (2) the Job Router creates a routed job; (3) the routed job starts a Gridmanager; (4) the Gridmanager submits to the site's batch system via qsub, sbatch, etc.]
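
The route itself is ordinary Job Router configuration on the CE. A minimal sketch in the ClassAd-based JOB_ROUTER_ENTRIES syntax HTCondor-CE used at the time (the route name is illustrative; HTCondor-CE's JOB_ROUTER_DEFAULTS fills in the rest of the routed job):

JOB_ROUTER_ENTRIES @=jre
[
  name = "Local_Slurm";
  GridResource = "batch slurm";
]
@jre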
 
Network Connectivity

- Outbound connections only for most job types
- GRAM requires incoming connections
  - Need 2 open ports per <user, X509 DN> pair
 
Authentication

The GAHP acts as the user with the remote service.

Destination service is local:
- UID-based authentication
- E.g. PBS, LSF, SLURM, HTCondor

Destination service is remote:
- X.509 proxy (possibly with VOMS attributes)
- Refreshed proxies are automatically forwarded
- E.g. HTCondor CE, GRAM, CREAM, ARC CE
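
In the remote case, the proxy is named in the submit file; a minimal sketch reusing the nordugrid example from earlier (the proxy path is a typical but illustrative location):

universe = grid
grid_resource = nordugrid arc.foo.edu
x509userproxy = /tmp/x509up_u1000
executable = my_program.exe
queue 1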
 
HELD Status

- Jobs will be held when HTCondor needs help with an error
- On release, HTCondor will retry
- The reason for the hold is saved in the job ad and the user log
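
A typical inspect-and-retry sequence (1234.0 is a placeholder job ID): condor_q -hold shows the saved hold reason, and condor_release puts the job back in the queue so HTCondor retries it.

condor_q -hold 1234.0
condor_release 1234.0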
 
Common Errors

Authentication:
- Hold reason may be misleading
- User may not be authorized by the CE
- Condor-G may not have access to all Certificate Authority files
- User's proxy may have expired
 
Common Errors

CE no longer knows about the job:
- CE admin may forcibly remove job files
- Condor-G is obsessive about not leaving orphaned jobs
- May need to take extra steps to convince Condor-G that the remote job is gone (see the sketch below)
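
If the remote job truly is gone, one such step is to force removal locally (a sketch; 1234.0 is a placeholder job ID). A plain condor_rm waits for Condor-G to confirm remote cleanup; the -forcex flag forces removal of a job stuck in the removed (X) state:

condor_rm 1234.0
condor_rm -forcex 1234.0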
 
Thank You!
 
Throttles and Timeouts

- Limits that prevent Condor-G or CEs from being overwhelmed by large numbers of jobs
- Defaults are fairly conservative
 
Throttles and Timeouts

GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 1000
- Can be increased to 10,000 or more

GRIDMANAGER_JOB_PROBE_INTERVAL
- Default is 60 seconds
- Can be decreased, but not recommended
 
Throttles and Timeouts

GRIDMANAGER_MAX_PENDING_REQUESTS = 50
- Number of commands sent to a GAHP in parallel
- Can be increased to a couple hundred

GRIDMANAGER_GAHP_CALL_TIMEOUT = 300
- Time in seconds after which a GAHP command is considered failed
- May need to be lengthened if the pending-requests limit is increased
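
Pulled together, a condor_config fragment that loosens these limits along the lines suggested above (the exact values are illustrative):

# Allow more submitted jobs per remote resource (default 1000)
GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 10000
# Status probe interval; decreasing below the 60-second default is not recommended
GRIDMANAGER_JOB_PROBE_INTERVAL = 60
# More GAHP commands in flight at once (default 50)...
GRIDMANAGER_MAX_PENDING_REQUESTS = 200
# ...so allow each command more time before it is considered failed (default 300)
GRIDMANAGER_GAHP_CALL_TIMEOUT = 600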