Overview of Latest HTCondor Features and Enhancements

What's new in HTCondor? What's coming?

HTCondor Week 2017
Madison, WI -- May 3, 2017

Todd Tannenbaum
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin-Madison
 
Challenge Areas

“...we have identified six key challenge areas that we believe will drive HTC technologies innovation in the next five years.”

Evolving resource acquisition models
Hardware complexity
Widely disparate use cases
Data intensive computing
Black-box applications
Scalability
Release Timeline

Stable Series
  HTCondor v8.6.x - introduced Jan 2017
  Currently at v8.6.2
  (Last year at v8.4.6)
Development Series (should be 'new features' series)
  HTCondor v8.7.x
  Currently at v8.7.1
  (Last year at v8.5.4)
 
Enhancements in HTCondor v8.4, discussed last year

Scalability and stability
  Goal: 200k slots in one pool, 10 schedds managing 400k jobs
Introduced Docker Job Universe
IPv6 support
Tool improvements, esp. condor_submit
Encrypted Job Execute Directory
Periodic application-layer checkpoint support in Vanilla Universe
Submit requirements
New RPM / DEB packaging
Systemd / SELinux compatibility
 
Some enhancements in HTCondor v8.6

Page 790
 
Enabled by default and/or easier to configure

Enabled by default: shared port, cgroups, IPv6
  Have both IPv4 and v6? Prefer IPv4 for now
Configured by default: Kernel tuning
Easier to configure: Enforce slot sizes (see the config sketch below)
  use policy: preempt_if_cpus_exceeded
  use policy: hold_if_cpus_exceeded
  use policy: preempt_if_memory_exceeded
  use policy: hold_if_memory_exceeded
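As a rough sketch (the policy names come from the slide; the metaknob spelling should be checked against your release's manual), an execute-node condor_config fragment using two of these could look like:

  # condor_config fragment on an execute node
  # Put a job on hold if it uses more memory than its slot provisioned
  use POLICY : HOLD_IF_MEMORY_EXCEEDED
  # Preempt (evict) a job that uses more CPU cores than it requested
  use POLICY : PREEMPT_IF_CPUS_EXCEEDED

The point of the metaknobs is that admins no longer hand-write the underlying preempt/hold expressions.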
 
Easier to retry jobs if you shower

Dew drinker? Use old way:
  executable = foo.exe
  on_exit_remove = \
    (ExitBySignal == False && \
     ExitCode == 0) || \
     NumJobStarts >= 3
  queue

Shower regularly? Use new way:
  executable = foo.exe
  max_retries = 3
  queue
 
New condor_q default output

Only show jobs owned by the user
  disable with -allusers
Batched output (-batch, -nobatch)
New default output of condor_q will show a summary of the current user's jobs:

---- Schedd: submit-3.batlab.org : <128.104.100.22:50004?... @ 05/02/17 11:19:41
OWNER    BATCH_NAME         SUBMITTED   DONE   RUN    IDLE   HOLD   TOTAL  JOB_IDS
tannenba CMD: /bin/python  4/27 11:58    463    87   19450      5   20000  9.463-467
tannenba mydag.dag+10      4/27 19:13   9824     1      _       _    9825  10.0

29900 jobs; 10287 completed, 0 removed, 19450 idle, 88 running, 5 held, 0 suspended
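For example, using the flags named above, the older one-line-per-job view across all users can still be requested explicitly:

  condor_q -allusers -nobatch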
 
Schedd Job Transforms

Transformation of the job ad upon submit
Allow the admin to have the schedd securely add/edit/validate job attributes upon job submission
  Can also set attributes as immutable by the user, e.g. cannot edit w/ condor_qedit or chirp
Get rid of condor_submit wrapper scripts!
One use case: insert accounting group attributes based upon the submitter (see the sketch below)
  use feature: AssignAccountingGroup( filename )
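A minimal sketch of that use case, assuming the feature metaknob is spelled as on the slide; the file path and map-file layout here are illustrative assumptions, not verbatim from the talk:

  # condor_config on the submit host (schedd)
  use FEATURE : AssignAccountingGroup(/etc/condor/acct_groups.map)

  # /etc/condor/acct_groups.map (hypothetical contents)
  # each line: *  <submitting owner>  <accounting group>
  * alice  group_physics.alice
  * bob    group_chemistry.bob

Jobs submitted by alice would then enter the queue with their accounting-group attribute already set, and per the slide the transform can mark that attribute immutable so the user cannot change it with condor_qedit.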
 
Docker Universe Enhancements

Docker jobs get usage updates (i.e. network usage) reported in the job classad
Admin can add additional volumes that all docker universe jobs get (see the sketch below)
  Why? Large shared data
Condor Chirp support
Also new knob: DOCKER_DROP_ALL_CAPABILITIES
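A sketch of what the admin-side configuration might look like on a Docker-enabled execute node; only DOCKER_DROP_ALL_CAPABILITIES is named on the slide, and the volume knob names below are assumptions to be checked against the manual for your version:

  # condor_config on a Docker execute node
  # Define a host directory to be mounted (read-only) into every docker universe job
  DOCKER_VOLUMES = SHARED_DATA
  DOCKER_VOLUME_DIR_SHARED_DATA = /mnt/shared_data:/shared:ro
  DOCKER_MOUNT_VOLUMES = SHARED_DATA

  # Drop all Linux capabilities inside job containers
  DOCKER_DROP_ALL_CAPABILITIES = True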
 
HTCondor Singularity Integration

What is Singularity?  http://singularity.lbl.gov/
Like Docker but...
  No root-owned daemon process, just a setuid binary
  No setuid required (post RHEL7)
  Easy access to host resources incl. GPU, network, file systems
Sounds perfect for glideins/pilots!
  Maybe no need for UID switching
 
And lots more

JSON output from condor_status, condor_q, condor_history via the "-json" flag
condor_history -since <jobid or expression>
Config file syntax enhancements (includes, conditionals, ...)
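For instance (the job id and the config snippet below are illustrative; the include/conditional syntax should be verified against the manual for your release):

  condor_q -json
  condor_history -since 1234.0

  # condor_config using the newer includes and conditionals
  if version >= 8.6.0
    include : /etc/condor/config.d/site-extras.conf
  endif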
 
Some enhancements in HTCondor v8.7 and beyond
 
Smarter and Faster Schedd

User accounting information moved into ads in the Collector
  Enables the schedd to move claims across users
Non-blocking authentication, smarter updates to the collector, faster ClassAd processing
Late materialization of jobs in the schedd to enable submission of very large sets of jobs (see the sketch below)
  More jobs materialized once the number of idle jobs drops below a threshold (like DAGMan throttling)
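A hypothetical submit-file sketch of late materialization; the keyword names are assumptions (the feature was still evolving in the v8.7 series), so treat this as an illustration of the idea rather than exact syntax:

  executable = foo.exe
  arguments  = $(ProcId)
  # keep roughly 2000 idle jobs; materialize more as they drain (assumed keywords)
  max_idle        = 2000
  max_materialize = 50000
  queue 1000000

The schedd keeps the million-proc cluster as a compact description and only turns rows into real job ads as needed, which is what makes very large submissions cheap.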
 
Grid Universe

Reliable, durable submission of a job to a remote scheduler
Popular way to send pilot jobs, key component of HTCondor-CE
Supports many “back end” types:
  HTCondor
  PBS
  LSF
  Grid Engine
  Google Compute Engine
  Amazon EC2
  OpenStack
  Cream
  NorduGrid ARC
  BOINC
  Globus: GT2, GT5
  UNICORE
 
 
Add Grid Universe support for SLURM, Azure, OpenStack, Cobalt

Speak native SLURM protocol (see the submit sketch below)
  No need to install PBS compatibility package
Speak to Microsoft Azure
Speak OpenStack’s NOVA protocol
  No need for EC2 compatibility layer
Speak to Cobalt Scheduler
  Argonne Leadership Computing Facilities

Jaime: Grid Jedi
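A rough sketch of a grid-universe submit file targeting SLURM; the "batch slurm" grid resource string follows the pattern used for the other batch back ends, but treat its exact form as an assumption and check the manual:

  universe      = grid
  grid_resource = batch slurm
  executable    = pilot.sh
  output        = pilot.out
  error         = pilot.err
  log           = pilot.log
  queue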
 
Elastically grow your pool into the Cloud: condor_annex

Start virtual machines as HTCondor execute nodes in public clouds that join your pool
Leverage efficient AWS APIs such as Auto Scaling Groups and Spot Fleets
Secure mechanism for cloud instances to join the HTCondor pool at home institution
 
Without condor_annex

+ Decide which type(s) of instances to use.
+ Pick a machine image, install HTCondor.
+ Configure HTCondor:
  to securely join the pool (coordinate with pool admin)
  to shut down the instance when not running a job (because of the long tail or a problem somewhere)
+ Decide on a bid for each instance type, according to its location (or pay more).
+ Configure the network and firewall at Amazon.
+ Implement a fail-safe in the form of a lease to make sure the pool does eventually shut itself off.
+ Automate response to being out-bid.
With condor_annex

Goal: simplified to a single command:

  condor_annex -annex-name 'ProfNeedsMoore_Lab' \
    -count \
    --instances 1000
Live demo of late job materialization and HTCondor Annex to EC2...
 
HTCondor and Kerberos

HTCondor currently allows you to authenticate users and daemons using Kerberos
However, it does NOT currently provide any mechanism to provide a Kerberos credential for the actual job to use on the execute slot
 
HTCondor and Kerberos/AFS

So we are adding support to launch jobs with Kerberos tickets / AFS tokens
Details:
  HTCondor 8.5.x allows an opaque security credential to be obtained by condor_submit and stored securely alongside the queued job (in the condor_credd daemon)
  This credential is then moved with the job to the execute machine
  Before the job begins executing, the condor_starter invokes a call-out to do optional transformations on the credential
 
DAGMan Improvements

ALL_NODES (see the DAG sketch below)
  RETRY ALL_NODES 3
Flexible DAG file command order
Splice Pin connections
  Allows more flexible parent/child relationships between nodes within splices
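For example (a minimal DAG file; the node names and submit files are illustrative), ALL_NODES lets a single line apply a setting to every node instead of repeating a RETRY per node:

  # diamond.dag
  JOB  A  a.sub
  JOB  B  b.sub
  JOB  C  c.sub
  PARENT A CHILD B C
  RETRY ALL_NODES 3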
 
New condor_status default output

Only show one line of output per machine
Can try now in v8.5.4+ with the "-compact" option
The "-compact" option will become the new default once we are happy with it

Machine      Platform     Slots Cpus Gpus  TotalGb FreCpu  FreeGb  CpuLoad ST

gpu-1        x64/SL6          8    8    2    15.57      0     0.44    1.90 Cb
gpu-2        x64/SL6          8    8    2    15.57      0     0.57    1.87 Cb
gpu-3        x64/SL6          8    8    4    47.13      0    16.13    0.85 Cb
matlab-build x64/SL6          1   12         23.45     11    23.33    0.00 **
mem1         x64/SL6         32   80       1009.67      0   160.17    1.00 Cb
 
 
More backends for condor_gangliad

In addition to (or instead of) sending to Ganglia, aggregate and make available in JSON format over HTTP
  condor_gangliad renamed to condor_metricd
View some basic historical usage out-of-the-box by pointing a web browser at the central manager (a modern CondorView)...
Or upload to InfluxDB or Graphite for Grafana
 
Potential Future Docker Universe Features?

Advertise images already cached on the machine?
Support for condor_ssh_to_job?
Package and release HTCondor into Docker Hub?
Network support beyond NAT?
Run containers as root??!?!?
Automatic checkpoint and restart of containers! (via CRIU)
 
The future

Working with the cloud: elasticity into the cloud.
Scalability.
More manageable, monitoring.
Containers.
Data, incl. storage management options.
More Python interfaces.
 
Thank You!

P.S. Interested in working on HTCondor full time?
Talk to me! We are hiring!
htcondor-jobs@cs.wisc.edu