Scaling Condor on XSEDE for LIGO - Collaborative Computing Project

Scaling Condor on XSEDE for
LIGO
Peter Couvares
Syracuse University
LIGO Scientific Collaboration
Who am I?  What is LIGO?
Former Condor Team member (‘99-’08).
Now at Syracuse University focused on
distributed computing problems for the LIGO
Scientific Collaboration, and fostering a research
computing community at SU more generally.
LIGO (the Laser Interferometer Gravitational-
Wave Observatory) is a large scientific
experiment to detect cosmic gravitational waves
and harness them for scientific research.
http://ligo.org/
The Project
The Charge:
demonstrate whether LIGO can effectively utilize
XSEDE resources for its large-scale computing.
(And if not, why?)
The Challenge:
LIGO’s existing computing model doesn’t map
perfectly to XSEDE.
Four Talks
The political story (NSF)
The cultural story (LIGO + TACC)
The architectural story (what we did)
The technical story (how we did it)
The Political Story
LIGO plans to buy millions of $$$ of
computers later this year to be ready for the
Advanced LIGO detectors as they come online.
LIGO has always done most of its computing
“in-house” on dedicated LIGO clusters, with
good results – so we haven’t tried very hard
(at least not lately) to utilize opportunistic
resources we don’t manage ourselves*.
* notable exception = E@H
The Political Story
Before writing us a check, the NSF wanted to
understand why we only planned to buy our own
private clusters, when some other large NSF projects
are successfully using (or contributing to) shared
resources.
Given the size of the check, the NSF also probably
wanted to know whether we were doing our
computing sensibly, and weren’t building something
unnecessarily inefficient or eccentric.
WARNING: this is my speculation based on fourth-hand
accounts of other people’s guesses.  I could be wrong.
The Cultural Story
The NSF asked LIGO to see if it could run some
or all of its large-scale computing work on
XSEDE.
Stampede was the closest thing to an HTC
cluster in XSEDE, so the NSF told LIGO and
TACC to work together on it.
LIGO View of XSEDE Resources
LIGO View of LIGO Computing
Shotgun Wedding
Shotgun Wedding
LIGO: “We don’t need a car with 12 cylinders
and molybdenum brakes to commute to work.
These Hyundais we’ve got lined up are fine.”
TACC: “You need how many cars?!?”
Shotgun Wedding
LIGO: So, we just need Condor everywhere, no
firewall, a bunch of yum repos and RPMs
installed on all your machines, single-sign on
for our users using their LIGO.ORG credentials,
and the ability to run VMs as jobs.
TACC: Uh, we don’t normally do any of that
stuff.  And no way are you running VM jobs.
Shotgun Wedding
TACC: Have you optimized your code?
LIGO: Who do they think we are, amateurs?
Have we optimized our code!  Harrumph!
TACC: Here, look at these FFT results.
LIGO: Oh.  Uh… wow, that’s faster.  Nice!
The Cultural Story
Like any shotgun wedding, neither party was
thrilled to be at the altar under duress.
But we got to work, and quickly dropped the
grumpiness.
The TACC staff turn out to be great to work with,
have all kinds of valuable expertise LIGO can use,
and have been extremely helpful.
Despite the impedance mismatch, together we
succeeded in running a production LIGO
workflow, at scale, on Stampede.
The Architectural Story
Key points of contrast between the LDG and Stampede:
Central NFS fileservers (LDG) vs. Lustre DFS (Stampede).
Persistent compute nodes w/state (LDG) vs. transient/stateless
execute nodes (Stampede).
LDG uses persistent local disk for distributed checkpointing
and Condor logging.
NFS for job input and output, local scratch disks for runtime file
I/O (LDG) vs. Lustre for everything (Stampede).
Condor batch queue system (LDG) vs. SLURM (Stampede).
Scientific Linux 6.1 (LDG) vs. CentOS 6.5 (Stampede).
Software pre-installed in system locations on dedicated resources
(LDG) vs. local builds on shared resources (Stampede).
Long-running jobs (LDG) vs. 48h maximum (Stampede).
Design Choice
Make LDG look more like Stampede
or
Make Stampede look more like LDG
Given our experience porting a large body of LDG software and
workflows to new OS platforms and versions, we knew that would take
more time than we had, so we started with the latter and
worked back the other way when it was necessary or easier.
An LDG Site “Overlay” on Stampede
LDG Overlay on Stampede
Glide-in Condor pool via SLURM
Persistent Condor central manager
Persistent login/submit machine
Make heavy use of Condor standard universe checkpointing to handle
mismatch between SLURM scheduling policies and long-running analysis
jobs with unpredictable runtimes.
Pre-install LIGO software (RPMs) site-wide on Stampede.
Use LDR and Globus for data transfer to Lustre via GridFTP
Set up LIGO web services in dedicated VM
Data discovery service
LIGO.ORG protected web site to post analysis results
Enable access to LDG with XSEDE credentials and vice versa
Stampede was an early XSEDE adopter of CILogon
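A rough sketch of the walltime arithmetic behind the checkpointing choice: a job that outlives one 48h SLURM allocation has to checkpoint and resume inside a later glide-in. The function name, the per-lease margin, and the example runtimes are hypothetical illustrations, not figures from the project.

```python
import math

SLURM_MAX_WALLTIME_H = 48  # Stampede job limit quoted in the LDG/Stampede contrast


def glideins_needed(job_runtime_h: float,
                    margin_h: float = 1.0,
                    walltime_h: float = SLURM_MAX_WALLTIME_H) -> int:
    """Number of successive glide-in leases a checkpointable job needs.

    margin_h is an assumed allowance per lease for writing the checkpoint
    and restarting; it is illustrative, not a measured value.
    """
    usable = walltime_h - margin_h
    if usable <= 0:
        raise ValueError("margin must be smaller than the walltime limit")
    if job_runtime_h <= 0:
        return 0
    return math.ceil(job_runtime_h / usable)


if __name__ == "__main__":
    for hours in (10, 47, 100):
        print(hours, "h of work ->", glideins_needed(hours), "lease(s)")
```

Because runtimes are unpredictable, the real count is only known after the fact; the point is simply that any job longer than one lease depends on checkpoint/resume working at scale.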
The Goal
Validate the ability to transfer simulated aLIGO data from a
LIGO Engineering Run to Stampede and confirm that the CBC
offline detection pipeline can run and generate the same
results on Stampede as the LDG.
Select one LDG site (Syracuse) for detailed comparison runs.
Start with the Initial LIGO (iLIGO) pipeline and well-understood input data.
Perform correctness and scaling tests
Optimize performance
Switch to the aLIGO pipeline currently being developed
Perform longer-running stability tests
In the background, allow for other small-scale LIGO tests
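The slides do not say how "the same results" was checked. One simple first pass, sketched here under that caveat, is to hash every output file from the LDG run and the Stampede run and report anything missing or different. The directory layout and function names are assumptions, and a byte-for-byte match may be stricter than a floating-point pipeline can guarantee.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def _digests(root: Path) -> dict:
    """Map each file's path (relative to root) to its digest."""
    return {p.relative_to(root): file_digest(p)
            for p in root.rglob("*") if p.is_file()}


def compare_outputs(ldg_dir: Path, stampede_dir: Path) -> dict:
    """Compare two output trees (hypothetical layout: same relative paths)."""
    ldg, stp = _digests(ldg_dir), _digests(stampede_dir)
    return {
        "missing_on_stampede": sorted(str(p) for p in ldg.keys() - stp.keys()),
        "missing_on_ldg": sorted(str(p) for p in stp.keys() - ldg.keys()),
        "differ": sorted(str(p) for p in ldg.keys() & stp.keys()
                         if ldg[p] != stp[p]),
    }
```

An empty report on all three keys means the two runs produced identical files; anything listed under "differ" would still need a domain-level comparison before calling it a real discrepancy.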
Round 0
Set up systems for testing:
Install LIGO software including Condor – PASS
After a few iterations, official releases of LIGO software from package
repositories were installed on all Stampede systems.
Set up VM for LIGO to install and manage web services – PASS
Took a few iterations, including installing extra certificates and mailing
physical security tokens, but straightforward.
Minor change to LIGO web services authentication configuration to
handle different network topology at TACC.
Set up 10G network transfer of LIGO data via Globus and LDR using
GridFTP – PASS
Took a while to track down a performance issue due to mismatched
MTU, but eventually solved.
Manually support registration of CILogon credentials before XSEDE
deployed that during the test
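The MTU mismatch is the kind of problem a quick configuration check can surface. The sketch below assumes Linux hosts, where an interface's configured MTU is exposed under /sys/class/net; the host names in the usage example are made up, and the slides do not describe how the mismatch was actually found.

```python
from collections import Counter
from pathlib import Path


def read_mtu(iface: str, sysfs_root: str = "/sys/class/net") -> int:
    """Read the configured MTU of a local interface from Linux sysfs."""
    return int((Path(sysfs_root) / iface / "mtu").read_text().strip())


def odd_ones_out(mtus: dict) -> list:
    """Given {host: mtu}, return hosts whose MTU differs from the most common one."""
    if not mtus:
        return []
    common, _ = Counter(mtus.values()).most_common(1)[0]
    return sorted(host for host, mtu in mtus.items() if mtu != common)


if __name__ == "__main__":
    # Hypothetical values collected from each end of the transfer path.
    collected = {"ldg-gridftp": 9000, "tacc-gridftp": 9000, "router-x": 1500}
    print(odd_ones_out(collected))
```

This only compares configured values on hosts you can log in to; a mismatch on an intermediate device would still need a path-level test.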
Round 1
Analyze one day of LIGO data on Stampede
using iLIGO code:
Condor glide-in via SLURM – PASS
Data transfer via LDR – PASS
Central checkpointing – FAIL
Network firewall issues – FAIL
Round 2
Analyze two weeks of data:
Solved initial firewall problems – PASS
Improving security by moving the Condor Central
Manager to a dedicated host (VM) caused new
firewall problems – FAIL
Tried solving checkpoint scaling with
parallel checkpoint writing and central resume
– FAIL
Round 3
Analyze 6 weeks of data:
Condor code patch to support parallel checkpoint
save/restore to a shared filesystem without persistent
checkpoint servers – PASS
Scaling to 9,000 concurrent jobs with synchronous
checkpoint/resume woke up TACC support team at
inconvenient hour – MIX
>2000 load avg on submit node
Moved Condor LOCK and LOG files to /dev/shm to reduce
load on submit machine (temporary solution) – PASS
Scaling to 25k concurrent jobs hit limit of single submit
machine at 13k jobs – MIX
Future
Submit machine scalability (13k != 25k)
Several straightforward ways to solve
Submit fewer but multi-core jobs
Split work between multiple submit machines
Further investigate/enhance Condor Shadow scalability
Use a factory to manage glide-ins automatically.
What happens when we don’t have a fortuitous
alignment of OS?
Virtual Machines (not supported on Stampede)
Restrict amount of needed software (focus on production
rather than development computing)
Port necessary packages as opt-in modules
Enhance LIGO packages to be relocatable as more
appropriate for a shared resource
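The first two options for the submit-machine limit come down to simple arithmetic, sketched below. It assumes the 13k figure acts as a per-machine ceiling on concurrent jobs, that the 25k target counts single-core slots, and that the ceiling does not shift when jobs become multi-core; none of that is established by the tests, so treat the numbers as a planning aid only.

```python
import math

PER_HOST_JOB_LIMIT = 13_000  # where the single submit machine topped out
TARGET_CORES = 25_000        # concurrency the scaling test was aiming for


def submit_hosts_needed(target_cores: int = TARGET_CORES,
                        cores_per_job: int = 1,
                        per_host_limit: int = PER_HOST_JOB_LIMIT) -> int:
    """Submit machines needed to keep target_cores busy.

    Assumes each machine can manage at most per_host_limit concurrent
    jobs, regardless of how many cores each job uses (an assumption).
    """
    if cores_per_job < 1 or per_host_limit < 1:
        raise ValueError("cores_per_job and per_host_limit must be positive")
    concurrent_jobs = math.ceil(target_cores / cores_per_job)
    return math.ceil(concurrent_jobs / per_host_limit)


if __name__ == "__main__":
    print(submit_hosts_needed())                 # single-core jobs
    print(submit_hosts_needed(cores_per_job=2))  # two-core jobs
```

Under these assumptions, either two submit machines or two-core jobs on one machine would cover the 25k target.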
Lessons
It takes a lot of work to migrate a "big" computing system
to a new environment.  Something has to give.
It can be done.
Miron might say we “cheated” by statically reproducing
much of our existing environment on Stampede, rather
than bringing it with us – but we had a deadline and it’s a
big first step.
And the cultural accomplishment inside LIGO may end up being
bigger than the technical accomplishment…
XSEDE sites like TACC have incredibly valuable expertise –
you should take advantage of it.  Not being HPC-focused,
we underappreciated it before this exercise.
Lessons
Speaking for myself, not LIGO:
We should have been more optimistic, and more humble,
up front – but we got there.
The NSF should be more clear about what’s going on when
it arranges this kind of thing, to limit FUD.
While LIGO must manage its own significant computing
resources for some work (e.g., low-latency analysis,
detector characterization, software development, testing,
and training students), we can use shared resources like
Stampede today for a large fraction of our computing.
Longer-term, LIGO should develop its “grid plumbing” to
enable more flexible use of other shared resources that
can’t be made to look like LDG sites as easily as Stampede.
 
Acknowledgements
Apologies in advance to those I surely forgot…
LIGO
Stuart Anderson, Duncan Brown, Kent Blackburn, Josh
Willis, Patrick Brady, many others
TACC
Yaakoub El Khamra, Luke Wilson, John Cazes, John
McCalpin, Bill Barth, Nathaniel Mendoza, many others
Condor
Greg Thain, Alan De Smet, many others
NSF
Faceless bureaucrats who forced us out of our rut!

The project aims to evaluate the utilization of XSEDE resources by LIGO for large-scale computing tasks, with a focus on distributed computing challenges and fostering a research computing community. Various aspects such as political, cultural, and technical narratives surrounding the collaboration are explored to understand the alignment of LIGO's computing model with XSEDE infrastructure.


Uploaded on Sep 15, 2024 | 0 Views



