Swarm on the Biowulf2 Cluster

Dr. David Hoover, SCB, CIT, NIH
staff@hpc.nih.gov
September 24, 2015
What is swarm?
Wrapper script that simplifies running individual commands on the Biowulf cluster
Documentation
'man swarm'
'swarm --help'
https://hpc.nih.gov/apps/swarm.html
Differences in Biowulf 1 -> 2
Jobs are allocated cores, rather than nodes
All resources must be allocated
All swarms are job arrays
Architectural Digest
[Diagram: hardware hierarchy -- node, socket, core, cpus/hyperthreads]
Simple swarm
single-threaded swarm
[Diagram: commands 0-3 dispatched to subjobs by "swarm" and "swarm -t 2"]
Simple swarm
multi-threaded swarm
[Diagram: commands 0-3 dispatched to subjobs by "swarm -t 3" and "swarm -t 4"]
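The commands behind these diagrams are simply "hostname ; uptime". A minimal sketch of the corresponding swarmfile and submissions (the file name is illustrative):
# file.swarm -- four independent commands, one per subjob
hostname ; uptime
hostname ; uptime
hostname ; uptime
hostname ; uptime
$ swarm -f file.swarm        # single-threaded: 1 cpu per command
$ swarm -f file.swarm -t 2   # multi-threaded: 2 cpus per command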
Basic swarm usage
Create swarmfile (file.swarm)
cd /home/user/dir1 ; ls -l
cd /home/user/dir2 ; ls -l
cd /home/user/dir3 ; ls -l
cd /home/user/dir4 ; ls -l
Submit and wait for output
$ swarm -f file.swarm
1234567
$ ls
swarm_1234567_0.o    swarm_1234567_1.o    swarm_1234567_2.o
swarm_1234567_0.e    swarm_1234567_1.e    swarm_1234567_2.e
Standard swarm options
-f : swarmfile, list of commands to run
-g : GB per process (NOT PER NODE OR JOB!)
-t : threads/cpus per process (DITTO!)
command -p 4 -i /path/to/input1 -o /path/to/output1
command -p 4 -i /path/to/input2 -o /path/to/output2
command -p 4 -i /path/to/input3 -o /path/to/output3
command -p 4 -i /path/to/input4 -o /path/to/output4
$ swarm -f file.swarm -g 4 -t 4
-t auto
Autothreading still enabled
This allocates an entire node to each process
java -Xmx8g -jar /path/to/jarfile -opt1 -opt2 -opt3
$ swarm -f file.swarm -g 4 -t auto
-b, --bundle
Bundling is slightly different
swarms of > 1000 commands are autobundled
--autobundle is deprecated
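As an illustration (the command count and bundle factor here are made up): with a 100-command swarmfile, a bundle factor of 20 creates 5 subjobs, each running its 20 commands serially.
$ swarm -f file.swarm -b 20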
--singleout
Concatenate all .o and .e into single files
Not entirely reliable!
CANCELLED and TIMEOUT will lose all output
Better to use --logdir (described later)
Miscellaneous
--usecsh
--no-comment, --comment-char
--no-scripts, --keep-scripts
--debug, --devel, --verbose, --silent
--devel
 
$ swarm --devel -f file.swarm -b 4 -g 8 -v 4
------------------------------------------------------------
SWARM
├── subjob 0:  4 commands (1 cpu, 8.00 gb)
├── subjob 1:  4 commands (1 cpu, 8.00 gb)
├── subjob 2:  4 commands (1 cpu, 8.00 gb)
├── subjob 3:  4 commands (1 cpu, 8.00 gb)
------------------------------------------------------------
4 subjobs, 16 commands, 0 output file
16 commands run in 4 subjobs, each command requiring 8 gb
and 1 thread, running 4 processes serially per subjob
sbatch --array=0-3 --job-name=swarm --output=/home/user/test/swarm_%A_%a.o \
  --error=/home/user/test/swarm_%A_%a.e --cpus-per-task=1 --mem=8192 \
  --partition=norm --time=16:00:00 /spin1/swarm/user/I8DQDX4O.batch
No more .swarm directories
swarm scripts are written to a central, shared area
Each user has their own subdirectory
$ tree /spin1/swarm/user
/spin1/swarm/user
├── 2341529
│   ├── cmd.0
│   ├── cmd.1
│   ├── cmd.2
│   └── cmd.3
└── 2341529.batch
--license
--license replaces -R or --resource
$ swarm -f file.swarm --license=matlab
--module
--module now requires a comma-delimited list, rather than a space-delimited list:
$ swarm -f file.swarm --module python,samtools,bwa,tophat
--gres
--gres stands for "generic resources"
Used for allocating local disk space
Replaces --disk
$ swarm -f file.swarm --gres=lscratch:50
The above example gives 50 GB of scratch space in /lscratch/$SLURM_JOBID
See https://hpc.nih.gov/docs/userguide.html#local
-p, --processes-per-subjob
For single-threaded commands
-p can only be 1 or 2
[Diagram: "swarm -p 2" packing single-threaded commands 0-3 two per subjob]
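A hedged sketch: with four single-threaded commands, -p 2 packs two commands into each subjob so both hyperthreads of an allocated core are used.
$ swarm -f file.swarm -p 2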
--logdir
Redirects .o and .e files to a directory
The directory must first exist
$ mkdir /data/user/trashbin
$ swarm -f file.swarm --logdir /data/user/trashbin
1234567
$ ls
file.swarm
$ ls /data/user/trashbin
swarm_1234567_0.o    swarm_1234567_1.o    swarm_1234567_2.o
swarm_1234567_0.e    swarm_1234567_1.e    swarm_1234567_2.e
--time
All jobs must have a walltime now
--time for swarm is per command, not per swarm or per subjob
--time is multiplied by the bundle factor
$ swarm -f file.swarm --devel --time=01:00:00
32 commands run in 32 subjobs, each command requiring 1.5 gb and 1 thread, allocating 32
cores and 64 cpus
sbatch --array=0-31 ... --time=01:00:00 /spin1/swarm/hooverdm/iMdaW6dO.batch

$ swarm -f file.swarm --devel --time=01:00:00 -b 4
32 commands run in 8 subjobs, each command requiring 1.5 gb and 1 thread, running 4
processes serially per subjob
sbatch --array=0-7 ... --time=04:00:00 /spin1/swarm/hooverdm/zYrUbkiO.batch
Primary sbatch options
--job-name
--dependency
--time, --gres
--partition
--qos
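A hedged sketch combining several of these options in a single submission (the job name, walltime, and dependency job ID are placeholders):
$ swarm -f file.swarm --job-name myswarm --partition norm --time=02:00:00 \
    --gres=lscratch:10 --dependency=afterany:1234567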
ALL sbatch options
--sbatch
Type 'man sbatch' for more information
$ swarm -f file.swarm --sbatch "--mail-type=FAIL --export=var=100,nctype=12 --workdir=/data/user/test"
--prologue and --epilogue
way too difficult to implement
conflicts with the --prolog and --epilog options to srun
-W block=true
sbatch does not allow blocking
must use srun instead
-R, --resource
gpfs is available on all nodes
Replaced by a combination of --license, --gres, and --constraint
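A hedged example of the replacement combination (the node feature name x2695 is hypothetical, and the constraint is passed through --sbatch here rather than as a native swarm option):
$ swarm -f file.swarm --license=matlab --gres=lscratch:10 --sbatch "--constraint=x2695"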
Examples
single-threaded commands
$ swarm -f file.swarm
multi-threaded commands
$ swarm -f file.swarm -t 4
large memory, single-threaded
$ swarm -f file.swarm -g 32
Examples
>10K single-threaded commands
$ mkdir /data/user/bigswarm
$ swarm -f file.swarm --job-name bigswarm --logdir /data/user/bigswarm
wait for it and deal with the output ...
$ cd /data/user/bigswarm
$ cat *.e > bigswarm.err
$ cat *.o > bigswarm.out
$ rm *.{e,o}
Examples
Large temp files
export TMPDIR=/lscratch/$SLURM_JOBID ; command -opt 1
export TMPDIR=/lscratch/$SLURM_JOBID ; command -opt 2
export TMPDIR=/lscratch/$SLURM_JOBID ; command -opt 3
export TMPDIR=/lscratch/$SLURM_JOBID ; command -opt 4
$ swarm -f file.swarm --gres=lscratch:200
Examples
Dependencies
$ sbatch script_1.sh
10000
$ swarm -f file.swarm --dependency=afterany:10000
10001
$ swarm -f file2.swarm --dependency=afterany:10001
10002
$ sbatch --dependency=afterany:10002 sbatch_2.sh
10003
Examples
Long-running processes
$ swarm -f file.swarm --time=4-00:00:00
Defaults and Limits
1,000 subjobs per swarm
4,000 jobs per user max
30,000 jobs in slurm max
1.5 GB/process default, 1 TB max
0 GB/disk, 800 GB max
4 hours walltime, 10 days max
batchlim and freen
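Current limits and free resources can be checked with these utilities (output omitted):
$ batchlim
$ freen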
Monitoring Swarms
current and running jobs:
squeue
sjobs
jobload
historical
sacct
jobhist
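For example (the job ID is a placeholder; sjobs, jobload, and jobhist are Biowulf-specific tools):
$ squeue -u $USER
$ jobhist 1234567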
Stopping Swarms
scancel
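For example, using the job ID returned at submission (1234567 is a placeholder); Slurm's array syntax also lets you cancel a single subjob:
$ scancel 1234567       # cancel the entire swarm
$ scancel 1234567_2     # cancel subjob 2 only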
Complex Examples
Rerunnable swarm
-e tests if the file exists
[[ -e file1.flag ]] || ( command1 && touch file1.flag )
[[ -e file2.flag ]] || ( command2 && touch file2.flag )
[[ -e file3.flag ]] || ( command3 && touch file3.flag )
[[ -e file4.flag ]] || ( command4 && touch file4.flag )
Complex Examples
Very long command lines
cd /data/user/project; KMER="CCCTAACCCTAACCCTAA"; \
jellyfish count -C -m ${#KMER} \
-t 32 \
-c 7 \
-s 1000000000 \
-o /lscratch/$SLURM_JOBID/39sHMC_Tumor_genomic \
<(samtools bam2fq /data/user/bam/0A4HMC/DNA/genomic/39sHMC_genomic.md.bam ); \
echo ${KMER} | jellyfish query /lscratch/$SLURM_JOBID/39sHMC_Tumor_genomic_0 \
> 39sHMC_Tumor_genomic.telrpt.count
Complex Examples
Comments
--comment-char, --no-comment
# This is for the first file
command -i infile1 -o outfile1 -p 2 -r /path/to/file
# This is for the next file
command -i infile2 -o outfile2 -p 2 -r /path/to/another/file
Complex Examples
Environment variables
Defined BEFORE the job; passed to the swarm:
$ swarm -f file.swarm --sbatch "--export=FILE=/path/to/file,DIR=/path/to/dir,VAL=12345"
Defined WITHIN the job; part of the swarmfile:
export TMPDIR=/lscratch/$SLURM_JOBID ; command -opt 1
export TMPDIR=/lscratch/$SLURM_JOBID ; command -opt 2
export TMPDIR=/lscratch/$SLURM_JOBID ; command -opt 3
export TMPDIR=/lscratch/$SLURM_JOBID ; command -opt 4
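Variables passed in with --export can then be referenced inside the swarmfile commands, e.g. (a sketch reusing the FILE, DIR, and VAL names from the example above; the option names are hypothetical):
command -i $FILE -o $DIR/outfile -v $VAL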
Blank Space for More Examples
 
Questions?  Comments?
staff@helix.nih.gov