MSKCC cBio Cluster User Guide
- Overview
- System access
- Storage
- Running jobs
- Managing jobs
- Cleaning up after jobs
- Globally-installed software
- Datasets and repositories
- Frequently asked questions (FAQ)
New cluster head node DNS name: hal.cbio.mskcc.org
The cBio cluster consists of 30 Exxact dual-socket E5-2665 2.4GHz nodes (32 hyperthreads/node) with 256GB of memory. Each node is configured with 4x NVIDIA GTX-680 [18 nodes], 4x GTX-Titan [10 nodes], 4x GTX-980 [1 node], or 4x GTX-TITAN-X [1 node] GPUs.
No PHI is allowed on this cluster.
User accounts are requested internally on the HPC website at MSKCC. For external collaborator account requests, please visit: https://www.mskcc.org/collaborator-access
If you are having problems, please email hpc-request 'at' cbio.mskcc.org.
There are 30 compute nodes in the cluster, each containing
- 2x Intel Xeon E5-2665 2.4GHz CPUs in hyperthreaded mode, providing a total of 32 threads per node
- 256GB memory (16 x 16GB DDR3 1600MHz ECC/Reg server memory)
- 4x NVIDIA GPUs per node: 4GB GTX-680s [18 nodes], 6GB GTX-Titans [10 nodes], 4GB GTX-980s [1 node], or 12GB GTX-TITAN-Xs [1 node]
- 10GE ethernet interfaces [Intel E10G42BTDA Server Adapter X520-DA2 10Gbps PCIe 2.0 x8 2xSFP+ ports] connected to a Dell PowerConnect 8164F 10GE 48-port switch
- 1GE ipKVM and IPMI 2.0-compliant management interfaces connected to a Dell PowerConnect 2848 1GE 48-port switch
The compute nodes are provided by Exxact Corp. See this link for an image of the node chassis.
In addition, there are 2 compute nodes (Dell R820), each containing
- 4x Intel Xeon E5-4640 2.40GHz CPUs in hyperthreaded mode, providing a total of 64 threads per node
- 512GB memory (32 x 16GB DDR3 1600MHz ECC/Reg server memory)
- 10GE ethernet interfaces [Intel E10G42BTDA Server Adapter X520-DA2 10Gbps PCIe 2.0 x8 2xSFP+ ports] connected to a Dell PowerConnect 8164F 10GE 48-port switch
- 1GE ipKVM and IPMI 2.0-compliant management interfaces connected to a Dell PowerConnect 2848 1GE 48-port switch
Fast local filesystem storage for the cluster is provided by a GPFS filesystem hosted on Dell servers. This filesystem is intended for caching local datasets and production storage, rather than long-term data archival. This storage is shared by all cBio groups, and quotas will be enforced after a friendly user period ends.
Home directories are located in /cbio/xxlab/home, where xxlab denotes the laboratory designation (e.g. grlab or jclab). Once the backup server is activated, home directories will be backed up via frequent snapshotted backups. Limited backup space is available, so large project datasets that change frequently should not be stored here, or else your group will rapidly run out of backup space. See "group project directories" below.
Please keep only critical files in your home directory to ensure your home directory usage is below 100GB; otherwise, you will break our frequent remote backups of user home directories. You can check home directory space usage with du -sh ~. Use "group project directories" for larger backed-up storage---there is much more space available there.
Group shared directories are located in /cbio/xxlab/share, and are intended for storing software or datasets that are shared among the group. This directory and its contents are made group-writeable by default. This directory will also be backed up, but less frequently than home directories.
Group project directories are located in /cbio/xxlab/projects, and are intended to host large active research projects that may be shared within the laboratory. These directories will also be backed up, though less frequently than home directories.
MSK collaborators should create their project directories in /cbio/xxlab/projects/collab/.
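For example, a collaborator sponsored by a hypothetical group xxlab could create a project directory (directory name hypothetical) with:
mkdir -p /cbio/xxlab/projects/collab/my_collab_project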
Groups also have non-backed-up group storage located in /cbio/xxlab/nobackup that can host large dataset mirrors that are not irreplaceable. These directories will not be backed up.
The total storage space accessible to all groups is limited. Quotas are not currently enforced, but we will be charging for storage at a rate of $35/TB/month based on the amount of space you use. This is posted and discussed further on the internal HPC website at MSKCC.
We are currently running with un-enforced soft quotas at the user level only. A simple wrapper around the GPFS mmlsquota command is available on HAL if you type:
mskcc-ln1> cbioquota
Please note currently all the quotas below are soft quotas and are not enforced.
This is purely to give users a way of determining their disk usage quickly.
Quotas for uid=9999(username) gid=9999(usergroup)
This is your current usage in GB
|
V
Block Limits
Filesystem type GB quota limit
gpfsdev USR 200 51200 0
GROUP usergroup
Disk quotas for group usergroup (gid 9999):
Block Limits
Filesystem type GB quota limit
gpfsdev GRP no limits
It attempts to present a friendly summary of your user and future group quotas.
Each node has 3.3TB of scratch space accessible in /scratch. Note that /tmp is very limited in space, and only has a few GB free, so you should use /scratch as your temporary directory instead. Node scratch directories are local only to the compute node, are never backed up, are shared amongst all groups, and are only RAID0 (striping) and therefore provide no guarantees of data persistence.
While running a batch job on the system (see "Batch Queue System Overview" for more information), the local scratch space is automatically assigned to the running job and is removed when the job finishes. The environment variable TMPDIR is set to the job's scratch directory.
For example, for job 380886 it is:
TMPDIR=/scratch/380886.mskcc-fe1.local
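A minimal sketch of using the job's scratch directory from inside a batch script, assuming hypothetical program and file names:
# stage input to fast node-local scratch, run there, then copy results back
cp $PBS_O_WORKDIR/input.dat $TMPDIR/
cd $TMPDIR
$PBS_O_WORKDIR/myprog input.dat > output.dat
cp output.dat $PBS_O_WORKDIR/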
Important: If your application writes to /tmp instead of /scratch, it may fail when /tmp fills up. Please do not use /tmp.
- An independent storage system (consisting of a MD3260 disk array connected to node04) stores rsync-based snapshots of the following directories:
- /cbio/xxlab/home: taken every 2 hours, available at /cbio/xxlab/snap_home/, excludes **/.sge, **/nobackup, **/tmp
- /cbio/xxlab/share: taken every 6 hours, available at /cbio/xxlab/snap_share/, excludes **/.sge
- /cbio/xxlab/projects: taken every 12 hours, available at /cbio/xxlab/snap_projects/, excludes **/.sge, **/TCGA*.bam, **/CCLE*.bam
- /cbio/xxlab/nobackup: taken every 24 hours, available at /cbio/xxlab/snap_nobackup/, excludes **/.sge, **/TCGA*.bam, **/CCLE*.bam (only continued as long as there is sufficient backup space)
- The same disk array also stores a crashplan-based backup of the following directories:
- /cbio/grlab/{home,share}: taken continuously, restores available at request from Gunnar Rätsch, excludes ~/tmp, ~/nobackup, *.bam, *.sam*, *.fastq*, *.fq*, core.*, .mozilla
- /cbio/jclab/{home,share}: taken continuously, restores available at request from Gunnar Rätsch, excludes ~/tmp, ~/nobackup
- /cbio/cllab/{home,share}: taken continuously, restores available at request from Gunnar Rätsch, excludes ~/tmp, ~/nobackup, *.bam, *.fastq*, *.sam*
- /cbio/cslab/{home,share}: taken continuously, restores available at request from Gunnar Rätsch, excludes ~/tmp, ~/nobackup, *.bam, *.fastq*, *.sam*
- /cbio/galab/{home,share}: taken continuously, restores available at request from Gunnar Rätsch
- /cbio/jxlab/{home,share}: taken continuously, restores available at request from Gunnar Rätsch
Archives of finished projects and the like can be accessed at /cbio/xxlab/archives. In order to create a new archive, log in to node04, go to /export/archives1/xxlab, and move the archive data to this location. Archives are not necessarily snapshotted (hence, be careful!).
The storage subsystem consists of
- 2x Dell PowerEdge R720 metadata servers containing 15K RPM SAS disks, providing metadata redundancy
- 4x Dell PowerEdge R620 file servers with 24GB RAM and Broadcom 57800 2x10Gb DA/SFP+ network cards
- 4x Dell PowerVault MD3260 SAS disk arrays each containing 60 3TB 7.2K RPM 6Gbps hot-plug SAS hard drives in 10-disk RAID5 configurations to provide fault tolerance to loss of single disks or an entire drawer of disks
Email HPC Request <[email protected]> with an account creation request. The following details will be needed:
- full name
- sponsoring cBio laboratory
- email address
- reason for access (e.g. grant proposal, funded grant, data sharing)
- mobile phone number (in case we have to contact you in emergencies)
- BIC cluster UID/GID (if you have one already)
- ssh public key
Your ssh public key on *nix systems is generally found in your home directory under one of the following filenames:
~/.ssh/id_rsa.pub
~/.ssh/id_dsa.pub
If you do not have such a file, generate an ssh public key:
ssh-keygen -t dsa -N ''
and follow the prompts. Be aware that the -N '' argument creates the key without a passphrase. Consult the ssh-keygen and ssh manpages for adding a passphrase and using it during login.
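For example, to generate an RSA key instead and be prompted to set a passphrase, something like the following should work:
ssh-keygen -t rsa -b 4096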
Password-based login is not currently supported. Login via trusted ssh access is currently the only login method supported. Your ssh public keys for any machines you log in from must be submitted and used for access.
You can ssh to hal.cbio.mskcc.org to log in:
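For example (the username placeholder is an assumption; substitute your own account name):
ssh <username>@hal.cbio.mskcc.org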
If you would like to log into HAL from another machine, you will need to add your SSH public key to your ~/.ssh/authorized_keys file on HAL. You can do this yourself---there is no need to contact the sysadmins to do this for you.
From the new machine you would like to log in from, locate your public key (generally ~/.ssh/id_rsa.pub or ~/.ssh/id_dsa.pub on Linux or OS X systems). Note that you also have private keys (~/.ssh/id_rsa and ~/.ssh/id_dsa) that should not be used here. If you cannot find a public key, you can generate a new one using
ssh-keygen -t dsa -N ''
Your public key will look something like
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDQ8iicpuXcHZn9ppdnDxBSu9VugPXTBSke3eG6tTm+vrAHTDZaCNYAV87ejFfEyrRRISOAFSA5m6xkMki3WlC/ebI3m595GKxHL447Lme37mENZ4IV9K1X/aMkVhvOiFaEIFs6yZWteAi9VakQP5M5DG2ul8i/sJ4NNd/JU+00QvcrHc4D1DhMaNy6vbmlF7USLS6z/8NlQSWjxo+pA5HnSC//azY0ZFZLqVIqwtzZGlTe94e36BUqGyoD3ndGFLFjZbEufHxX3l47RNM+XUaM9BDUSRKeFJtWg1Cx7IWTPiKDBrhozx5yzGZSGVf1dn/Vn4p1SaDEXnbj44aam9U9 username@systemname
or
ssh-dss AAAAB3NzaC1kc3MAAACBAMoBcHWy/pWU1s3c+dgUfZMl1ldu8eXTKL27Kc5yFmQxD0PB3qLwoUb+T6HK5EVL/WM/nKS7umYFeFlNF7YZjOKnzErUVqbGi9U48coprnGl88goBqJiBQAp2mxfbl8EhPvXK58K7LqPwXoWh1ssSOGipC1+hxJUz0SjOR4zQo8hAAAAFQD/qqz9T6K13HaWtylXAuCLsxHpYQAAAIAuQ9ZG5AfmbdGlluH7nMTuyCyJSYsoddXYwL6RJtvLEaqE2vG7341bWG9j9gZzU2LED59rJUnB0HCrP9pXL957wH6ajto82aGkIOmzAzzLZh5oPWPiVkc9o3wZod86IdQwnclu77f6LY/Q+J1wzihgY+lTsqrbKN/ai5JH21BRWgAAAIBzwOaFCni2k+2IdhtMGyKW0iXqVCyXZL4h8NqoVZHv9ys+H15tegT8Hsd36DRMFcqshGl/nQEAuByXwfKy0l/a0uYFi21hyJcUR8wNv02hlM7Z7V4jkfAPd8c/5X91VLdRG18dy6tJkZQ/AZ0jAdjAAW2KdfU13pSgPSJo1F3bjw== username@systemname
Append the key (being careful to preserve it all on one line if using copy-paste) to your ~/.ssh/authorized_keys file on HAL. Do NOT delete any keys already there. You can use a text editor or cat for this purpose:
cat >> ~/.ssh/authorized_keys
{paste your key here, press enter, and press ctrl-D}
Test the new key by trying to log in from the new machine. If you are asked for a password, go back and carefully check you don't have line breaks or other issues with the added key.
If you have issues with logging in, please email the support email alias of hpc-request 'at' cbio.mskcc.org.
While the head node is available to run short, non-memory-intensive tasks (e.g. editor sessions, archival, compilation, etc.), longer-running jobs or jobs that require more resources should be run on the nodes through the batch queue system (either in batch or interactive mode)---see below.
PLEASE DO NOT RUN MEMORY OR CPU INTENSIVE JOBS ON THE HEAD NODE. Use interactive logins to the compute nodes (see below) for this purpose.
Shell processes on the head node are currently restricted to 10 hours of runtime and 4GB of RAM.
The batch queue system now uses the Torque resource manager with the Moab HPC suite from Adaptive Computing. The old SLURM system is no longer in use.
Example single CPU job with 24-hour time limit:
#!/bin/tcsh
# Batch script for single thread CPU job.
#
# walltime : maximum wall clock time (hh:mm:ss)
#PBS -l walltime=24:00:00
#
# join stdout and stderr
#PBS -j oe
#
# spool output immediately
#PBS -k oe
#
# specify queue
#PBS -q batch
#
# nodes: number of nodes
# ppn: number of processes per node
#PBS -l nodes=1:ppn=1
#
# export all my environment variables to the job
#PBS -V
#
# job name (default = name of script file)
#PBS -N myjob
#
# mail settings (one or more characters)
# email is sent to local user, unless another email address is specified with PBS -M option
# n: do not send mail
# a: send mail if job is aborted
# b: send mail when job begins execution
# e: send mail when job terminates
#PBS -m n
#
# filename for standard output (default = <job_name>.o<job_id>)
# at end of job, it is in directory from which qsub was executed
# remove extra ## from the line below if you want to name your own file
##PBS -o myoutput
# Change to working directory used for job submission
cd $PBS_O_WORKDIR
# Launch my program.
./myprog
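To submit a script like the one above (filename hypothetical) and check on it, something like:
qsub myjob.sh
qstat -u $USER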
Example 6-process MPI job on a single node with 12-hour time limit:
#!/bin/tcsh
# Batch script for 6-process MPI CPU job.
#
# walltime : maximum wall clock time (hh:mm:ss)
#PBS -l walltime=12:00:00
#
# join stdout and stderr
#PBS -j oe
#
# spool output immediately
#PBS -k oe
#
# specify queue
#PBS -q batch
#
# nodes: number of nodes
# ppn: number of processes per node
#PBS -l nodes=1:ppn=6
#
# export all my environment variables to the job
#PBS -V
#
# job name (default = name of script file)
#PBS -N myjob
#
# mail settings (one or more characters)
# email is sent to local user, unless another email address is specified with PBS -M option
# n: do not send mail
# a: send mail if job is aborted
# b: send mail when job begins execution
# e: send mail when job terminates
#PBS -m n
#
# filename for standard output (default = <job_name>.o<job_id>)
# at end of job, it is in directory from which qsub was executed
# remove extra ## from the line below if you want to name your own file
##PBS -o myoutput
# Change to working directory used for job submission
cd $PBS_O_WORKDIR
# Launch MPI job.
mpirun -rmk pbs progname
where the -rmk pbs option instructs the hydra mpirun version to use the PBS resource manager kernel to take information about the number of processes to launch from PBS environment variables.
If you want to allow the MPI processes to run on any node (rather than force them to run on the same node), you can change
# nodes: number of nodes
# ppn: number of processes per node
#PBS -l nodes=1:ppn=6
to
# nodes: number of processes
# tpn: number of tasks (processes) per node
#PBS -l nodes=6,tpn=1
Submit an array job with 80 tasks, with 10 active at a time:
qsub -t 1-80%10 script.sh
#!/bin/sh
# walltime : maximum wall clock time (hh:mm:ss)
#PBS -l walltime=36:00:00
#
# join stdout and stderr
#PBS -j oe
#
# spool output immediately
#PBS -k oe
#
# specify GPU queue
#PBS -q gpu
#
# nodes: number of nodes
# ppn: number of processes per node
# gpus: number of gpus per node
# GPUs are in 'exclusive' mode by default, but 'shared' keyword sets them to shared mode.
#PBS -l nodes=1:ppn=1:gpus=1:shared
#
# export all my environment variables to the job
#PBS -V
#
# job name (default = name of script file)
#PBS -N myjob
#
# mail settings (one or more characters)
# email is sent to local user, unless another email address is specified with PBS -M option
# n: do not send mail
# a: send mail if job is aborted
# b: send mail when job begins execution
# e: send mail when job terminates
#PBS -m n
#
# filename for standard output (default = <job_name>.o<job_id>)
# at end of job, it is in directory from which qsub was executed
# remove extra ## from the line below if you want to name your own file
##PBS -o myoutput
# Change to working directory used for job submission
cd $PBS_O_WORKDIR
# Launch my program for this array task, passing the array index.
/some_path_to_script/some_executable $PBS_ARRAYID
NOTE: The wallclock limit for the gpu queue is 72 hours (72:00:00).
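Inside an array job script like the one above, $PBS_ARRAYID can be used to select per-task inputs; a minimal sketch, with a hypothetical input file layout:
# pick the input file corresponding to this array task
INPUT=`ls $PBS_O_WORKDIR/inputs/*.dat | sed -n "${PBS_ARRAYID}p"`
/some_path_to_script/some_executable $INPUT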
Using a single GPU is easy, since the CUDA_VISIBLE_DEVICES environment variable will be set automatically so that the CUDA driver only allows access to a single GPU.
#!/bin/tcsh
# Batch script for a single-GPU job on the cbio cluster
#
# walltime : maximum wall clock time (hh:mm:ss)
#PBS -l walltime=12:00:00
#
# join stdout and stderr
#PBS -j oe
#
# spool output immediately
#PBS -k oe
#
# specify GPU queue
#PBS -q gpu
#
# nodes: number of nodes
# ppn: number of processes per node
# gpus: number of gpus per node
# GPUs are in 'exclusive' mode by default, but 'shared' keyword sets them to shared mode.
#PBS -l nodes=1:ppn=1:gpus=1:shared
#
# export all my environment variables to the job
#PBS -V
#
# job name (default = name of script file)
#PBS -N myjob
#
# mail settings (one or more characters)
# email is sent to local user, unless another email address is specified with PBS -M option
# n: do not send mail
# a: send mail if job is aborted
# b: send mail when job begins execution
# e: send mail when job terminates
#PBS -m n
#
# filename for standard output (default = <job_name>.o<job_id>)
# at end of job, it is in directory from which qsub was executed
# remove extra ## from the line below if you want to name your own file
##PBS -o myoutput
# Change to working directory used for job submission
cd $PBS_O_WORKDIR
# Launch GPU job.
./myjob
If you need to manually control CUDA_VISIBLE_DEVICES, create a file in your home directory:
touch $HOME/.dontsetcudavisibledevices
This will turn off the automatic setting of CUDA_VISIBLE_DEVICES. You will need to set this environment variable manually even if you use a single GPU, using:
export CUDA_VISIBLE_DEVICES=`cat $PBS_GPUFILE | awk -F"-gpu" '{ printf A$2;A=","}'`
You can add a constraint to your Torque request to specify which class of GPU you want to run on, though limited availability may mean that your job takes longer to start. Available classes are:
- gtx680 : NVIDIA GTX-680 (18 nodes)
- gtxtitan : NVIDIA GTX-TITAN (10 nodes)
- gtx980 : NVIDIA GTX-980 (1 node)
- gtxtitanx : NVIDIA GTX-TITAN-X (1 node)
For example, to request a GTX-TITAN-X for interactive benchmarking, use something like
qsub -I -l walltime=02:00:00,nodes=1:ppn=1:gpus=1:gtxtitanx:exclusive -l mem=4G -q active
Since the GPUs available on different nodes may differ, you will need to manually control CUDA_VISIBLE_DEVICES. Create a file in your home directory:
touch $HOME/.dontsetcudavisibledevices
You should use conda or miniconda to install clusterutils to add some helpful MPI configfile building scripts to your path:
conda install -c omnia clusterutils
You will then need to use build_mpirun_configfile to set CUDA_VISIBLE_DEVICES for each process individually based on the $PBS_GPUFILE contents.
Example running an MPI job across 4 GPUs:
#!/bin/tcsh
# Batch script for MPI GPU job on the cbio cluster
# utilizing 4 GPUs, with one thread/GPU
#
# walltime : maximum wall clock time (hh:mm:ss)
#PBS -l walltime=12:00:00
#
# join stdout and stderr
#PBS -j oe
#
# spool output immediately
#PBS -k oe
#
# specify GPU queue
#PBS -q gpu
#
# nodes: number of nodes
# ppn: number of processes per node
# gpus: number of gpus per node
# GPUs are in 'exclusive' mode by default, but 'shared' keyword sets them to shared mode.
#PBS -l nodes=1:ppn=4:gpus=4:shared
#
# export all my environment variables to the job
#PBS -V
#
# job name (default = name of script file)
#PBS -N myjob
#
# mail settings (one or more characters)
# email is sent to local user, unless another email address is specified with PBS -M option
# n: do not send mail
# a: send mail if job is aborted
# b: send mail when job begins execution
# e: send mail when job terminates
#PBS -m n
#
# filename for standard output (default = <job_name>.o<job_id>)
# at end of job, it is in directory from which qsub was executed
# remove extra ## from the line below if you want to name your own file
##PBS -o myoutput
# Change to working directory used for job submission
cd $PBS_O_WORKDIR
# Build the mpirun configfile, which sets CUDA_VISIBLE_DEVICES for each process
build_mpirun_configfile progname args
# Launch MPI job.
mpirun -configfile configfile
If you don't need the GPUs to be on a single node, you can change
# nodes: number of nodes
# ppn: number of processes per node
# gpus: number of gpus per node
# GPUs are in 'exclusive' mode by default, but 'shared' keyword sets them to shared mode.
#PBS -l nodes=1:ppn=4:gpus=4:shared
to
# nodes: number of process sets
# tpn: number of process sets to launch on each node
# gpus: number of gpus per process set
# GPUs are in 'exclusive' mode by default, but 'shared' keyword sets them to shared mode.
#PBS -l nodes=4,tpn=1,gpus=1:shared
Sometimes it is necessary to specify dependencies on the order in which jobs must be run. For example, if there are two tasks, job_A and job_B, and job B must only be executed after job A is completed, this can be achieved by specifying a job dependency. To do so, keep track of the job IDs and pass these to qsub:
#submit job_A, keep track of its ID so you can add a dependency
job_A_ID=`qsub job_A.sh`;
#submit job_B, specifying that it should only be executed after job_A is completed
qsub -W depend=after:${job_A_ID} job_B.sh
Descriptions of other dependency types that can be specified are available in the qsub man page under the -W option.
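For example, to run job_B only if job_A completes successfully (exit status 0), the afterok dependency type can be used, following the same pattern as above:
job_A_ID=`qsub job_A.sh`
qsub -W depend=afterok:${job_A_ID} job_B.sh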
qstat will print out information about what is running in the queue, and takes typical arguments such as -u username.
Example:
$ qstat
Job ID Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
161110.mskcc-fe1 ...o_target-temp [user] 141:01:0 R gpu
161111.mskcc-fe1 ...o_target-temp [user] 140:46:4 R gpu
161112.mskcc-fe1 ...o_target-temp [user] 140:56:2 R gpu
189342.mskcc-fe1 ...ssembly_iso_3 [user] 70:28:15 R gpu
189344.mskcc-fe1 ...target-temp_2 [user] 70:24:51 R gpu
189345.mskcc-fe1 ...target-temp_2 [user] 70:14:16 R gpu
189346.mskcc-fe1 ...target-temp_2 [user] 70:07:30 R gpu
189348.mskcc-fe1 ...target-temp_2 [user] 68:28:11 R gpu
198319.mskcc-fe1 gpu [user] 70:26:31 R batch
198320.mskcc-fe1 gpu [user] 70:22:06 R batch
198664.mskcc-fe1 QLOGIN [user] 04:31:57 R batch
qstat -a prints in an alternative format, with additional info for each job such as node and task counts, and requested memory size.
showstart <job_id> will give you an estimated starting time if the load is high. Keep in mind that the cluster runs on EST.
The Moab command showstate gives a quick overview of the cluster and running jobs.
The Moab command showres -n -g provides a useful reservation summary by node.
> showres -n -g
reservations on Fri Oct 31 13:49:46
NodeName Type ReservationID JobState Task Start Duration StartTime
gpu-1-4 Job 2495114 Running 1 -2:06:43:05 3:00:00:00 Wed Oct 29 07:06:41
gpu-1-4 Job 2497256 Running 1 -2:01:13:54 3:00:00:00 Wed Oct 29 12:35:52
gpu-1-4 Job 2497264 Running 1 -2:01:00:30 3:00:00:00 Wed Oct 29 12:49:16
gpu-1-4 Job 2498684 Running 1 -1:22:30:43 3:00:00:00 Wed Oct 29 15:19:03
gpu-1-4 Job 2499850 Running 4 -1:17:40:57 3:00:00:00 Wed Oct 29 20:08:49
gpu-1-4 Job 2499902 Running 1 -1:14:31:32 3:00:00:00 Wed Oct 29 23:18:14
gpu-1-4 Job 2499904 Running 1 -1:14:26:23 3:00:00:00 Wed Oct 29 23:23:23
gpu-1-4 Job 2499906 Running 1 -1:13:25:38 3:00:00:00 Thu Oct 30 00:24:08
The tracejob utility reads the accounting data file and produces a summary of information about the finished job, including cpu-time, exec_hostname, owner-name, job-name, job-ID, Exit_status, and the resource requirements as specified.
Usage: tracejob -lsm <job_id>
You can use qsub to start interactive jobs using qsub -I -q active, which requests an interactive session (-I) in the special interactive high-priority queue (-q active).
Example: To start an interactive job on one core with time limit of 1 hour:
qsub -I -q active -l walltime=01:00:00 -l nodes=1:ppn=1
Example: To start an interactive job running bash on one core with a time limit of 4 hours and 4GB of memory:
qsub -I -q active -l walltime=04:00:00 -l nodes=1:ppn=1 -l mem=4G bash
Example: To start an interactive job with time limit of 1 hour, requesting one GPU in "shared" mode:
qsub -I -q active -l walltime=01:00:00 -l nodes=1:ppn=1:gpus=1:shared
To request a particular kind of GPU, you can specify one of the GPU classes listed above (e.g. 'gtx680' or 'gtxtitan').
Example: To start an interactive job with time limit of 1 hour, requesting one GTX-680 GPU in "shared" mode:
qsub -I -q active -l walltime=01:00:00 -l nodes=1:ppn=1:gpus=1:shared:gtx680
Sometimes an interactive job contains a graphical component. In order to forward X11 from a job running interactively on a compute node, the following prerequisites must be met.
First, your SSH session to HAL must be doing X11 forwarding properly to your X11 display. An example is below.
ssh -X hal.cbio.mskcc.org
mskcc-ln1 ~]$ echo $DISPLAY
localhost:15.0
mskcc-ln1 ~]$ xeyes
If you do not see big silly eyes looking at your cursor consult the manpage for ssh to make sure you are forwarding X11 properly. Sometimes the "-Y" argument is needed depending on your X11 desktop configuration.
Once that works, interactive qsub sessions support adding the "-X" flag. For example:
qsub -X -I -l walltime=02:00:00,nodes=1:ppn=1 -q active
This should start a job with forwarded X11 on the node it is scheduled on. Please note that forwarding X11 requires you to keep the SSH session to HAL running. You cannot exit and expect graphical applications to appear on your desktop.
Use qdel <jobid> to kill your batch job by job id.
On heavily overloaded nodes, this may take up to half an hour to actually kill and purge the job.
Array jobs have a few extra items that can be done when deleting them.
To delete the entire array job be sure to include the trailing [] syntax:
qdel 282829[] # deletes entire array job
qdel 282829[1] # deletes single entry in array job
qdel -t 2-5 282829[] # deletes range of items in array job
qdel -t 1,5,6 282829[] # deleting specific array job members
There is a Ganglia monitoring system installed on the cluster. On Mac OS, you can access Ganglia via a web browser using these commands (hint: create an alias; on Linux, replace open with firefox or your favorite browser):
ssh -L 8081:localhost:80 140.163.0.215 -f sleep 1000
open http://localhost:8081/ganglia/
If you are located on the MSKCC network ranges, the same data is available at
http://hal.cbio.mskcc.org/ganglia/
Useful Torque and Moab commands for managing and monitoring batch jobs
Normally, most items associated with a completed job are cleaned up when its processes exit; a couple of items that the user does not clean up will be handled as follows.
If a user job leaves a shared memory segment on the system it will persist until a nightly cron job evaluates its impact. If the memory consumed by the shared memory segment is larger than 1GB the user will be emailed about it and the memory segment will be deleted. This will only be done if the shared memory segment has no processes attached to it.
Smaller shared memory segments will be removed without emailing after one week.
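If you want to check for and clean up your own leftover shared memory segments before the cron job does, the standard System V IPC tools can be used (the segment id shown is hypothetical):
ipcs -m          # list shared memory segments and their owners
ipcrm -m 123456  # remove a segment you own, by its shmid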
PROPOSED CHANGE FOLLOWS. THIS IS NOT IMPLEMENTED AT THIS TIME
Torque by default creates a TMPDIR variable and directory for jobs to use automatically if they want. For example, for a submitted job, a directory will automatically be created in /scratch similar to the one below.
TMPDIR=/scratch/7048239.hal-sched1.local
In most cases, that directory will be deleted when the job exits. In some situations we have seen that not happen, and as such we will soon be starting the following cron-based cleaning routines for the /scratch area.
- The directory /scratch/shared is considered a persistent, non-managed place for manually sync'd user data on a node. The cron script will NOT apply age-based deletion to anything in this directory, so if you have placed items in /scratch itself, please consider moving them to /scratch/shared.
- No directory matching the above jobid pattern will be touched.
- The directory /scratch/docker, which is where docker images and state are kept, will not be touched.
- ALL OTHER items in /scratch older than (for a first pass) 365 days will shortly be deleted.
CURRENT POLICY: Currently no automated processes clean the /scratch areas on the nodes, so if your job makes use of /scratch on the nodes, it is up to you to remove leftover items there.
Matlab R2013a is installed as a module. To use it, use
module add matlab
To get an interactive login for Matlab, you can do the following:
qsub -I -l nodes=1:ppn=4:matlab -l mem=40gb -q active
Be sure to specify the active queue, or else your jobs will not start promptly. The active queue should only be used for interactive jobs, and not for running batch jobs. There is a default time limit of two hours on the active queue, though shorter time limits can be specified (e.g. -l walltime=00:30:00 for a 30-minute time limit).
(The attribute matlab in the call will make sure that you are scheduled on a node with an available license.)
You will get an interactive shell. You can start matlab with:
/opt/matlab/R2013a/bin/matlab
There are a limited number of floating and node-locked licenses which may be requested as resources in the batch queue environment. TODO: Document batch queue license requests for Matlab
The Ruby Version Manager (RVM) https://rvm.io/ is installed as a module and gives you many options for dealing with Ruby variations and development needs.
The recommended procedure for using RVM on the cluster is:
module initadd rvm
This will add the rvm binary to your path on subsequent logins. Then it is wise to choose a Ruby version from your .bash_profile along the lines of:
rvm use default
Or a more current version can be chosen after viewing:
rvm list
If a user wants to self-maintain Ruby versions or gems, the rvm user command allows the selection of home-directory-maintained items. Most users will probably want to minimally select:
rvm user gemsets
which then allows the user to easily add the various gems they need.
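As a minimal sketch (gemset and gem names hypothetical), creating and using a per-user gemset might look like:
rvm gemset create myproject
rvm gemset use myproject
gem install bundler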
The cluster makes use of a module environment for managing software packages where it is important to provide users with several different versions. The module environment makes it easy to configure your software versions quickly and conveniently.
To see the various options available, type
> module
To list currently loaded modules and versions:
> module list
Currently Loaded Modulefiles:
1) gcc/4.8.1 2) mpich2_eth/1.5 3) cuda/5.5 4) cmake/2.8.10.2
To list all available modules that can be loaded:
> module avail
To add a new module, use module add:
> module add cuda/5.5
The number that comes after the module name followed by a slash is the version number of the software.
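To unload a module you no longer need, the standard module rm command can be used:
> module rm cuda/5.5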
More information about available modules can be obtained with the module show command:
> module show cuda/5.5
-------------------------------------------------------------------
/etc/modulefiles/cuda/5.5:
module-whatis cuda
module-whatis Version: 5.5 beta
module-whatis Description: cuda toolkit
prepend-path PATH /usr/local/cuda-5.5/bin
prepend-path LD_LIBRARY_PATH /usr/local/cuda-5.5/lib:/opt/cuda-5.5/lib64
-------------------------------------------------------------------
# Datasets and Repositories
## PDB database in pdb format
The PDB database is present at /cbio/jclab/share/pdb
It can be retrieved using the following command:
rsync -rlpt -v -z --delete --port=33444 rsync.wwpdb.org::ftp_data/structures/divided/pdb/ <destdir>
For more information and options (such as other file formats), go to this page on wwPDB
It was last retrieved on 7 Nov 2013.
During non-business hours, Gunnar and John can make emergency interventions by running the following commands on the head node (mskcc-ln1):
* take nodes offline: sudo /opt/torque/bin/pbsnodes -o NodeName
* take nodes out of service and flag them for the sysadmin: sudo /opt/torque/bin/pbsnodes -oN "text for investigation here" NodeName
* purge jobs from the queue: sudo /opt/torque/bin/qdel -p JobID
The purge command will remove all accounting information for that job; tracejob will only show
purging job 9999999.mskcc-fe1 without checking MOM
and no resources used at all.