RCC: Research Computing Center
 
 

Linux Cluster (rcluster)

Running Jobs on the rcluster

Using the Batch Queues

Jobs that run for more than ten (10) minutes must be submitted to the queues rather than run in the background or interactively on the login node rcluster.rcc.uga.edu. Background jobs and interactive commands, including cron jobs, at jobs, & and nohup processes, as well as commands entered at the keyboard, will be terminated after 10 minutes of CPU time. Graphical front ends to programs, programming tools, etc. will not be terminated.

The queueing system being used on the rcluster is Platform LSF.


Batch Queues on the rcluster

The rcluster machines have 2 CPUs each. Some of them have "dual-core" (as opposed to "single-core") CPUs, which means they behave as though they had 4 CPUs rather than 2.

Queue names beginning with "s" followed by a number submit jobs specifically to single-core machines. Queue names beginning with "d" submit jobs specifically to dual-core machines. Queues whose names start with "r" don't care which type of machine they send jobs to. Queues defined specifically for the IOB begin with "iob" and queues defined specifically for the Statistics Department begin with "stat".

NOTES:
  • Multi-thread jobs submitted to single-core machines (that is, queue names starting with "s" or "iob-s") can have up to 2 threads, and those submitted to dual-core machines (that is, queue names starting with "d" or "stat-d") can have up to 4 threads.
  • A job might have slightly different performance on single-core and on dual-core processors. Therefore, for better load balance, we recommend that parallel MPI jobs be sent to the queues that target specifically either single-core machines or dual-core machines, and not to the queues whose names start with "r".
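The thread limits in the notes above can be written down as a small lookup. This is only a sketch: the function name max_threads is hypothetical, and the value returned for "r" queues is an assumption (the lower of the two limits, since such a job may land on either kind of machine), not a rule stated on this page.

```shell
#!/bin/sh
# Hypothetical helper: largest thread count for a multi-thread job,
# based on the machine type encoded in the queue name.
max_threads() (
    name=$1
    # Drop the group prefixes used by the IOB and Statistics queues.
    name=${name#iob-}
    name=${name#stat-}
    case $name in
        s[0-9]*) echo 2 ;;   # single-core machines: 2 CPUs
        d[0-9]*) echo 4 ;;   # dual-core machines: behave like 4 CPUs
        r[0-9]*) echo 2 ;;   # ASSUMPTION: either type, so use the lower limit
        *) echo "unrecognized queue name: $1" >&2; exit 1 ;;
    esac
)

max_threads s4-24h
max_threads stat-d16-10d
```

The function body runs in a subshell (parentheses instead of braces) so that its working variables do not leak into the calling script.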

The batch queues can be used for serial jobs (that is, jobs that require only one processor) and for parallel jobs. The form of a queue name indicates the maximum number of processors and the run time limit. For example, the queue r4-24h has a limit of 4 processors and 24 hours of run time per processor. To submit a job, first determine how many processors and how much run time it needs. These requirements determine which queue to use.

Here are more examples of queue names:

r1-24h         One processor, maximum run time of 24h, sends job to either single-core or dual-core machines
r1-96h         One processor, maximum run time of 96h, sends job to either single-core or dual-core machines
r1-10d         One processor, maximum run time of 10 days, sends job to either single-core or dual-core machines
r4-24h         Four processors, maximum run time of 24h per processor, sends job to either single-core or dual-core machines
s4-24h         Four processors, maximum run time of 24h per processor, sends job to single-core machines
d4-24h         Four processors, maximum run time of 24h per processor, sends job to dual-core machines
iob-s16-10d    Up to 16 processors, maximum run time of 10 days, sends job to single-core machines. For IOB's associate members only.
iob-s32-10d    Up to 32 processors, maximum run time of 10 days, sends job to single-core machines. For IOB's full members only.
stat-d16-10d   Up to 16 processors, maximum run time of 10 days, sends job to dual-core machines. For Statistics Dept. members only.
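The naming scheme described above can be decoded mechanically. The following sketch splits a queue name into its parts. The function name describe_queue and its output format are made up for illustration, and the pattern check is deliberately loose, so the authoritative list of queues is still the one reported by the cluster itself.

```shell
#!/bin/sh
# Hypothetical helper: decode a queue name such as r4-24h or iob-s16-10d.
describe_queue() (
    name=$1
    group=general
    case $name in
        iob-*)  group=iob;  name=${name#iob-} ;;
        stat-*) group=stat; name=${name#stat-} ;;
    esac
    # Loose shape check: letter, digits, dash, digits, h (hours) or d (days).
    case $name in
        r[0-9]*-[0-9]*[hd]) machines=any ;;
        s[0-9]*-[0-9]*[hd]) machines=single-core ;;
        d[0-9]*-[0-9]*[hd]) machines=dual-core ;;
        *) echo "unrecognized queue name: $1" >&2; exit 1 ;;
    esac
    rest=${name#?}       # drop the machine letter, e.g. 4-24h
    procs=${rest%%-*}    # text before the dash, e.g. 4
    limit=${rest#*-}     # text after the dash, e.g. 24h
    echo "group=$group machines=$machines procs=$procs limit=$limit"
)

describe_queue r4-24h
describe_queue iob-s16-10d
```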

For a list of all valid queue names, please use the command queuenames from an rcluster shell prompt.

We recommend that users checkpoint their codes whenever possible to avoid losing valuable compute time if the system goes down before a job is completed. A long job that can be checkpointed can be run as a sequence of shorter jobs, which can be automatically submitted to the queue as described below in the Runchaining Jobs section. If you cannot fit your job within the established processor and runtime limits, please let us know.  


LSF Usage Information

These are the common LSF commands:

bsub      Submit a job to the queue
bkill     Cancel a queued or running job
bhold     Place a queued job on hold
bjobs     Check the status of queued and running jobs
bqueues   List all valid queue names
 


Submitting a Batch Job to the Queue

The preferred way to submit a batch job to the queue is to use the bsub command to submit a job submission shell script. The syntax of the bsub command is:

bsub -n nprocs -q queuename -o stdout -e stderr ./shellscriptname
where
nprocs   is the number of processors (not required for serial jobs)
queuename   is the name of the batch queue
stdout   is the name of the file where the standard output is stored
stderr   is the name of the file where the standard error is stored
shellscriptname   is the name of the job submission shell script file

Examples:

1. To submit a serial job with script sub.sh to the r1-24h batch queue and have the standard output and error go to test.jobid.out and test.jobid.err, respectively, use

bsub -q r1-24h -o test.%J.out -e test.%J.err ./sub.sh

2. To submit a 4-processor parallel job with script subp.sh to the r4-24h batch queue and have the standard output and error go to test.jobid.out and test.jobid.err, respectively, use

bsub -n 4 -q r4-24h -o test.%J.out -e test.%J.err ./subp.sh

IMPORTANT NOTES:

  • The special character %J in the stdout and stderr files is replaced by the jobid number. If the files test.jobid.out and test.jobid.err do not already exist in your working directory, they will be created when the job exits the queue; otherwise the standard output and error from the job will be appended to these files when the job exits the queue.
  • If you do not specify the standard output and error files (that is, if you omit -o test.%J.out -e test.%J.err in the submission command), the standard output and error of the job will be sent to you by email.
  • The path to the shell script must be given explicitly. If the script is in the current working directory, you need to add a ./ before the script name, as shown in the example above (./sub.sh).
  • The shell script sub.sh has to be executable by the user. To set the execute permission, use the following command at your rcluster prompt:

    chmod u+x sub.sh
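The last two notes (explicit path and execute permission) can be checked before anything is submitted. The sketch below prints the bsub command rather than running it, so it can be tried anywhere. The function name make_bsub_cmd is hypothetical, and the test.%J.out and test.%J.err file names are simply the ones used in the examples above.

```shell
#!/bin/sh
# Hypothetical helper: print the bsub command for a job script after
# checking that its path is explicit and that it is executable.
# $1 = queue name, $2 = number of processors, $3 = script
make_bsub_cmd() (
    queue=$1
    nprocs=$2
    script=$3
    case $script in
        */*) ;;                    # path already given explicitly
        *)   script=./$script ;;   # assume the current working directory
    esac
    if [ ! -x "$script" ]; then
        echo "not executable: $script (try chmod u+x)" >&2
        exit 1
    fi
    # %J is left as is; LSF replaces it with the job ID.
    if [ "$nprocs" -gt 1 ]; then
        echo "bsub -n $nprocs -q $queue -o test.%J.out -e test.%J.err $script"
    else
        echo "bsub -q $queue -o test.%J.out -e test.%J.err $script"
    fi
)
```

For instance, make_bsub_cmd r4-24h 4 subp.sh prints the command from example 2 if ./subp.sh exists and is executable, and an error message otherwise.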

    Examples of job submission shell scripts (sub.sh):

    In the examples below, the executable is called myprog and reads its input parameters from standard input. The input parameters are in a file called myin and the output data will be stored in a file called myout. The working_directory is the path to your working directory (e.g., it could be /home/labname/username/subdir or /scratch/username/subdir).

    To run a serial job:

    #!/bin/csh
    cd working_directory
    time ./myprog < myin > myout

    To run a parallel MPI job using 4 processors (csh shell):

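    #!/bin/csh
    cd working_directory
    # LSB_HOSTS is set by LSF: one host name per assigned processor
    echo $LSB_HOSTS
    # Write the host names to a machine file, one per line
    cat /dev/null > mlist.$$
    foreach variable ($LSB_HOSTS)
        echo $variable >> mlist.$$
    end
    mpirun -np 4 -machinefile mlist.$$ ./myprog < myin > myout
    # Remove the machine file once the job is done
    rm -f mlist.$$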

    To run a parallel MPI job using 4 processors (bash shell):

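    #!/bin/bash
    cd working_directory
    # LSB_HOSTS is set by LSF: one host name per assigned processor
    echo $LSB_HOSTS
    # Write the host names to a machine file, one per line
    cat /dev/null > mlist.$$
    for variable in $LSB_HOSTS; do
        echo $variable >> mlist.$$
    done
    mpirun -np 4 -machinefile mlist.$$ ./myprog < myin > myout
    # Remove the machine file once the job is done
    rm -f mlist.$$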

    To run a parallel OpenMP job using 2 threads:

    #!/bin/csh
    cd working_directory
    setenv OMP_NUM_THREADS 2
    ./myprog < myin > myout

    NOTE: Do NOT put the job into the background with a '&' in the shell script. This will confuse the queueing system.

    The file myin in the examples above is only necessary if your program requires standard input data, and the file myout is only necessary if you want the standard output data (if any) to be stored in a separate file instead of the standard output file of the batch job (test.jobid.out in the example above). If your program does not require one or both of these files, remove the corresponding redirection symbols ( < and/or > ) in the last line of the scripts above.

    MORE IMPORTANT NOTES:

    1. MPI jobs executed with mpirun have to use the -machinefile option as shown in the examples above; otherwise your MPI job will not use the processors assigned to it by the queueing system. Using the script above for MPI jobs, a file called mlist.xxxxx containing a list of processors assigned to your job will be generated when your job starts running, and it will be deleted when your job is done. The processors used for your job will be listed in the stdout.

    2. When running threaded applications, please add the bsub option -R "span[hosts=1]" to ensure that all processors assigned to your job (up to 4 when running on dual-core machines and up to 2 when running on single-core ones) are on the same machine. Without this bsub option, LSF might assign processors on different machines to your job.
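Both notes rely on the LSB_HOSTS variable that LSF sets inside a running job. The sketch below can be tried outside the queue: it uses a made-up LSB_HOSTS value when the variable is not set, and the function names write_machinefile and all_on_one_host are hypothetical. The second function only checks the condition that -R "span[hosts=1]" requests; it does not replace that option.

```shell
#!/bin/sh
# LSB_HOSTS is set by LSF inside a running job. The value below is a
# made-up stand-in, used only when the variable is not already set.
: "${LSB_HOSTS:=compute-0-1 compute-0-1 compute-0-2 compute-0-2}"

# Write one host name per line to the file named in $1.
write_machinefile() {
    : > "$1"
    for host in $LSB_HOSTS; do
        echo "$host" >> "$1"
    done
}

# Succeed only if every assigned processor is on the same machine.
all_on_one_host() {
    first=
    for host in $LSB_HOSTS; do
        if [ -z "$first" ]; then
            first=$host
        fi
        if [ "$host" != "$first" ]; then
            return 1
        fi
    done
    return 0
}

mfile=${TMPDIR:-/tmp}/mlist.$$
write_machinefile "$mfile"
echo "machine file has $(grep -c . "$mfile") entries"
if all_on_one_host; then
    echo "all processors are on one machine"
else
    echo "processors span several machines"
fi
rm -f "$mfile"
```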


    Checking the Status of Jobs

    Use the bjobs command to check the status of jobs:

    bjobs [-u username] [-l] [jobid]

    where username is the user whose jobs you want to check and jobid is the JOBID of a specific job. The -l option gives long output, with detailed information about the job(s).

    For example:

    bjobs -u all shows all the jobs in the pool
    bjobs -u johndoe shows all jobs for user johndoe
    bjobs -l 10407 gives detailed information about the job with JOBID 10407

    Files Created at Job Start

    If you submit your job with the -o mystdout -e mystderr options, then the files mystdout and mystderr will be created when your job starts running, unless they already exist. In the latter case, the stdout and stderr of the job will be appended to the corresponding files. If you would like to have the jobid number incorporated into the stdout and stderr file names, use the special character %J in these file names.
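To see which file names result from a given -o or -e pattern, the %J substitution can be imitated by hand. This is only an illustration: LSF performs the substitution itself, expand_jobid is a hypothetical name, and 10407 is the sample JOBID used earlier on this page.

```shell
#!/bin/sh
# Imitates, for illustration only, how %J in a file name is replaced
# by the job ID. $1 = file name pattern, $2 = job ID
expand_jobid() {
    printf '%s\n' "$1" | sed "s/%J/$2/g"
}

expand_jobid test.%J.out 10407
expand_jobid test.%J.err 10407
```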

    If the -o and -e options are not specified at job submission, the stdout and stderr of the job will be sent by email to your rcluster account, and rcluster will automatically forward the message to the email address that you listed when you requested your rcluster account (for example, your ugamail or departmental account). The sender of the email is LSF. You might want to check whether your email server flags such messages as spam and filters them out. To ensure that this does not happen, you might want to whitelist messages sent by LSF.


    Canceling/Removing a Job

    Use the bkill command to cancel/remove a job from the job pool:

    bkill [-u username] jobid [jobid]

    For example:

    bkill 10408 cancels your job with JOBID 10408
    bkill 10408 10409 cancels your jobs with JOBIDs 10408 and 10409
    bkill -u your_user_id cancels all jobs you have in the queue

    Receiving an Email when Job Terminates

    When you submit a batch job with bsub without the -o and -e options, you will receive the standard output and standard error of the job by email when the job terminates (whether it completes successfully or not). To receive an email when the job terminates while still writing the output to files, add the bsub option -N. The standard output of the LSF job (not of the application) is then sent to you by email, and the standard output of the application and the standard error of the job are written to the files specified by the -o and -e options, respectively. For example:

    bsub -n 4 -q r4-24h -o out.%J -e err.%J -N ./sub.sh

    This 4-processor job on the r4-24h queue writes the standard output of the application to the file out.jobid and the standard error of the job to the file err.jobid, and sends the standard output of the batch job (exit code, CPU time used, node used, etc.) to the user's preferred email address.


    Runchaining Jobs

    A common need is to run the same job over and over. For instance, when a program has to perform a large number of iterations, each job runs a fixed number of them and then writes to a data file the information needed to restart where it left off. When the job is restarted, it reads the restart information and continues from that point.

    To have one job automatically submit the next one once it finishes, you can add the following lines at the end of your job submission script:

    bsub -n nprocs -q queuename -o stdout -e stderr ./next_script_name
    exit


    Example: sub1.sh

    In the examples below we assume that the executable myprog does not require any standard input. The working directory is assumed to be /home/labname/username/subdirectory.

    Serial job using csh (tcsh):
    #!/bin/csh
    cd /home/labname/username/subdirectory
    time ./myprog
    bsub -q r1-24h -o sub.%J.out -e sub.%J.err ./sub2.sh
    exit

    Parallel job using csh (tcsh) :
    #!/bin/csh
    cd /home/labname/username/subdirectory
    echo $LSB_HOSTS
    cat /dev/null > mlist.$$
    foreach variable ($LSB_HOSTS)
    echo $variable >> mlist.$$
    end
    mpirun -np 4 -machinefile mlist.$$ ./myprog
    rm -f mlist.$$
    bsub -n 4 -q r4-24h -o sub.%J.out -e sub.%J.err ./sub2.sh
    exit

     

    First the script sub1.sh is submitted to the queue. Once it finishes running, it automatically submits script sub2.sh to the queue. This script can in turn submit sub3.sh to the queue when it completes, and so on. For this procedure, the user can prepare a sequence of scripts, which will then be submitted one at a time to the queue and run in sequence. Alternatively, the script sub1.sh can resubmit itself back to the queue once it finishes running. This would create an "infinite loop", a situation that is not recommended. To break the infinite loop, the user can set some termination rules for the job resubmission process.
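When the chain is a numbered sequence of scripts (sub1.sh, sub2.sh, and so on), the name of the next script can be derived from the current one instead of being typed into each file. The sketch below assumes that naming pattern. The function name next_script and the idea of passing the number of the last script as an argument are illustrative choices, not part of LSF.

```shell
#!/bin/sh
# Hypothetical helper: print the script that follows $1 in the chain
# sub1.sh, sub2.sh, ..., or fail once script number $2 has been reached.
next_script() (
    base=${1##*/}    # strip any directory part, e.g. ./sub1.sh -> sub1.sh
    n=${base#sub}
    n=${n%.sh}       # sub1.sh -> 1
    case $n in
        ''|*[!0-9]*) echo "unexpected script name: $1" >&2; exit 1 ;;
    esac
    if [ "$n" -ge "$2" ]; then
        exit 1       # last script in the chain: nothing to submit
    fi
    echo "./sub$((n + 1)).sh"
)

next_script ./sub1.sh 3
next_script ./sub2.sh 3
```

Inside a job script, the printed name would be passed to bsub only when the function succeeds, so the chain stops by itself after the last script.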

    Example of a termination rule:

    One way to break out of an infinite job resubmission loop is to have the code generate a file when the program finally "converges" (or when it completes a predetermined number of steps, for example). Let us call this file finalresults.txt. The job submission script sub.sh checks whether the file finalresults.txt exists. If it does not, then the script sub.sh is submitted to the queue again, otherwise the script simply exits and the resubmission chain is terminated. A simple script sub.sh that accomplishes this is the following:

    Serial job using csh (tcsh):

    #!/bin/csh
    cd /home/labname/username/subdirectory
    time ./myprogram
    if ( ! -e finalresults.txt ) then
    bsub -q r1-24h -o mystdout -e mystderr ./sub.sh
    endif
    exit

    Serial job using ksh (bash):

    #!/bin/ksh
    cd /home/labname/username/subdirectory
    time ./myprogram
    if [ ! -f finalresults.txt ]
    then
    bsub -q r1-24h -o mystdout -e mystderr ./sub.sh
    fi
    exit
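The termination rule can also combine both conditions mentioned above: stop when finalresults.txt exists, or when a predetermined number of runs has completed. The sketch below keeps the run count in a file. The function name should_resubmit and the counter file chain.count are hypothetical, and the function is meant to be called from the working directory after the program has finished.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the job should be submitted again.
# $1 = maximum number of runs in the chain
should_resubmit() (
    max_runs=$1
    if [ -e finalresults.txt ]; then
        exit 1                        # program converged: stop the chain
    fi
    count=0
    if [ -f chain.count ]; then
        count=$(cat chain.count)      # runs completed so far
    fi
    count=$((count + 1))              # count the run that just finished
    echo "$count" > chain.count
    [ "$count" -lt "$max_runs" ]      # stop once max_runs runs are done
)
```

In sub.sh, the bsub line would then be wrapped in "if should_resubmit 10; then ... fi" (10 being an arbitrary cap), so that a program that never converges cannot resubmit itself indefinitely.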

    Running an Interactive Job

    We have set aside one dual-core, dual-processor node (4 CPUs) called inter1 for interactive jobs. This node is not part of the queueing system. To access this node, first log in to rcluster.rcc.uga.edu and from there use ssh to connect to inter1:

    rcluster>  ssh inter1

    Your prompt on inter1 will not read inter1; it will read, for example, compute-2-12 or a similar name.

    A single processor executable (a.out) can be run on inter1 as follows:

    compute-2-12>  ./a.out

    Or run the code using 'nohup' in order to be able to log out without interrupting the running job:

    compute-2-12> nohup ./a.out &

    To run a parallel MPI job interactively, first you need to create a file (for example, call it host.list) with the word 'inter1' (without the quotes) in it, repeated 4 times in a column. That is, the contents of host.list will be

    inter1
    inter1
    inter1
    inter1

    Put this file (host.list) in your working directory and then run the MPI program a.out as follows (e.g. using 4 processors):

    compute-2-12>  mpirun -np 4 -machinefile host.list ./a.out

    or, using 'nohup' in order to be able to log out without interrupting the running job:

    compute-2-12>  nohup mpirun -np 4 -machinefile host.list ./a.out &

    Because this node has a total of 4 CPUs, users should not run parallel jobs that use more than 4 processors or threads.
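The host.list file can be generated instead of typed by hand. The sketch below writes the word inter1 once per requested processor and refuses requests above the 4 CPUs of the node. The function name make_host_list is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: write a machine file for the interactive node.
# $1 = number of processors, $2 = output file (default: host.list)
make_host_list() (
    nprocs=$1
    out=${2:-host.list}
    # inter1 has 4 CPUs in total, so refuse larger requests.
    if [ "$nprocs" -lt 1 ] || [ "$nprocs" -gt 4 ]; then
        echo "inter1 has 4 CPUs; request 1 to 4 processors" >&2
        exit 1
    fi
    : > "$out"
    i=0
    while [ "$i" -lt "$nprocs" ]; do
        echo inter1 >> "$out"
        i=$((i + 1))
    done
)
```

Running make_host_list 4 in the working directory produces the four-line host.list shown above, ready to be passed to mpirun with -machinefile.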

    This node should only be used for short jobs (for example, for debugging purposes) and for those that cannot be run on the batch queueing system (for example, if the job requires an X windows front-end). The load of the node can be monitored using top or w.

     

     