Getting started with SGE
Cluster? Grid computing? What’s this?
It’s simply a collection of several machines (the computing nodes) that act more or less like one. Jobs are submitted from a special node, usually called the master node, which puts them in a queue until they can be executed, sends them from the queue to an execution node, manages them during execution and finally logs the record of their execution when they are finished. In short, the master node dispatches and monitors the different jobs.
The master node is beagle and we have 128 computing nodes at the moment. Each computing node has 2 dual-core processors, so we have 512 processor cores in total.
OK. What’s the magic word for launching a job?
Your servant is qsub.
[darwin@beagle ~]$ qsub myscript.sh
myscript.sh is a shell script (= a simple text file) that contains one or more commands.
You can’t submit a binary command or a Perl script directly; you must wrap it in a shell script.
A straightforward example: suppose I want to run a BLAST search. I just need to write these lines in a text file that I will call myscript.sh:
# this is a comment
blastall -p blastx -d prot -i myseq.tfa
Then you can submit the job with the command described above.
[darwin@beagle ~]$ qsub myscript.sh
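The same wrapping works for a Perl script. The sketch below is only an illustration: make_wrapper, analyse.pl and run_analyse.sh are hypothetical names, not tools installed on beagle. It writes a small shell script around a command so that the result can be handed to qsub.

```shell
#!/bin/sh
# make_wrapper NAME COMMAND [ARGS...]
# Write a small shell script NAME that runs COMMAND with ARGS.
# Hypothetical helper: arguments are joined with spaces, so it is only
# suitable for simple commands without spaces or quotes in their arguments.
make_wrapper() {
    wrapper=$1
    shift
    {
        printf '#!/bin/sh\n'
        printf '# wrapper script for: %s\n' "$1"
        printf '%s\n' "$*"
    } > "$wrapper"
}

# Wrap a (hypothetical) Perl script and show the generated file.
make_wrapper run_analyse.sh perl analyse.pl myseq.tfa
cat run_analyse.sh

# Then, on beagle:
#   qsub run_analyse.sh
```

For a single job it is just as easy to write the wrapper by hand in a text editor; a helper like this mostly pays off when many similar jobs have to be prepared.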
Can I use my favorite program or Perl module?
Every Perl module that is installed on beagle can be used on the computing nodes.
Is there a way to test my script before submitting it to the cluster?
The simplest way is to try to run your script interactively on beagle, with a reduced dataset so that it does not run for 34 days. If it is successful, you can safely submit it to the cluster with qsub.
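One possible way to build such a reduced dataset for a BLAST job is to keep only the first few sequences of the FASTA input. This is a sketch, not an official tool: head_fasta and small.tfa are hypothetical names, and it assumes every record starts with a ">" header line.

```shell
#!/bin/sh
# head_fasta FILE N
# Print the first N records of a FASTA file (a record starts at a ">" line).
# Hypothetical helper for building a small test dataset.
head_fasta() {
    awk -v n="$2" '/^>/ { if (++count > n) exit } { print }' "$1"
}

# Usage on beagle (file names are examples):
#   head_fasta myseq.tfa 10 > small.tfa
#   blastall -p blastx -d prot -i small.tfa
```

If the interactive run on the small file looks right, the full file can go into the script that is submitted with qsub.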
Oooops, I launched a program with bad parameters. Can I cancel my job?
Yes. The command is qdel followed by your job ID:
[darwin@beagle ~]$ qdel 149
Can I cancel all my jobs at once?
Yes. The command is qdel -u followed by your user name:
[darwin@beagle ~]$ qdel -u darwin
My program normally outputs to the screen (standard output). Where is the result?
Standard output is redirected to a file in the directory from which you ran your qsub command. The name of the file is the name of the script followed by ".o" and the number of your job.
For example, if my script is myscript.sh and the number of my job is 149 then the output will be in myscript.sh.o149. Standard error is also redirected to myscript.sh.e149.
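The naming rule can be written down as a small helper. This is only a sketch of the default naming described above; sge_output_files is a hypothetical name, and it assumes no -o or -N option changes the file names.

```shell
#!/bin/sh
# sge_output_files SCRIPT JOBID
# Print the default stdout and stderr file names for a job, following the
# rule above: <script name>.o<job number> and <script name>.e<job number>.
# Hypothetical helper; assumes the default naming is in effect.
sge_output_files() {
    script=$(basename "$1")
    printf '%s.o%s\n%s.e%s\n' "$script" "$2" "$script" "$2"
}

sge_output_files myscript.sh 149
```

For the example above this prints myscript.sh.o149 and myscript.sh.e149, one per line.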
Can I choose the name of the output file?
Sure. You can use the -o option followed by the file name.
[darwin@beagle ~]$ qsub -o myres.out toto.sh
It is also possible to merge the standard error and standard output in the same output file, with the -j y option.
[darwin@beagle ~]$ qsub -j y -o myres.out toto.sh
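These options can also be kept inside the script itself: SGE reads lines that start with #$ as qsub options, so they do not have to be retyped at every submission. The sketch below assumes the default #$ prefix is in use on beagle; the echo line is just a placeholder for your real command. It writes such a version of toto.sh.

```shell
#!/bin/sh
# Write a version of toto.sh that carries its own qsub options.
# Lines starting with "#$" are read by qsub as options; for the shell they
# are ordinary comments, so the script also runs unchanged outside SGE.
cat > toto.sh <<'EOF'
#!/bin/sh
#$ -o myres.out
#$ -j y
# placeholder command: replace with the real work
echo "hello from $(uname -n)"
EOF

# Then, on beagle, a plain submission is enough:
#   qsub toto.sh
```

Options given on the qsub command line can still be used alongside the ones embedded in the script.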
How many processors can I use on the system?
You can have up to xx jobs running at the same time (=xx processors). If you submit more, they will be queued until one of your jobs finishes.
MPI
Can I use SGE to run an MPI job?
Yes, you can. An example SGE job with MPI can be found here.
Status
How can I see if my job is running?
You can see that with the qstat command. It tells you which jobs are running or waiting, since when, and so on. The state column indicates whether your job is waiting (qw), being transferred (t) or running (r).
Here we can see with the qstat command that job 29 is waiting to be dispatched to a computing node.
job-ID  prior  name       user    state  submit/start at      queue   master  ja-task-ID
---------------------------------------------------------------------------------------------
    29      0  first.csh  darwin  qw     10/21/2007 13:48:34
Next, the job is transferred to the node c2-1.
job-ID  prior  name       user    state  submit/start at      queue   master  ja-task-ID
---------------------------------------------------------------------------------------------
    29      0  first.csh  darwin  t      10/21/2007 13:48:47  c2-1.q  MASTER
And finally, it is running.
job-ID  prior  name       user    state  submit/start at      queue   master  ja-task-ID
---------------------------------------------------------------------------------------------
    29      0  first.csh  darwin  r      10/21/2007 13:48:47  c2-1.q  MASTER
You can check your own jobs with the -u option followed by your user name. For example, if Mrs. Darwin (emma) wants to check her jobs:
[emma@beagle ~]$ qstat -u emma
job-ID  prior  name        user  state  submit/start at      queue   master  ja-task-ID
---------------------------------------------------------------------------------------------
647557      0  all_pairs_  emma  r      11/22/2006 11:08:43  c3-1.q  MASTER
647560      0  all_pairs_  emma  r      11/22/2006 11:11:00  c3-2.q  MASTER
653194      0  pairwise_c  emma  r      11/23/2006 10:39:51  c3-2.q  MASTER
653191      0  pairwise_c  emma  qw     11/23/2006 10:39:51  c3-3.q  MASTER
647564      0  all_pairs_  emma  qw     11/22/2006 11:12:30  c4-2.q  MASTER
Remember: jobs with status "r" are running, jobs with status "qw" are waiting to be dispatched.
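When the list gets long, a quick summary per state can help. The sketch below is an assumption-laden illustration: count_states is a hypothetical helper, and it assumes the qstat layout shown above (two header lines, state in the fifth column, no spaces inside the job name).

```shell
#!/bin/sh
# count_states
# Read qstat output on standard input and print how many jobs are in each
# state. Hypothetical helper: skips the two header lines and tallies the
# fifth column (state). The order of the output lines is not guaranteed.
count_states() {
    awk 'NR > 2 { n[$5]++ } END { for (s in n) print s, n[s] }'
}

# Usage on beagle:
#   qstat -u emma | count_states
```

With emma’s listing above, this would report 3 jobs in state r and 2 in state qw.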
How can I get an idea of the load of the cluster?
The qhost command shows the current load of each node. The qload command reports how many processors are in use and how many are free on the cluster:
[darwin@beagle ~]$ qload
Load 52 67%
Free 26 33%
