Getting started with SGE

From Darwin

Cluster? Grid computing? What’s this?

A cluster is simply a collection of machines (the computing nodes) that act more or less like a single one. Jobs are submitted from a special node, usually called the master node, which dispatches and monitors them: it holds each job in a queue until it can be executed, sends it to a computing node, manages it while it runs, and logs a record of its execution when it finishes.

The master node is beagle, and we currently have 128 computing nodes. Each computing node has two dual-core processors, so we have 512 processor cores in total.

OK. What’s the magic word for launching a job?

Your servant is qsub.

[darwin@beagle ~]$ qsub myscript.sh

myscript.sh is a shell script (= a simple text file) that contains one or more commands.

You can’t submit a binary command or a Perl script directly; you must wrap it in a shell script.

Here is a straightforward example: to run a BLAST search, I just need to write this line in a text file that I will call myscript.sh:

# this is a comment
blastall -p blastx -d prot -i myseq.tfa

Then you can submit the job with the command described above.

[darwin@beagle ~]$ qsub myscript.sh
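
The same approach works for a Perl script. Here is a minimal sketch that writes the wrapper and submits it only where qsub is installed. The names analyse.pl, input.txt and run_perl.sh are hypothetical placeholders, and the sketch assumes the job runs in the directory it was submitted from.

```shell
# Write a wrapper script around a (hypothetical) Perl script.
# The quoted 'EOF' keeps the shell from expanding anything inside the heredoc.
cat > run_perl.sh <<'EOF'
#!/bin/sh
# wrapper submitted to SGE; analyse.pl and input.txt are placeholders
perl analyse.pl input.txt
EOF

# Submit only where qsub is installed (for example on beagle).
if command -v qsub >/dev/null 2>&1; then
    qsub run_perl.sh
else
    echo "qsub not found: run_perl.sh written but not submitted"
fi
```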


Can I use my favorite program or Perl module?

Every Perl module installed on beagle can be used on the computing nodes.


Is there a way to test my script before submitting it to the cluster?

The simplest way is to try to run your script interactively on beagle, with a reduced dataset so that it does not run for 34 days. If it is successful, you can safely submit it to the cluster with qsub.
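
One way to build such a reduced dataset, assuming the input is a FASTA file (as the .tfa extension used earlier suggests), is to keep only the first few sequences. The helper name first_records and the file names all.tfa and small.tfa are made up for this sketch.

```shell
# first_records N FILE: print the first N records of a FASTA file.
# A record starts at a line beginning with ">".
first_records() {
    awk -v n="$1" '/^>/ { if (++count > n) exit } { print }' "$2"
}

# Tiny made-up input so the example runs anywhere.
printf '>seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > all.tfa

# Keep the first two sequences for a quick interactive test.
first_records 2 all.tfa > small.tfa
```

You can then point your script at small.tfa, run it interactively, and switch back to the full file before calling qsub.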


Oooops, I launched a program with bad parameters. Can I cancel my job?

Yes. The command is qdel followed by your job ID:

[darwin@beagle ~]$ qdel 149


Can I cancel all my jobs at once?

Yes. The command is qdel -u followed by your user name:

[darwin@beagle ~]$ qdel -u darwin


My program normally outputs to the screen (standard output). Where is the result?

Standard output is redirected to a file in the directory from which you launched your qsub command. The name of the file is the name of the script followed by ".o" and the number of your job.

For example, if my script is myscript.sh and the number of my job is 149 then the output will be in myscript.sh.o149. Standard error is also redirected to myscript.sh.e149.
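
The naming rule can be written as a tiny helper, which is handy when a later script needs to find the output of a job. The function name sge_output_files is made up for this sketch; it only builds the names described above and does not query SGE.

```shell
# sge_output_files SCRIPT JOB_ID: print the default stdout and stderr file names.
sge_output_files() {
    printf '%s.o%s\n%s.e%s\n' "$1" "$2" "$1" "$2"
}

sge_output_files myscript.sh 149
# prints:
# myscript.sh.o149
# myscript.sh.e149
```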


Can I choose the name of the output file?

Sure. Use the -o option followed by the file name.

[darwin@beagle ~]$ qsub -o myres.out toto.sh

It is also possible to merge the standard error and standard output in the same output file, with the -j y option.

[darwin@beagle ~]$ qsub -j y -o myres.out toto.sh
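
SGE can also read these options from the script itself, on lines that start with #$, so you do not have to retype them at each submission. The sketch below writes such a script; the blastall line is simply reused from the earlier example, and it is an assumption that beagle's configuration does not override these options.

```shell
# Write a job script that carries its own qsub options.
# Lines starting with "#$" are read by qsub as if given on the command line.
cat > toto.sh <<'EOF'
#!/bin/sh
#$ -o myres.out
#$ -j y
blastall -p blastx -d prot -i myseq.tfa
EOF

# Submit only where qsub is installed (for example on beagle).
if command -v qsub >/dev/null 2>&1; then
    qsub toto.sh
else
    echo "qsub not found: toto.sh written but not submitted"
fi
```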

How many processors can I use on the system?

You can have up to xx jobs running at the same time (=xx processors). If you submit more, they will be queued until one of your jobs finishes.
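
Because extra jobs simply wait in the queue, you can submit a whole batch in one loop. This is only a sketch: blast_one.sh and the seq*.tfa file names are hypothetical, and the loop is a dry run by default (it prints the qsub commands instead of running them).

```shell
# Hypothetical wrapper taking the sequence file as its first argument.
cat > blast_one.sh <<'EOF'
#!/bin/sh
blastall -p blastx -d prot -i "$1"
EOF

# Dry run by default: print each qsub command instead of running it.
# Set QSUB=qsub in the environment to submit for real.
QSUB=${QSUB:-echo qsub}

# One job per input file; arguments after the script name are passed to it.
for seq in seq1.tfa seq2.tfa seq3.tfa; do
    $QSUB -j y -o "$seq.out" blast_one.sh "$seq"
done
```

Each job gets its own output file, so results from different sequences do not end up mixed together.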

MPI

Can I use SGE to run an MPI job?

Yes, you can. An example SGE job with MPI can be found here.

Status

How can I see if my job is running?

You can see that with the qstat command. It tells you which jobs are running or waiting, since when, and so on. The state column indicates whether your job is waiting (qw), is being transferred (t) or is running (r).

Here we can see with the qstat command that job 29 is waiting to be dispatched on a computing node.

job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID 
---------------------------------------------------------------------------------------------
     29     0 first.csh  darwin        qw    10/21/2007 13:48:34                         

The job is then transferred to the node c2-1.

job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID 
---------------------------------------------------------------------------------------------
     29     0 first.csh  darwin        t     10/21/2007 13:48:47 c2-1.q MASTER         

And finally it is running.

job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID 
---------------------------------------------------------------------------------------------
     29     0 first.csh  darwin        r     10/21/2007 13:48:47 c2-1.q MASTER         

You can check your own jobs with the -u option followed by your user name. For example, if Mrs. Darwin (emma) wants to check her jobs:

[emma@beagle ~]$ qstat -u emma
job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID 
---------------------------------------------------------------------------------------------
 647557     0 all_pairs_ emma        r     11/22/2006 11:08:43 c3-1.q MASTER         
 647560     0 all_pairs_ emma        r     11/22/2006 11:11:00 c3-2.q MASTER         
 653194     0 pairwise_c emma        r     11/23/2006 10:39:51 c3-2.q MASTER         
 653191     0 pairwise_c emma        qw    11/23/2006 10:39:51 c3-3.q MASTER         
 647564     0 all_pairs_ emma        qw    11/22/2006 11:12:30 c4-2.q MASTER         

Remember: jobs with status "r" are running, jobs with status "qw" are waiting to be dispatched.
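
If you have many jobs, a short summary can be easier to read than the full table. The helper below is a sketch (count_states is a made-up name): it assumes the two header lines and the column order shown above, and that job names contain no spaces, so that the state is the fifth field. The sample rows are taken from the listing above.

```shell
# count_states: read qstat output on stdin and count running and waiting jobs.
# Skips the two header lines; the state is taken from the 5th field.
count_states() {
    awk 'NR > 2 && NF >= 5 { n[$5]++ }
         END { printf "running=%d waiting=%d\n", n["r"], n["qw"] }'
}

# Sample input so the example runs without a cluster.
count_states <<'EOF'
job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID
---------------------------------------------------------------------------------------------
 647557     0 all_pairs_ emma        r     11/22/2006 11:08:43 c3-1.q MASTER
 653191     0 pairwise_c emma        qw    11/23/2006 10:39:51 c3-3.q MASTER
 647564     0 all_pairs_ emma        qw    11/22/2006 11:12:30 c4-2.q MASTER
EOF
# prints: running=1 waiting=2
```

On beagle the same function could be fed directly, for instance with qstat -u emma | count_states.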


How can I get an idea of the load of the cluster?

The qhost command shows the current load of the nodes. The qload command reports how many processors are used and free on the cluster:

[darwin@beagle ~]$ qload
Load    52      67%
Free    26      33%