[Bioclusters] Assembly_contd

Fri Jul 7 06:57:17 EDT 2006

Hello,

This is your problem:

On Jul 6, 2006, at 8:46 PM, francois.fauteux2 at mail.mcgill.ca wrote:

> The qstat -f command outputs:
>
> queuename                      qtype used/tot. load_avg arch          states
> ----------------------------------------------------------------------------
> all.q@mac2                     BIP   0/2       -NA-     -NA-          au
> ----------------------------------------------------------------------------
> all.q@mac1                     BIP   0/1       -NA-     -NA-          au
> ----------------------------------------------------------------------------
> all.q@mac3                     BIP   0/2       -NA-     -NA-          au

The reason you can't run jobs is that you have no available job
slots. Most likely you have no job slots because either Grid Engine
is not running on your three systems, or it is running but having
firewall, routing or nameserver issues.

The main indication here is the "au" entry in the states column for
each of your queue instances. State "au" means 'alarm + unreachable'
or 'alarm + unheard', and it means that the SGE qmaster process has
not been receiving periodic state and status reports from the
sge_execd daemons running on the compute nodes.
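
If you want to check for this state from a script, here is a minimal
Python sketch that scans "qstat -f" output for queue instances whose
states column contains "u". It assumes the column layout shown in the
output quoted above and queue instances written as queue@host; the
function name unreachable_queues is my own and not part of SGE, and
other SGE versions may print different columns.

```python
import re

# One queue-instance row, e.g.:
#   all.q@mac2   BIP   0/2   -NA-   -NA-   au
# Assumed layout: queue@host, qtype, slot counts, load_avg, arch, and
# an optional states column (empty when the queue instance is healthy).
ROW = re.compile(
    r"^(?P<queue>\S+@\S+)\s+(?P<qtype>\S+)\s+(?P<slots>[\d/]+)"
    r"\s+(?P<load>\S+)\s+(?P<arch>\S+)(?:\s+(?P<states>\S+))?\s*$"
)


def unreachable_queues(qstat_text):
    """Return the queue instances whose states column contains 'u'."""
    flagged = []
    for line in qstat_text.splitlines():
        match = ROW.match(line.strip())
        # Header and separator lines do not match the row pattern.
        if match and "u" in (match.group("states") or ""):
            flagged.append(match.group("queue"))
    return flagged


sample = """\
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@mac2                     BIP   0/2       -NA-     -NA-          au
----------------------------------------------------------------------------
all.q@mac1                     BIP   0/1       -NA-     -NA-          au
----------------------------------------------------------------------------
all.q@mac3                     BIP   0/2       -NA-     -NA-          au
"""

print(unreachable_queues(sample))
```

On a live cluster you would feed it the text printed by "qstat -f"
instead of the sample string.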

On working clusters this almost always means that SGE is not running
on the cluster node, and the fix is simply to restart SGE on the
nodes in question.

I am not sure about the root cause on your system. Since this is a
new install, it could also be an artifact of a configuration or
installation problem. Typically that would be a firewall blocking
ports that SGE uses, a routing issue or (very, very common) some
sort of hostname or DNS lookup issue.

Hopefully this is just a "sge is not running" issue -- to check this,
log in to one of the compute nodes and run "ps ax | grep sge" -- you
should at least see a "sge_execd" daemon running on each compute
node. If you don't see this, simply run the SGE startup script and
redo the "qstat -f" command. If SGE starts up OK you will see the
"au" status disappear and you will see real numbers instead of
"-NA-".
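
If you would rather script the process check than eyeball it, here is
a rough Python sketch of the same "ps ax | grep sge" test. The
function names execd_running and check_local_node are made up for
illustration, and the sketch only looks for the string "sge_execd" in
the process list; it does not tell you whether the daemon can
actually reach the qmaster.

```python
import subprocess


def execd_running(ps_text):
    """Return True if any line of ps output mentions sge_execd.

    Lines containing 'grep' are skipped so that a 'grep sge_execd'
    process does not count as the daemon itself.
    """
    return any(
        "sge_execd" in line and "grep" not in line
        for line in ps_text.splitlines()
    )


def check_local_node():
    """Run 'ps ax' on this machine and look for sge_execd."""
    result = subprocess.run(
        ["ps", "ax"], capture_output=True, text=True, check=True
    )
    return execd_running(result.stdout)
```

Run check_local_node() on each compute node; a False result means the
node needs the SGE startup script run before "qstat -f" will show
real load numbers.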