[Bioclusters] PBS abnormal after a failed node

Zhiliang Hu zhu at iastate.edu
Thu Aug 14 14:39:44 EDT 2008


I am picking this problem up again now that all the hardware problems have been fixed:

Symptom: Submitted jobs either hang in the queue or disappear without 
errors or results. (If I run the same job manually with 'mpirun', it works
fine.)

(1) When I used "qsub -l nodes=7:ppn=2 ...", it complained "Not enough nodes 
available", although we do have 8 nodes including the head, and all are reachable.

(2) When I used "qsub -l nodes=6:ppn=2 ...", the job disappeared quickly 
from the queue but nothing came out of the run (no error, no results).
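
For reference, the two requests differ only in the node count. Below is a minimal sketch of what each "-l nodes=N:ppn=M" string asks for. The helper name `requested_slots` is hypothetical, and it assumes only the simple "N:ppn=M" form used above, not the full PBS node-spec syntax.

```python
def requested_slots(spec):
    """Return (nodes, ppn, total slots) for a simple 'N:ppn=M' node spec.

    Assumption: only the plain 'N:ppn=M' form is handled; node names,
    properties, and '+'-joined specs are out of scope for this sketch.
    """
    nodes = 1
    ppn = 1
    for part in spec.split(":"):
        if part.startswith("ppn="):
            ppn = int(part[len("ppn="):])
        elif part.isdigit():
            nodes = int(part)
    return nodes, ppn, nodes * ppn


# The two requests from cases (1) and (2).
for spec in ("7:ppn=2", "6:ppn=2"):
    nodes, ppn, slots = requested_slots(spec)
    print(f"nodes={spec}: {nodes} nodes x {ppn} procs = {slots} slots")
```

Case (1) therefore needs 7 distinct nodes that the server considers usable, and case (2) needs 6. Whether the head node counts toward that total depends on how the server's node list is set up, which is not shown here.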

I used 'tracejob' to look for problems but didn't get much out of it:

In case (1):
----------------
08/13/2008 13:16:17 S Job Queued at request of zhu at nagrp2.ansci.iastate.edu,
                      owner=zhu at nagrp2.ansci.iastate.edu,
                      job name = BLST255842058.sh, queue=default
08/13/2008 13:16:17 S Job Modified at request of Scheduler at nagrp2.ansci.iastate.edu
08/13/2008 13:16:17 L Not enough nodes available
08/13/2008 13:16:17 S enqueuing into default, state 1 hop 1
08/13/2008 13:16:17 A queue=default

In case (2):
----------------
08/13/2008 13:17:52 S enqueuing into default, state 1 hop 1
08/13/2008 13:17:52 S Job Queued at request of zhu at nagrp2.ansci.iastate.edu,
                      owner=zhu at nagrp2.ansci.iastate.edu,
                      job name = BLST255842058.sh, queue=default
08/13/2008 13:17:52 S Job Modified at request of Scheduler at nagrp2.ansci.iastate.edu
08/13/2008 13:17:52 L Job Run
08/13/2008 13:17:52 S Job Run at request of Scheduler at nagrp2.ansci.iastate.edu
08/13/2008 13:17:52 A queue=default
08/13/2008 13:17:52 A user=zhu group=zhu jobname=BLST255842058.sh queue=default
                      ctime=1218651472 qtime=1218651472 etime=1218651472 
                      start=1218651472 exec_host=node001/1+node001/0 
                      Resource_List.neednodes=6:ppn=2
                      Resource_List.nodect=6 Resource_List.nodes=6:ppn=2 
08/13/2008 13:17:53 S Exit_status=1 resources_used.cput=00:00:00 
                      resources_used.mem=0kb
                      resources_used.vmem=73644kb resources_used.walltime=00:00:01
08/13/2008 13:17:53 A user=zhu group=zhu jobname=BLST255842058.sh queue=default
                      ctime=1218651472 qtime=1218651472 etime=1218651472 
                      start=1218651472 exec_host=node001/1+node001/0 
                      Resource_List.neednodes=6:ppn=2
                      Resource_List.nodect=6 Resource_List.nodes=6:ppn=2 
                      session=21827
                      end=1218651473 Exit_status=1 resources_used.cput=00:00:00
                      resources_used.mem=0kb resources_used.vmem=73644kb
                      resources_used.walltime=00:00:01
------
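
One way to read the case (2) accounting record is to compare what was requested with what exec_host reports. The following is a minimal sketch, not part of tracejob itself: the helper names `parse_fields` and `summarize` are hypothetical, and the record string is copied by hand from the output above.

```python
# Fields copied from the case (2) accounting record above.
record = (
    "exec_host=node001/1+node001/0 "
    "Resource_List.neednodes=6:ppn=2 "
    "Resource_List.nodect=6 Resource_List.nodes=6:ppn=2 "
    "Exit_status=1 resources_used.walltime=00:00:01"
)


def parse_fields(text):
    """Split whitespace-separated key=value tokens into a dict."""
    fields = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not key=value pairs
            fields[key] = value
    return fields


def summarize(fields):
    """Compare the requested node count with the slots listed in exec_host."""
    slots = fields["exec_host"].split("+")          # e.g. 'node001/1'
    hosts = sorted({slot.split("/")[0] for slot in slots})
    return {
        "hosts": hosts,
        "slots_assigned": len(slots),
        "nodes_requested": int(fields["Resource_List.nodect"]),
        "exit_status": int(fields["Exit_status"]),
    }


summary = summarize(parse_fields(record))
print(summary)
```

As printed, the record lists two slots, both on node001, for a request of 6 nodes, and the job exited with status 1 after one second of walltime. That mismatch may be worth checking, though it does not by itself identify the cause.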
Does the output say anything relevant to the problem that I am missing?
Or is there something more I should try?

Thanks!
Zhiliang



