[Bioclusters] PBS abnormal after a failed node

Greenseid, Joseph M. Joseph.Greenseid at ngc.com
Tue Jun 10 09:57:37 EDT 2008

did you try to mark the node offline in qmgr (qmgr -c "set node node006 state=offline")?  that's how i mark my nodes offline if there are problems.
after you deleted the node from the nodes file, does pbsnodes still list it?  if so, torque may have the node's name stored somewhere else that you missed.
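
a rough sketch of that second check, in case it helps. it assumes the Torque client tools are on the PATH and that `pbsnodes -a` prints each node name on its own unindented line with the attributes indented below it. the helper name node_listed is just made up for illustration, it is not part of torque.

```shell
#!/bin/sh
# Hypothetical helper (not a Torque command): reads `pbsnodes -a` style
# output on stdin and succeeds if the given name appears as a whole,
# unindented line, which is how node names are assumed to be printed.
node_listed() {
    grep -qxF "$1"
}

NODE=node006   # the failed node from the original post

# Only query the server when the Torque client tools are installed.
if command -v pbsnodes >/dev/null 2>&1; then
    if pbsnodes -a | node_listed "$NODE"; then
        echo "$NODE is still known to pbs_server"
    else
        echo "$NODE is no longer listed"
    fi
fi
```

if it reports the node as still known, the server is getting the name from somewhere other than the line you removed.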


From: bioclusters-bounces at bioinformatics.org on behalf of Zhiliang Hu
Sent: Mon 6/9/2008 2:32 PM
To: HPC in Bioinformatics
Subject: [Bioclusters] PBS abnormal after a failed node

We have a situation where the PBS queue hangs after a node failure:

Last week we had a bad node that failed to NFS-mount the shared drives.
After numerous efforts we (with the help of the vendor) determined that it's either a bad motherboard or a bad node disk.  While that's being fixed, I tried to let PBS jobs queue without this node, by
(1) > pbsnodes -o node006
which gives the error: Error marking node node006 - Unauthorized Request
(I was logged in as root, via 'su - root')

(2) Deleted the line for the node in:
and restarted PBS:
  /etc/init.d/pbs stop
  /etc/init.d/pbs start
which appeared to start all right.

Now the problem is that all jobs queued (by qsub) just sit in the queue and are never dispatched to any node. I tried deleting all the queued jobs and resubmitting them, but the result is the same.  Any hint as to what the problem could be?

Thanks in advance,


Zhi-Liang Hu (PhD)
Associate Scientist,
Assistant to NAGRP Bioinformatics Coordinators,
National Animal Genome Research Program,
Department of Animal Science,
Center for Integrated Animal Genomics,
Iowa State University
Tel: 901-759-0643 (H,O) 901-212-2820 (C)
Web: http://www.animalgenome.org

"Not everything that counts can be counted, and
    not everything that can be counted counts."

"If you torture the data long enough,
it will confess."  -- Ronald Coase

