[Bioclusters] PBS abnormal after a failed node
zhu at iastate.edu
Mon Jun 9 14:32:09 EDT 2008
We have a situation where the PBS queue hangs after a node failure:
Last week we had a bad node that failed to NFS-mount the shared drives.
After numerous attempts we (with the help of the vendor) determined that it is either a bad motherboard or a bad node disk. While that is being fixed, I tried to make PBS queue jobs without this node, by
(1) > pbsnodes -o node006
which gives the error: Error marking node node006 - Unauthorized Request
(I was logged in as root, via 'su - root')
(2) Deleted the line for the node in:
and restarted PBS:
which appeared to start all right.
Now the problem is that all jobs submitted (with qsub) sit in the queue and are never dispatched to any node. I tried deleting everything in the queue and resubmitting, but the result is the same. Any hints as to what the problem could be?
Thanks in advance,
Zhi-Liang Hu (PhD)
Assistant to NAGRP Bioinformatics Coordinators,
National Animal Genome Research Program,
Department of Animal Science,
Center for Integrated Animal Genomics,
Iowa State University
Tel: 901-759-0643 (H,O) 901-212-2820 (C)
"Not everything that counts can be counted, and
not everything that can be counted counts."
"If you torture the data long enough,
it will confess." -- Ronald Coase