[Bioclusters] PBS abnormal after a failed node
Greenseid, Joseph M.
Joseph.Greenseid at ngc.com
Tue Jun 10 09:57:37 EDT 2008
did you try to mark the node offline in qmgr (qmgr -c "set node node006 state=offline")? that's how i mark my nodes offline if there are problems.
after you deleted the node from the nodes file, does pbsnodes still list it? if so, torque may have the node's name stored somewhere else that you missed.
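One way to answer the second question is to look for the node name in the output of `pbsnodes -a`. The sketch below assumes the usual TORQUE layout, where each node name sits alone on an unindented line followed by indented attribute lines. The helper name `node_listed` is made up for illustration.

```shell
# Hypothetical helper: succeeds if the given node name appears as a node
# entry in `pbsnodes -a`-style output read from stdin.
# Assumption: node names are on their own unindented lines.
node_listed() {
    # -x matches the whole line only, -q suppresses output
    grep -qx -- "$1"
}

# Usage on a live server (not run here):
#   pbsnodes -a | node_listed node006 && echo "pbs_server still knows node006"
```

If the node still shows up after it was removed from the nodes file, the server is getting the name from somewhere else.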
From: bioclusters-bounces at bioinformatics.org on behalf of Zhiliang Hu
Sent: Mon 6/9/2008 2:32 PM
To: HPC in Bioinformatics
Subject: [Bioclusters] PBS abnormal after a failed node
We have a situation where the PBS queue hangs after a node failure:
Last week we had a bad node that failed to NFS-mount the shared drives.
After numerous efforts we (with the help of the vendor) determined that it's either a bad motherboard or a bad node disk. While that's being fixed, I tried to make the PBS jobs queue without this node, by
(1) > pbsnodes -o node006
which gives error: Error marking node node006 - Unauthorized Request
(I was logged in as root, via 'su - root')
(2) Deleted the line for the node in:
and restarted PBS:
which appeared to start all right.
Now the problem is that all jobs queued (by qsub) hang there without being dispatched to any node. I tried deleting all queued jobs and resubmitting them, but the results are the same. Any hint as to what the problem could be?
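When jobs sit in the queue like this, a first check is what state the server reports for each remaining node. Nothing in this message establishes that node state is the cause here. The sketch below summarizes `pbsnodes -a`-style output as one "node state" pair per line. It assumes the usual layout of an unindented node name followed by indented `state = ...` lines, and `node_states` is a hypothetical helper name.

```shell
# Hypothetical helper: print "<node> <state>" for each node found in
# `pbsnodes -a`-style output read from stdin.
# Assumption: node name on an unindented line, attributes indented below it.
node_states() {
    awk '
        /^[^ \t]/ { node = $1 }                            # unindented line: remember node name
        $1 == "state" && $2 == "=" { print node, $3 }      # indented "state = ..." line
    '
}

# Usage on a live server (not run here):
#   pbsnodes -a | node_states
```

If no node is reported as free, there is nowhere for queued jobs to run.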
Thanks in advance,
Zhi-Liang Hu (PhD)
Assistant to NAGRP Bioinformatics Coordinators,
National Animal Genome Research Program,
Department of Animal Science,
Center for Integrated Animal Genomics,
Iowa State University
Tel: 901-759-0643 (H,O) 901-212-2820 (C)
Web: http://www.animalgenome.org
"Not everything that counts can be counted, and
not everything that can be counted counts."
"If you torture the data long enough,
it will confess." -- Ronald Coase