[Bioclusters] PBS abnormal after a failed node

Zhiliang Hu zhu at iastate.edu
Thu Jun 12 08:27:07 EDT 2008


Thanks Joe,

I did it as 'root':

[root@cluster ~]# qmgr -c "set node node006 state=offline"
qmgr obj=node006 svr=default: Unauthorized Request 

[root@cluster ~]# qmgr
Max open servers: 4
Qmgr: set node node006 state=offline
qmgr obj=node006 svr=default: Unauthorized Request

Any idea what causes this error?
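
One possible angle on "Unauthorized Request", offered as an assumption rather than a diagnosis: pbs_server usually accepts manager-level requests (such as changing a node's state) only from users it recognises as managers or operators, and it identifies them as user@host. If the host name the server sees for root is not the one it expects, root may be refused as well. The sketch below narrows the output of qmgr -c "print server" to the ACL-related lines so they can be compared with the host name in use. The helper name acl_lines is made up for this illustration.

```shell
#!/bin/sh
# acl_lines: read "qmgr -c 'print server'" output on stdin and keep only
# the lines that set managers, operators, or host ACLs.
# (Hypothetical helper; it only filters text and changes nothing.)
acl_lines() {
    grep -E '^set server (managers|operators|acl_host)'
}

# Usage on the server host (needs a working TORQUE install):
#   qmgr -c "print server" | acl_lines
```

If the lines shown name a different host than the one root is working from, that mismatch would be the first thing to look at.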

Also, after I removed the node from /var/spool/torque/server_priv/nodes and
restarted PBS, 'pbsnodes' no longer lists it.  However, the queued jobs still don't get dispatched to any node.  I think we have a bigger problem ... will update later.
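
For reference, the hand edit of the nodes file can be sketched as a small script that keeps a backup. This is only an illustration: the function name remove_node is made up, and it assumes the usual nodes-file layout in which each line starts with the node name. It is presumably safest to run it while pbs_server is stopped.

```shell
#!/bin/sh
# remove_node FILE NODE
# Drop NODE's line from a TORQUE-style nodes file and keep the previous
# contents in FILE.bak. (Hypothetical helper, not part of TORQUE.)
remove_node() {
    file=$1
    node=$2
    # Keep a copy of the original so the edit can be undone.
    cp -p "$file" "$file.bak" || return 1
    # Compare the first field as a string, so that removing node006
    # does not also remove a node named node0061.
    awk -v n="$node" '($1 "") != (n "")' "$file.bak" > "$file"
}

# Usage (as root, with pbs_server stopped):
#   remove_node /var/spool/torque/server_priv/nodes node006
```

Restoring the node later is then a matter of copying the .bak file back and restarting PBS.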

Thanks!
Zhiliang


At 08:57 AM 6/10/2008 -0500, Greenseid, Joseph M. wrote:
>did you try to mark the node offline in qmgr (qmgr -c "set node node006 state=offline")?  that's how i mark my nodes offline if there are problems.
> 
>after you deleted the node from the nodes file, does pbsnodes still list it?  if so, torque may have the node's name stored somewhere else that you missed.
> 
>--Joe
>
>________________________________
>
>From: bioclusters-bounces at bioinformatics.org on behalf of Zhiliang Hu
>Sent: Mon 6/9/2008 2:32 PM
>To: HPC in Bioinformatics
>Subject: [Bioclusters] PBS abnormal after a failed node
>
>
>
>We have a situation where the PBS queue hangs after a node failure:
>
>Last week we had a bad node which failed to NFS-mount the shared drives.
>After numerous efforts we (with the help of the vendor) determined that it's either a bad motherboard or a bad node disk.  While that's being fixed, I tried to make PBS queue jobs without this node, by
>(1) > pbsnodes -o node006
>which gives the error: Error marking node node006 - Unauthorized Request
>(I was root, via 'su - root')
>
>(2) Deleted the line for the node in:
>  /var/spool/torque/server_priv/nodes
>and restarted PBS:
>  /etc/init.d/pbs stop
>  /etc/init.d/pbs start
>which appeared to start all right.
>
>Now the problem is that all jobs queued (by qsub) just sit there without being dispatched to any node. I tried deleting everything in the queue and resubmitting, but the result is the same.  Any hint as to what the problem could be?
>
>Thanks in advance,
>
>Zhiliang
>
>--
>Zhi-Liang Hu (PhD)
>Associate Scientist,
>Assistant to NAGRP Bioinformatics Coordinators,
>National Animal Genome Research Program,
>Department of Animal Science,
>Center for Integrated Animal Genomics,
>Iowa State University
>Tel: 901-759-0643 (H,O) 901-212-2820 (C)
>Web: http://www.animalgenome.org
>
>"Not everything that counts can be counted, and
>    not everything that can be counted counts."
>
>"If you torture the data long enough,
>it will confess."  -- Ronald Coase
>
>
>
>_______________________________________________
>Bioclusters maillist  -  Bioclusters at bioinformatics.org
>http://www.bioinformatics.org/mailman/listinfo/bioclusters
>



