[Bioclusters] Queue Problems / Dead Processes

Tue, 15 Apr 2003 12:37:09 -0400

andy law (RI) wrote:

[...]

> Kernel is 2.4.2-2smp. Red Hat Linux 7.1 (Seawolf). Dual 1.2 GHz P-IIIs with 4 GB RAM and a single 80 GB IDE drive per node.
> 
> One extra clue that David just pointed out to me is that attempts to stop the SGE installation (using rcsge) on the affected node hang the terminal session. Attempts to list files in the directory containing the rcsge executable also hang the terminal session on the affected node. This directory is NFS mounted from the head node. Perhaps we are having an NFS problem?
> 
> Curiously, we can list the directories on that NFS mount, but not look at the contents of the files in it.
> 
> Any thoughts on that?

Absolutely.  I have seen something like this when the NFS server fails 
to respond to an active process.  What are your mount options for the 
file system?  Are you using regular /etc/fstab based hard mounts or 
automounts?

I use options like

			bg,soft,intr,rsize=8192,wsize=8192

on my NFS mounts.  At minimum, you want intr, which allows NFS-based I/O
to be interrupted by signals.  This might make the hung SGE processes
killable.
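
To see which NFS mounts on a node are missing intr, something like the
following could be used.  This is only a sketch: the function name is
made up for illustration, it assumes the mount table is in the usual
/proc/mounts layout (device, mount point, type, options), and it assumes
a kernel of this vintage where intr is actually honoured.

```shell
#!/bin/sh
# Print the mount point of every NFS mount whose options lack "intr".
# nfs_mounts_without_intr is a hypothetical helper name, not an existing tool.
nfs_mounts_without_intr() {
    # $1: mount table to read; defaults to /proc/mounts
    awk '$3 == "nfs" {
        n = split($4, opts, ",")        # options are comma separated
        found = 0
        for (i = 1; i <= n; i++)
            if (opts[i] == "intr") found = 1
        if (!found) print $2            # report the mount point
    }' "${1:-/proc/mounts}"
}
```

Running nfs_mounts_without_intr with no argument on the affected node
would list the mount points worth remounting with the options above.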

I would also look in the logs on the compute node (and head node) to see 
if you get messages like

	NFS server xxxxx not responding

(client side), and odd NFS messages on the server side.
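
A quick way to pull those client-side messages out of the logs is
sketched below.  Assumptions: the log lives in /var/log/messages (the
usual Red Hat location, but check your syslog configuration), and the
message has the general form "nfs: server NAME not responding"; the
exact wording varies between kernels.  The function name is hypothetical.

```shell
#!/bin/sh
# Count "not responding" messages per NFS server in a syslog file.
# nfs_timeouts is a hypothetical helper name.
nfs_timeouts() {
    # $1: log file to scan; /var/log/messages is an assumed default
    grep -iE 'nfs:? server .* not responding' "${1:-/var/log/messages}" |
        sed -E 's/.*[Ss]erver ([^ ]+) not responding.*/\1/' |  # keep the server name
        sort | uniq -c                                          # occurrences per server
}
```

If the counts cluster around the times the queue went bad, that points
fairly strongly at the NFS server or the network path to it.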

Is this a PC/Linux based NFS server?  Is it under heavy load?  What are
the details of it (which kernel it runs, how many nfsd threads, network
configuration, etc.)?
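
For the nfsd count, a rough sketch (run on the head node) is below.  It
simply counts processes named nfsd in a list of process names read from
standard input; the function name is hypothetical, and it assumes the
kernel NFS server, whose threads show up under that name.

```shell
#!/bin/sh
# Count nfsd entries in a list of process names, one name per line on stdin.
# count_nfsd is a hypothetical helper name.
count_nfsd() {
    grep -cx 'nfsd'     # -x: whole-line match, -c: print the match count
}

# Typical use on the NFS server:
#   ps -e -o comm= | count_nfsd
#   uptime              # load averages, to judge how busy the server is
```

A small thread count on a server feeding many busy compute nodes would
be one thing to look at.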

I think we are on the right track here (NFS).

Note also:  Many good fixes for NFS and mounts went in after 2.4.2.
You might wish to consider (eventually) upgrading the compute nodes to
a later kernel.

> 
> Later,
> 
> Andy
> 
> 
> Dr. Andy Law
> --------------------
> Head of Bioinformatics - Roslin Institute
> 
> 
> 
>>-----Original Message-----
>>From: Joseph Landman [mailto:landman@scalableinformatics.com]
>>Sent: 15 April 2003 03:56
>>To: biocluster
>>Subject: Re: [Bioclusters] Queue Problems / Dead Processes
>>
>>
>>Hi David:
>>
>>  Before you start a blast job on that host, try to start up a copy of
>>
>>	vmstat 1 | tee troubled_queue
>>
>>so that we can capture what the machine state is.  Also, I usually
>>recommend grabbing the atop program
>>(ftp://ftp.atcomputing.nl/pub/tools/linux, 1.9-1 is the latest).  If the
>>node is unresponsive after getting the jobs in the queue, could you help
>>us understand what the user load looks like?  A mostly unresponsive host
>>consuming 99.9% of CPU sounds a bit like a system which is swapping
>>hard.  The symptoms of this would be very sluggish response and system
>>CPU usage in the 50+% range (depending upon system and configuration).
>>If you are hammering the disk, and it is an IDE disk, you may be swamped
>>by interrupts.  It might be worth looking to see if UDMA is available
>>and turned on.
>>
>>Which kernel version, what configuration of disk, memory, and CPU, what
>>network card, how is it configured, etc.?  I don't think there is enough
>>information yet to distinguish between a node issue and an SGE issue.
>>If the same jobs go to another node (identical or similar), do you see
>>similar results?
>>
>>Joe
>>
>>-- 
>>Joseph Landman, Ph.D
>>Scalable Informatics LLC
>>email: landman@scalableinformatics.com
>>  web: http://scalableinformatics.com
>>phone: +1 734 612 4615
>>
>>_______________________________________________
>>Bioclusters maillist  -  Bioclusters@bioinformatics.org
>>https://bioinformatics.org/mailman/listinfo/bioclusters
>>

-- 
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
phone: +1 734 612 4615