[Bioclusters] Queue Problems / Dead Processes

bioclusters@bioinformatics.org bioclusters@bioinformatics.org
Tue, 15 Apr 2003 11:31:27 -0500


If your problem is NFS related, perhaps this howto can help you?

http://gridengine.sunsource.net/howto/nfsreduce.html

-Bonnie



From:              "andy law (RI)" <andy.law@bbsrc.ac.uk>
Sent by:           bioclusters-admin@bioinformatics.org
Date:              04/15/2003 10:49 AM
Please respond to: bioclusters
To:                "'bioclusters@bioinformatics.org'" <bioclusters@bioinformatics.org>
cc:
Subject:           RE: [Bioclusters] Queue Problems / Dead Processes




Joe,

Thanks for the reply.

To clarify a bit, the hangup seems random - it's not a particular node and
it's not a particular job that we can determine. If we submit 10,000
sequences to be blast searched, we usually find that 1 or 2 nodes will end
up in this irritating state. As far as the output goes, all the jobs
normally return something, so it looks as if the search part is complete
and the execution host process (blastall) is failing to terminate
correctly.

When we run vmstat on the affected nodes, there is no swapping going on
(there is 4GB physical RAM on each dual processor node) and there are no
more interrupts on the affected nodes than there are on unaffected nodes
(around 100 per second - is that about right?)

A typical affected node that I'm looking at right now has 6 blocked
processes and 2 'swapped but otherwise runnable'. There are no blocked
processes on unaffected nodes.

On affected nodes, sge_execd is reported by top to be in 'uninterruptible
sleep'. There is also a single sge_shepherd process on affected nodes in
the same state.
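
For anyone wanting to check a node for this without watching top: 'uninterruptible sleep' is state D in the third field of /proc/<pid>/stat on Linux, and a process in that state does not act on signals (SIGKILL included) until the kernel call it is blocked in returns. The following is only a minimal sketch, assuming a Linux /proc filesystem; the function names parse_stat_line and blocked_processes are made up for illustration and are not part of SGE or any other tool.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the first '(' and the last ')'.
    lparen = line.index("(")
    rparen = line.rindex(")")
    pid = int(line[:lparen])
    comm = line[lparen + 1:rparen]
    state = line[rparen + 1:].split()[0]
    return pid, comm, state


def blocked_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError):
            continue  # process exited while we were reading it
        if state == "D":
            found.append((pid, comm))
    return found
```

Calling blocked_processes() on an affected node would be expected to list sge_execd and the stuck sge_shepherd; on a healthy node the list should usually be empty or change from one call to the next.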

top reports that the blastall process is taking 99.9% CPU time, but no
other resources (no memory). It has priority 20, nice 0.

vmstat reports that all of the CPU usage is 'system' rather than 'user'.
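
If a vmstat log has been captured to a file, the interesting samples (blocked processes, CPU time mostly 'system') can be picked out mechanically. This is a rough sketch, assuming the usual vmstat layout in which the column-name row starts with "r". The SAMPLE text is invented to resemble the symptoms described above and is not real output from our nodes; parse_vmstat, suspicious and the 50% threshold are hypothetical choices.

```python
# Invented sample in the older procps layout (r, b, w columns); not real data.
SAMPLE = """\
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 1  0  0      0 812340  90120 2504000   0   0     2     5  101    40  98   1   1
 1  6  2      0 810200  90120 2504100   0   0     0     0  104    35   0 100   0
"""


def parse_vmstat(text):
    """Turn vmstat output into a list of dicts keyed by column name."""
    columns = None
    samples = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "r":
            columns = fields  # the column-name row (repeats periodically)
            continue
        if columns is None or len(fields) != len(columns):
            continue  # banner row or a truncated line
        try:
            values = [int(f) for f in fields]
        except ValueError:
            continue  # not a data row
        samples.append(dict(zip(columns, values)))
    return samples


def suspicious(sample, sy_threshold=50):
    """Flag a sample with blocked processes or mostly-system CPU time."""
    return sample.get("b", 0) > 0 or sample.get("sy", 0) >= sy_threshold


if __name__ == "__main__":
    for s in parse_vmstat(SAMPLE):
        if suspicious(s):
            print("b=%(b)d si=%(si)d so=%(so)d us=%(us)d sy=%(sy)d" % s)
```

Because the columns are looked up by name, the same code should cope with newer vmstat versions that drop the w column or add wa and st.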

Attempts to kill the blastall process using 'kill -KILL' (or -QUIT or -INT)
all send the signal quite happily but the process refuses to go away.

The queue is in an alarm state and we can't seem to get it back without a
hard reboot. Soft reboot is ignored i.e. the machine will not soft reboot.

The kernel is 2.4.2-2smp on Red Hat Linux 7.1 (Seawolf). Each node has dual
1.2GHz P-IIIs, 4GB RAM and a single 80GB IDE drive.

One extra clue that David just pointed out to me is that any attempt to
stop the sge installation (using rcsge) on the affected node hangs the
terminal session. Attempts to list files in the directory containing the
rcsge executable also hang the terminal session on the affected node. This
directory is NFS mounted from the head node. Perhaps we are having an NFS
problem?

Curiously, we can list the directories on that NFS mount, but not look at
the contents of the files in it.
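
Since touching the mount by hand costs a terminal session each time, one option is to let a throwaway child process do the read and give up on it after a timeout. The sketch below is only an illustration under some assumptions: the Python interpreter itself is on local disk (not on the suspect mount), and probe and read_probe_cmd are invented names. It polls rather than waits, because a child stuck in uninterruptible sleep cannot be reaped and waiting on it would hang the checker too.

```python
import subprocess
import sys
import time


def probe(cmd, timeout=5.0, poll_interval=0.1):
    """Run cmd and return 'ok', 'error' or 'hung' without blocking on it."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        rc = proc.poll()
        if rc is not None:
            return "ok" if rc == 0 else "error"
        time.sleep(poll_interval)
    rc = proc.poll()  # one last look before giving up
    if rc is not None:
        return "ok" if rc == 0 else "error"
    # Send SIGKILL but deliberately do not wait(): a child blocked on a
    # dead NFS server ignores the signal and wait() would never return.
    proc.kill()
    return "hung"


def read_probe_cmd(path):
    """Command line for a child process that reads one byte from path."""
    code = "import sys; open(sys.argv[1], 'rb').read(1)"
    return [sys.executable, "-c", code, path]
```

For example, probe(read_probe_cmd(path)) with path set to a file on the NFS mount would return 'hung' if reading the file blocks, while the same call on a local file should return 'ok'.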

Any thoughts on that?

Later,

Andy


Dr. Andy Law
--------------------
Head of Bioinformatics - Roslin Institute

> -----Original Message-----
> From: Joseph Landman [mailto:landman@scalableinformatics.com]
> Sent: 15 April 2003 03:56
> To: biocluster
> Subject: Re: [Bioclusters] Queue Problems / Dead Processes
>
>
> Hi David:
>
>   Before you start a blast job on that host, try to start up a copy of
>
>            vmstat 1 | tee troubled_queue
>
> so that we can capture what the machine state is.  Also, I usually
> recommend grabbing the atop program
> (ftp://ftp.atcomputing.nl/pub/tools/linux, 1.9-1 is the latest).  If
> the node is unresponsive after getting the jobs in the queue, could
> you help us understand what the user load looks like?  A mostly
> unresponsive host consuming 99.9% of CPU sounds a bit like a system
> which is swapping hard.  The symptoms of this would be very sluggish
> response and system CPU usage of 50% or more (depending on system and
> configuration).  If you are hammering the disk, and it is an IDE
> disk, you may be swamped by interrupts.  It might be worth looking to
> see if UDMA is available and turned on.
>
> Which kernel version, what configuration of disk, memory, CPU, what
> network card, how configured, etc.?  I don't think there is enough
> information to distinguish between a node issue and an SGE issue.  If
> the same jobs go to another node (identical or similar), do you see
> similar results?
>
> Joe
>
> --
> Joseph Landman, Ph.D
> Scalable Informatics LLC
> email: landman@scalableinformatics.com
>   web: http://scalableinformatics.com
> phone: +1 734 612 4615
>
> _______________________________________________
> Bioclusters maillist  -  Bioclusters@bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters