[Bioclusters] Queue Problems / Dead Processes

14 Apr 2003 22:55:33 -0400

Hi David:

  Before you start a blast job on that host, try to start up a copy of 

	vmstat 1 | tee troubled_queue

so that we can capture what the machine state is.  Also, I usually
recommend grabbing the atop program
(ftp://ftp.atcomputing.nl/pub/tools/linux, 1.9-1 is the latest).  If the
node is unresponsive after getting the jobs in the queue, could you help
us understand what the user load looks like?  A mostly unresponsive host
consuming 99.9% of CPU sounds a bit like a system that is swapping
hard.  The symptoms would be very sluggish response and system CPU
time of 50% or more (this depends on the system and its configuration).  If
you are hammering the disk, and it is an IDE disk, you may be swamped by
interrupts.  It might be worth looking to see if UDMA is available and
turned on.
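
Once the vmstat log exists, something like the following could give a
quick read on whether it shows the swapping symptoms described above.
This is only a rough sketch: the function name summarize_vmstat is made
up for illustration, the 50% cutoff is just the rule of thumb mentioned
above, and it assumes the log contains the usual vmstat header row with
si, so, and sy columns.  Keep in mind that the first sample vmstat
prints is an average since boot, not a live reading.

```shell
#!/bin/sh
# Hypothetical helper: count vmstat samples that show swap traffic or
# high system CPU time.  Column positions are read from the header row,
# since the vmstat layout differs between procps versions.
summarize_vmstat() {
    awk '
        # Header row (starts with "r"): remember each column index.
        $1 == "r" { for (i = 1; i <= NF; i++) col[$i] = i; next }
        # Data rows start with a number; the banner line does not.
        ("si" in col) && $1 ~ /^[0-9]+$/ {
            n++
            # Any swap-in or swap-out activity in this sample.
            if ($(col["si"]) + $(col["so"]) > 0) swapping++
            # System CPU at or above the assumed 50% rule of thumb.
            if ($(col["sy"]) >= 50) high_sys++
        }
        END {
            printf "samples=%d swapping=%d high_sys=%d\n", n, swapping, high_sys
        }
    ' "$1"
}
```

Running `summarize_vmstat troubled_queue` prints a single line of
counts.  If most samples show swap traffic while the node is sluggish,
that points toward memory pressure rather than the queueing system.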

Which kernel version are you running, and how are the disk, memory, CPU,
and network card configured?  I don't think there is enough
information to distinguish between a node issue and an SGE issue.  If
the same jobs go to another node (identical or similar), do you see
similar results?
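
One possible way to gather the details asked about above in a single
pass is sketched below.  It is a convenience sketch only: the function
name collect_node_info is made up, /dev/hda is an assumed device name
for the first IDE disk, and lspci and hdparm are skipped if they are
not installed.  Reading the DMA flag with hdparm generally needs root.

```shell
#!/bin/sh
# Hypothetical helper: print basic node facts (kernel, CPU count,
# memory and swap size, network card, IDE DMA setting).
collect_node_info() {
    disk=${1:-/dev/hda}   # assumed device name; adjust for your node
    echo "kernel: $(uname -r)"
    if [ -r /proc/cpuinfo ]; then
        echo "cpus: $(grep -c '^processor' /proc/cpuinfo)"
    fi
    if [ -r /proc/meminfo ]; then
        grep -E '^(MemTotal|SwapTotal):' /proc/meminfo || true
    fi
    # List Ethernet controllers if lspci is available.
    if command -v lspci >/dev/null 2>&1; then
        lspci | grep -i 'ethernet' || true
    fi
    # hdparm -d with no value reports the current using_dma flag.
    if [ -b "$disk" ] && command -v hdparm >/dev/null 2>&1; then
        hdparm -d "$disk" || true
    fi
    return 0
}
```

For example, `collect_node_info /dev/hda > node_info.txt` on both the
troubled node and a healthy one makes the two easy to compare with diff.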

Joe

-- 
Joseph Landman, Ph.D
Scalable Informatics LLC
email: landman@scalableinformatics.com
  web: http://scalableinformatics.com
phone: +1 734 612 4615