[Bioclusters] gpfs overload on ibm bladecenter cluster

Guy Coates gmpc at sanger.ac.uk
Thu Jan 26 04:28:03 EST 2006


On Thu, 26 Jan 2006, Hershel Safer wrote:

> We're running two small IBM BladeCenter clusters under SuSE, with GPFS for (we hope) fast file
> I/O. It seems to us that when user processes on a blade are particularly memory intensive, and
> GPFS needs to compete for a resource (memory in this case), GPFS most likely won't survive the
> competition and will die.

Recent kernels have an entry in

/proc/<PID>/oom_adj

If you echo a low number in there (google for sensible values) it will
protect processes (e.g. the GPFS ones) from being zapped by the
out-of-memory killer.

You can also put a high number in there for user processes, so those are
the first against the wall, come the revolution.
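
A rough sketch of the oom_adj approach in Python, untested against GPFS
itself. It looks up processes by name and writes a value into their
oom_adj file. It assumes the documented oom_adj range of -17 to 15, where
-17 exempts a process from the OOM killer altogether. The helper names
(pids_by_name, set_oom_adj) and the proc_root parameter are made up for
illustration, and lowering the value needs root. Kernels from 2.6.36
onwards prefer /proc/<PID>/oom_score_adj, so check which file your kernel
provides.

```python
from pathlib import Path

# Documented bounds for /proc/<PID>/oom_adj; -17 disables OOM killing.
OOM_ADJ_MIN = -17
OOM_ADJ_MAX = 15


def pids_by_name(name, proc_root="/proc"):
    """Return sorted PIDs whose command name (from <pid>/stat) equals name."""
    pids = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as "self"
        try:
            stat = (entry / "stat").read_text()
            # The stat line looks like: "4242 (mmfsd) S 1 ..."
            comm = stat[stat.index("(") + 1 : stat.rindex(")")]
        except (OSError, ValueError):
            continue  # process went away, or unreadable/odd stat line
        if comm == name:
            pids.append(int(entry.name))
    return sorted(pids)


def set_oom_adj(pid, value, proc_root="/proc"):
    """Write value to <proc_root>/<pid>/oom_adj and return the path written."""
    if not OOM_ADJ_MIN <= value <= OOM_ADJ_MAX:
        raise ValueError(f"oom_adj must be in [{OOM_ADJ_MIN}, {OOM_ADJ_MAX}]")
    path = Path(proc_root) / str(pid) / "oom_adj"
    path.write_text(f"{value}\n")
    return path
```

To protect the GPFS daemon you would run something like
"for pid in pids_by_name('mmfsd'): set_oom_adj(pid, -17)" as root, and
re-run it whenever the daemon restarts, since a new process starts with a
fresh value.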

You can also enforce per-process memory limits, either in
/etc/security/limits.conf or with your job scheduler, if you run one.
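
For illustration, the same kind of per-process limit can be applied to a
single command. The Python sketch below caps the address space (RLIMIT_AS)
of a child process before it starts. limits.conf applies its limits at
login through PAM, and a job scheduler applies its own when it launches a
job, so this is only a stand-in for either. The function name
run_with_memory_limit is hypothetical, and the sketch assumes Linux and
Python 3.7 or later.

```python
import resource
import subprocess


def run_with_memory_limit(cmd, max_bytes):
    """Run cmd with its address space capped at max_bytes (or the current
    hard limit, if that is lower). Returns a CompletedProcess."""

    def apply_limit():
        # Runs in the child between fork and exec, so only the child is capped.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        limit = max_bytes
        if hard != resource.RLIM_INFINITY:
            limit = min(limit, hard)  # an unprivileged process cannot exceed hard
        resource.setrlimit(resource.RLIMIT_AS, (limit, limit))

    return subprocess.run(
        cmd, preexec_fn=apply_limit, capture_output=True, text=True
    )
```

A process that tries to allocate beyond the cap gets an allocation failure
instead of pushing the node into the OOM killer, which is the point of
limiting user jobs on nodes that also run GPFS.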

You might also consider not running jobs on the machines which are GPFS
NSD servers.


We primarily use limits enforced by the job scheduler, which seem to work
well for us.

Cheers,

Guy




> This may happen on one or more nodes of the cluster. The GPFS daemon
> 'mmfsd' will lose its connection to other members of the cluster and lose its GPFS filesystem
> mounts, and consequently any services that reside on GPFS will fail. The blade will not
> necessarily crash after that; it may stay afloat and may even be accessible via ssh.
>
> Have others encountered this situation? How can we prevent this behavior? More generally, what
> kinds of limits do you impose on consumption of resources such as memory and CPU? Thanks,
>
> Hershel
>
>
> _______________________________________________________________________________________________________
> Hershel M. Safer, Ph.D.
> Chair, 5th European Conference on Computational Biology (ECCB '06)
> Head, Bioinformatics Core Facility
> Weizmann Institute of Science
> PO Box 26, Rehovot 76100, Israel
> tel: +972-8-934-3456 | fax: +972-8-934-6006
> e-mail: hershel.safer at weizmann.ac.il | hsafer at alum.mit.edu
> url: http://bioportal.weizmann.ac.il
>
> ***************************************************
> Plan now for ECCB '06!
> 5th European Conference on Computational Biology
> Eilat, Israel, Sept 10 -- 13, 2006
> Visit www.eccb06.org for details
>

-- 
Dr. Guy Coates,  Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
Tel: +44 (0)1223 834244 x 6925
Fax: +44 (0)1223 494919


