On Thu, 26 Jan 2006, Hershel Safer wrote:

> We're running two small IBM BladeCenter clusters under SuSE, with GPFS for (we hope) fast file
> I/O. It seems to us that when user processes on a blade are particularly memory intensive, and
> GPFS needs to compete for a resource (memory in this case), GPFS most likely won't survive the
> competition and will die.

Recent kernels have an entry at /proc/<PID>/oom_adj. If you echo a low number into it (google for
sensible values), it will protect processes (e.g. the GPFS ones) from being zapped by the
out-of-memory killer. You can also put a high number in there for user processes, so those are
the first against the wall, come the revolution.

You can also enforce per-process memory limits, either in /etc/security/limits.conf or with your
job scheduler, if you run one. You might also consider not running jobs on the machines which
are GPFS NSD servers.

We primarily use job-scheduler-enforced limits, which seem to work well for us.

Cheers,

Guy

> This may happen on one or more nodes of the cluster. The GPFS daemon
> 'mmfsd' will lose its connection to other members of the cluster and lose its GPFS filesystem
> mounts, and consequently any services that reside on GPFS will fail. The blade will not
> necessarily crash after that; it may stay afloat and may even be accessible via ssh.
>
> Have others encountered this situation? How can we prevent this behavior? More generally, what
> kinds of limits do you impose on consumption of resources such as memory and CPU? Thanks,
>
> Hershel
>
>
> ______________________________________________________________________
> Hershel M. Safer, Ph.D.
> Chair, 5th European Conference on Computational Biology (ECCB '06)
> Head, Bioinformatics Core Facility
> Weizmann Institute of Science
> PO Box 26, Rehovot 76100, Israel
> tel: +972-8-934-3456 | fax: +972-8-934-6006
> e-mail: hershel.safer at weizmann.ac.il | hsafer at alum.mit.edu
> url: http://bioportal.weizmann.ac.il
>
> ***************************************************
> Plan now for ECCB '06!
> 5th European Conference on Computational Biology
> Eilat, Israel, Sept 10 -- 13, 2006
> Visit www.eccb06.org for details
>

--
Dr. Guy Coates, Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
Tel: +44 (0)1223 834244 x 6925
Fax: +44 (0)1223 494919
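
A footnote on the oom_adj suggestion above, for anyone who would rather script it than echo by
hand. This is only a rough sketch, not something taken from a production setup: the function
names and the proc_root parameter are made up for illustration, and it assumes the legacy
oom_adj interface, which accepts values from -17 (never kill this process) to 15 (kill it
first). Newer kernels prefer /proc/<PID>/oom_score_adj, which uses a different range.

```python
from pathlib import Path

# Legacy oom_adj interface: valid values run from -17 to 15.
# -17 exempts a process from the OOM killer; 15 makes it the preferred victim.
OOM_ADJ_MIN = -17
OOM_ADJ_MAX = 15


def oom_adj_path(pid, proc_root="/proc"):
    """Return the path of the oom_adj entry for the given PID.

    proc_root is a hypothetical knob so the code can be tried against a
    scratch directory instead of the real /proc.
    """
    if pid <= 0:
        raise ValueError("invalid PID: %d" % pid)
    return Path(proc_root) / str(pid) / "oom_adj"


def set_oom_adj(pid, value, proc_root="/proc"):
    """Write an OOM adjustment for one process.

    Lowering the value on the real /proc normally requires root.
    """
    if not OOM_ADJ_MIN <= value <= OOM_ADJ_MAX:
        raise ValueError(
            "oom_adj must be between %d and %d, got %d"
            % (OOM_ADJ_MIN, OOM_ADJ_MAX, value)
        )
    # Equivalent to: echo <value> > /proc/<PID>/oom_adj
    oom_adj_path(pid, proc_root).write_text("%d\n" % value)


def protect(pids, proc_root="/proc"):
    """Exempt every listed PID (e.g. the GPFS daemons) from the OOM killer."""
    for pid in pids:
        set_oom_adj(pid, OOM_ADJ_MIN, proc_root)
```

Run as root, something like protect([mmfsd_pid]) would shield the GPFS daemon, where mmfsd_pid
is whatever PID you looked up for mmfsd, and set_oom_adj(user_pid, 15) would volunteer a user
job as the first victim. The setting does not survive a daemon restart, so it would have to be
reapplied whenever mmfsd comes back up.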