[Bioclusters] gmond and high loads under Suse w/ 2.6 kernel?
Chris Dwan
bioclusters@bioinformatics.org
Thu, 4 Nov 2004 08:17:44 -0500
I'm working with a cluster which has unexplained high load values
(hovering between 1 and 2 with the system sitting idle) on the portal.
It's a 32-node, 64-CPU Opteron cluster running SUSE with the 2.6
kernel.
When I turn off Ganglia's gmond daemon, the load drops down to ordinary
rest states (0.1-ish). After some debugging to isolate the behavior,
there's clearly a causal link between gmond on the portal and these
high loads.
Gmond does not appear to be taking very much cpu time, doesn't hang out
in "top", and otherwise doesn't seem to be the real problem. The
cluster is relatively small (32 nodes). If I turn off all of the
cluster gmond processes, the load drops some, but not all the way to a
rest state.
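One way to narrow down a high load that does not show up as CPU time: on
Linux the load average counts tasks in uninterruptible sleep (state D) as
well as runnable ones (state R), so the load can rise while "top" stays
quiet. The script below is only a rough sketch of that check and was not
part of the debugging described here. The function names are made up for
illustration, and it looks at processes only, not at individual threads.

```python
#!/usr/bin/env python3
"""Sketch: list the processes that may be contributing to the load average."""
import os


def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into (1min, 5min, 15min) floats."""
    fields = text.split()
    return float(fields[0]), float(fields[1]), float(fields[2])


def parse_stat_state(stat_line):
    """Return (command, state) from the contents of /proc/<pid>/stat.

    The command is wrapped in parentheses and may itself contain spaces or
    parentheses, so split on the last ')' instead of on whitespace.
    """
    left, right = stat_line.rsplit(")", 1)
    command = left.split("(", 1)[1]
    state = right.split()[0]
    return command, state


def tasks_in_states(states=("R", "D"), proc_root="/proc"):
    """List (pid, command, state) for processes currently in the given states."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                command, state = parse_stat_state(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited, or the line was not in the expected form
        if state in states:
            found.append((int(entry), command, state))
    return found


def report(proc_root="/proc"):
    """Print the load averages and every process in state R or D."""
    loadavg_path = os.path.join(proc_root, "loadavg")
    if not os.path.exists(loadavg_path):
        print("no", loadavg_path, "found; this sketch only works on Linux")
        return
    with open(loadavg_path) as handle:
        print("load averages:", parse_loadavg(handle.read()))
    for pid, command, state in tasks_in_states(proc_root=proc_root):
        print(pid, state, command)


if __name__ == "__main__":
    report()
```

Running it a few times with gmond on and off should show whether gmond, or
some kernel thread, keeps turning up in state D. A single snapshot can miss
short-lived states, so repeated samples are more telling than one run.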
The system is sluggish when the load reports high, but not as sluggish
as I might expect.
Has anyone seen this before? It's more annoying than anything else.
I'm tempted to blame "something in the kernel" and "multicast," but I
would love to have a more robust explanation.
-Chris Dwan
The BioTeam