[BioBrew Users] slow ssh response on cluster

Bill Barnard bill at barnard-engineering.com
Fri Aug 29 02:51:28 EDT 2003


So I'm just getting the cluster running and figured I'd try out the
Linpack-via-mpirun benchmark as shown in the Rocks 2.3.2 docs. Things
are running really slowly. My initial try, with the HPL.dat file
downloaded from the pages below, produced no results at all:

http://www.rocksclusters.org/rocks-documentation/2.3.2/launching-interactive-jobs.html
http://www.rocksclusters.org/rocks-documentation/2.3.2/examples/HPL.dat

If I drop N (I believe that's the dimension of the matrix A in the
system of linear equations) from 1000 to 100 or 200, I get results
back pretty quickly, though the throughput is pretty slow.
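
For reference, the problem size is set by these two lines near the top
of HPL.dat (format per the stock file; the 100 is just my reduced
value):

1            # of problems sizes (N)
100          Ns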

If I set N to 500 or more then, almost every time I've tried, some of
the jobs appear to hang; when I look on the compute nodes (0 & 1) I
can see that one of the xhpl processes has become a zombie. I don't
know why...
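
In case it helps to reproduce this, I spot the zombies from the
frontend with something like the following; a Z in the STAT column
marks the defunct process:

cluster-fork 'ps -eo pid,stat,comm | grep xhpl'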

I started by suspecting that my slow ethernet connections were slowing
me down. (I'm connecting the cluster with a hub instead of an ethernet
switch; I haven't gotten the switch running yet...) So I thought I'd
play around with the parameters in the HPL.dat file to see if I could
choose a better granularity for my system. I also noted that, of the
512 MB available on each of the two nodes, perhaps 1/3 was in use and
the other 2/3 free; maybe a better granularity would use more
memory...
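
Back-of-the-envelope (my arithmetic, not anything from the docs): HPL
stores the N x N matrix in double precision, so it needs roughly
8*N^2 bytes across the cluster. To fill, say, 80% of the 2 x 512 MB
here:

N ~ sqrt(0.8 * total_mem / 8)
  ~ sqrt(0.8 * 2 * 512*2^20 / 8)
  ~ 10000

so N=1000 touches only about 8 MB in total, which would explain all
the free memory I'm seeing.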

In every case, once I make the problem big enough, it looks as if the
slave processes become zombies and don't return, so the overall job
hangs.

So my next guess is that perhaps mpirun uses ssh to start the
processes on the nodes. I've previously noticed that cluster-fork and
ssh between nodes seem very slow. (I've not altered any system
configuration from the vanilla install yet.) It takes about ten
seconds for an ssh to complete a command. For example:

[billb@rocks-frontend-0 billb]$ cluster-fork ls -l
compute-0-0:
total 936
-rw-rw-r--    1 billb    billb        1054 Aug 28 23:24 HPL.dat
...
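
A quick way to pin down the per-connection cost (commands I'd try
next, not output I already have):

time ssh compute-0-0 /bin/true
ssh -v compute-0-0 /bin/true

The second should show which step of the handshake stalls.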

In /var/log/authpriv I see:

Aug 29 06:01:10 compute-0-0 sshd[2708]: Accepted rsa for billb from
10.1.1.1 port 37522
Aug 29 06:01:20 compute-0-1 sshd[8358]: Accepted rsa for billb from
10.1.1.1 port 37530
Aug 29 06:01:30 compute-0-2 sshd[8216]: Accepted rsa for billb from
10.1.1.1 port 37536
Aug 29 06:01:40 compute-0-3 sshd[8154]: Accepted rsa for billb from
10.1.1.1 port 37544
Aug 29 06:01:51 compute-0-4 sshd[8201]: Accepted rsa for billb from
10.1.1.1 port 37552
Aug 29 06:02:01 compute-0-5 sshd[8197]: Accepted rsa for billb from
10.1.1.1 port 37560
Aug 29 06:02:11 compute-0-6 sshd[8208]: Accepted rsa for billb from
10.1.1.1 port 37568
Aug 29 06:02:21 compute-0-7 sshd[2051]: Accepted rsa for billb from
10.1.1.1 port 37576
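
One pattern in the log above: the accepts land almost exactly ten
seconds apart, which smells more like a name-resolution timeout (sshd
on each node attempting a reverse lookup of 10.1.1.1 and waiting for
it to fail) than hub congestion. A check like this from the frontend
should tell whether the nodes can resolve that address:

ssh compute-0-0 'getent hosts 10.1.1.1'

If that hangs for about ten seconds too, the fix is probably an
/etc/hosts entry or resolv.conf tweak on the compute nodes rather
than the network hardware.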

This seems really slow to me, even for a hub configuration. Have you
seen anything like this before? Does it sound as if I'm barking up
the right tree?

I'll confess I've not yet really searched the Rocks list for this
symptom, but will do so shortly...

Thanks for any advice, help, or flames!

Bill
-- 
Bill Barnard <bill at barnard-engineering.com>



