[BioBrew Users] slow ssh response on cluster

Glen Otero gotero at linuxprophet.com
Fri Aug 29 14:31:55 EDT 2003


Bill-

You've come across two common slowdowns on the cluster: 1) the use of  
ssh, and 2) mpirun using ssh to start jobs.  This is a real pain, and  
something I will change in the future, but it requires a complete  
reorganization of MPI.  One thing I can recommend trying is adding rsh  
functionality to the cluster so jobs launch faster. Instructions  
for doing so are in the Rocks User's Guide.
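
Once rsh is enabled, I believe the MPICH that ships with Rocks (the  
ch_p4 device) can be pointed at it through an environment variable; a  
minimal sketch, assuming P4_RSHCOMMAND is the remote-shell hook in your  
MPICH build (check your mpirun's docs to be sure):

  # tell MPICH's ch_p4 device to launch with rsh instead of ssh
  export P4_RSHCOMMAND=rsh
  mpirun -np 2 ./xhpl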

WRT HPL, I'm not surprised by the weirdness you're seeing; I hear  
about it all the time. But I'm not an expert at tuning HPL, so I  
can't offer any hints.  If I come across anything, I'll let you know.

Glen


On Thursday, August 28, 2003, at 11:51 PM, Bill Barnard wrote:

> So I'm just getting the cluster running and figured I'd try out the
> Linpack benchmark via mpirun, as shown in the Rocks 2.3.2 docs. Things
> are running really slowly. I couldn't get any results for the initial
> try with the downloaded HPL.dat file from
>
> http://www.rocksclusters.org/rocks-documentation/2.3.2/launching-interactive-jobs.html
> http://www.rocksclusters.org/rocks-documentation/2.3.2/examples/HPL.dat
>
> If I drop the number N (I believe that's the dimension of the A matrix
> in the system of linear equations) from 1000 to 100 or 200, I get
> results back pretty quickly, though the throughput is pretty slow.
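>
> For reference, the lines I'm changing in HPL.dat look roughly like
> this (the labels are the stock file's own):
>
>   1            # of problems sizes (N)
>   200          Ns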
>
> If I set N to 500 or more, then almost every time I've tried, it appears
> that some of the jobs hang; I look on the compute nodes (0 & 1) and can
> see that one of the xhpl processes has become a zombie. I don't know
> why...
>
> I started by suspecting that my slow ethernet connections were slowing
> me down (I'm connecting the cluster with a hub instead of an ethernet
> switch; I haven't gotten the switch running yet...), so I thought I'd
> play around with the parameters in the HPL.dat file to see if I could
> choose a better granularity for my system. I also noted that, of the
> 512 MB available on each of the two nodes, perhaps 1/3 was in use and
> the other 2/3 free; maybe a better granularity would use more memory...
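>
> As a rough sanity check (my assumptions: 8-byte doubles, and the usual
> rule of thumb of filling about 80% of total memory with the matrix),
> the largest sensible N for 2 x 512 MB would be somewhere around:
>
>   $ awk 'BEGIN { mem = 2 * 512 * 1024 * 1024; printf "N ~ %d\n", sqrt(0.8 * mem / 8) }'
>   N ~ 10362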
>
> In every case, once I make the problem big enough, it looks as if the
> slave processes become zombies and don't return, so the overall job
> hangs.
>
> So my next guess is that perhaps MPI uses ssh to communicate between
> nodes. I've previously noticed that cluster-fork and ssh between nodes
> seem very slow. (I've not altered any system configurations from the
> vanilla yet.) It takes about ten seconds for an ssh to complete a
> command. For example:
>
> [billb at rocks-frontend-0 billb]$ cluster-fork ls -l
> compute-0-0:
> total 936
> -rw-rw-r--    1 billb    billb        1054 Aug 28 23:24 HPL.dat
> ...
>
> In /var/log/authpriv I see:
>
> Aug 29 06:01:10 compute-0-0 sshd[2708]: Accepted rsa for billb from
> 10.1.1.1 port 37522
> Aug 29 06:01:20 compute-0-1 sshd[8358]: Accepted rsa for billb from
> 10.1.1.1 port 37530
> Aug 29 06:01:30 compute-0-2 sshd[8216]: Accepted rsa for billb from
> 10.1.1.1 port 37536
> Aug 29 06:01:40 compute-0-3 sshd[8154]: Accepted rsa for billb from
> 10.1.1.1 port 37544
> Aug 29 06:01:51 compute-0-4 sshd[8201]: Accepted rsa for billb from
> 10.1.1.1 port 37552
> Aug 29 06:02:01 compute-0-5 sshd[8197]: Accepted rsa for billb from
> 10.1.1.1 port 37560
> Aug 29 06:02:11 compute-0-6 sshd[8208]: Accepted rsa for billb from
> 10.1.1.1 port 37568
> Aug 29 06:02:21 compute-0-7 sshd[2051]: Accepted rsa for billb from
> 10.1.1.1 port 37576
>
> This seems really slow to me, even for a hub configuration. Have you
> seen anything like this before? Does it sound as if I'm barking up the
> correct tree?
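>
> (One quick check I can try, to separate the per-connection cost from
> the command itself, is timing a bare ssh; standard time/ssh usage,
> assuming compute-0-0 is reachable by that name from the frontend:
>
>   $ time ssh compute-0-0 true
>
> If nearly all of the ten seconds is connection setup, that points at
> ssh itself rather than the hub.)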
>
> I will confess I haven't yet really searched the Rocks list for this
> symptom, but will do so shortly...
>
> Thanks for any advice, help, or flames!
>
> Bill
> -- 
> Bill Barnard <bill at barnard-engineering.com>
>
> _______________________________________________
> BioBrew-Users mailing list
> BioBrew-Users at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/BioBrew-Users
>
>
Glen Otero, Ph.D.
Linux Prophet
619.917.1772



