So I'm just getting the cluster running and figured I'd try out the Linpack-via-mpirun benchmark as shown in the Rocks 2.3.2 docs:

http://www.rocksclusters.org/rocks-documentation/2.3.2/launching-interactive-jobs.html

Things are running really slowly. I couldn't get any results on the initial try with the downloaded HPL.dat file:

http://www.rocksclusters.org/rocks-documentation/2.3.2/examples/HPL.dat

If I drop the number N (I believe that's the dimension of the A matrix in the system of linear equations) from 1000 to 100 or 200, I get results back pretty quickly, though the throughput is pretty low. If I set N to 500 or more, then almost every time I've tried it, it appears that some of the jobs hang; I look on the compute nodes (0 & 1) and I can see that one of the xhpl processes has become a zombie. I don't know why...

I started by suspecting that my slow Ethernet connections were slowing me down. (I'm connecting the cluster with a hub instead of an Ethernet switch; I haven't got the switch running yet...) So I thought I'd play around with the parameters in the HPL.dat file to see if I could choose a better granularity for my system. I also noted that, of the 512 MB available on each of the two nodes, perhaps 1/3 was in use with the other 2/3 free; maybe a better granularity would use more memory...

In every case, once I make the problem big enough, it looks as if the slave processes become zombies and don't return, so the overall job hangs.

So my next guess is that perhaps MPI uses ssh to communicate between nodes. I've previously noticed that cluster-fork and ssh between nodes seem very slow. (I've not yet altered any system configuration from the vanilla install.) It takes about ten seconds for an ssh to complete a command. For example:

[billb@rocks-frontend-0 billb]$ cluster-fork ls -l
compute-0-0:
total 936
-rw-rw-r-- 1 billb billb 1054 Aug 28 23:24 HPL.dat
...
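
To pin down the ten-second figure per node rather than eyeballing cluster-fork, something like the following could be run from the frontend. It is only a minimal sketch, assuming a current Python 3 and passwordless ssh to the compute nodes; the helper names `time_command` and `ssh_argv` are made up for illustration, and `BatchMode=yes` is there only so ssh fails instead of prompting for a password.

```python
import subprocess
import time


def time_command(argv, timeout=60.0):
    """Run argv once and return (elapsed_seconds, exit_status)."""
    start = time.monotonic()
    result = subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,  # only the timing matters, discard output
        stderr=subprocess.DEVNULL,
        timeout=timeout,            # raises TimeoutExpired if the command hangs
    )
    return time.monotonic() - start, result.returncode


def ssh_argv(host, remote_command="true"):
    """Build the ssh command line that runs a trivial command on one host."""
    return ["ssh", "-o", "BatchMode=yes", host, remote_command]
```

For example, `time_command(ssh_argv("compute-0-0"))` returns the wall-clock seconds for one round trip to that node. If every node comes back at roughly ten seconds even for `true`, the delay is in connection setup rather than in the command being run.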

In /var/log/authpriv I see:

Aug 29 06:01:10 compute-0-0 sshd[2708]: Accepted rsa for billb from 10.1.1.1 port 37522
Aug 29 06:01:20 compute-0-1 sshd[8358]: Accepted rsa for billb from 10.1.1.1 port 37530
Aug 29 06:01:30 compute-0-2 sshd[8216]: Accepted rsa for billb from 10.1.1.1 port 37536
Aug 29 06:01:40 compute-0-3 sshd[8154]: Accepted rsa for billb from 10.1.1.1 port 37544
Aug 29 06:01:51 compute-0-4 sshd[8201]: Accepted rsa for billb from 10.1.1.1 port 37552
Aug 29 06:02:01 compute-0-5 sshd[8197]: Accepted rsa for billb from 10.1.1.1 port 37560
Aug 29 06:02:11 compute-0-6 sshd[8208]: Accepted rsa for billb from 10.1.1.1 port 37568
Aug 29 06:02:21 compute-0-7 sshd[2051]: Accepted rsa for billb from 10.1.1.1 port 37576

This seems really slow to me, even for a hub configuration.

Have you seen anything like this before? Does it sound as if I'm barking up the correct tree? I will confess I've not really yet searched the Rocks list for this symptom, but will do so shortly....

Thanks for any advice, help, or flames!

Bill

-- 
Bill Barnard <bill at barnard-engineering.com>
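
For reference, the spacing between the sshd entries quoted above can be computed directly from their timestamps. This is a small sketch in Python 3, not something from the Rocks tools: the function names are made up, and because syslog lines carry no year, an arbitrary one is prepended purely so the timestamps parse.

```python
from datetime import datetime

# The eight authpriv lines quoted above, verbatim.
LOG_LINES = """\
Aug 29 06:01:10 compute-0-0 sshd[2708]: Accepted rsa for billb from 10.1.1.1 port 37522
Aug 29 06:01:20 compute-0-1 sshd[8358]: Accepted rsa for billb from 10.1.1.1 port 37530
Aug 29 06:01:30 compute-0-2 sshd[8216]: Accepted rsa for billb from 10.1.1.1 port 37536
Aug 29 06:01:40 compute-0-3 sshd[8154]: Accepted rsa for billb from 10.1.1.1 port 37544
Aug 29 06:01:51 compute-0-4 sshd[8201]: Accepted rsa for billb from 10.1.1.1 port 37552
Aug 29 06:02:01 compute-0-5 sshd[8197]: Accepted rsa for billb from 10.1.1.1 port 37560
Aug 29 06:02:11 compute-0-6 sshd[8208]: Accepted rsa for billb from 10.1.1.1 port 37568
Aug 29 06:02:21 compute-0-7 sshd[2051]: Accepted rsa for billb from 10.1.1.1 port 37576
""".splitlines()

ARBITRARY_YEAR = "2000"  # assumption: any year works, only the differences are used


def accept_times(lines):
    """Return (host, timestamp) for every sshd 'Accepted' line."""
    entries = []
    for line in lines:
        if "Accepted" not in line:
            continue
        fields = line.split()
        stamp = " ".join([ARBITRARY_YEAR] + fields[:3])  # e.g. "2000 Aug 29 06:01:10"
        entries.append((fields[3], datetime.strptime(stamp, "%Y %b %d %H:%M:%S")))
    return entries


def gaps_in_seconds(entries):
    """Seconds between each accepted login and the next one."""
    return [
        int((later[1] - earlier[1]).total_seconds())
        for earlier, later in zip(entries, entries[1:])
    ]


entries = accept_times(LOG_LINES)
gaps = gaps_in_seconds(entries)
print(gaps)       # [10, 10, 10, 11, 10, 10, 10]
print(sum(gaps))  # 71 seconds from the first node to the eighth
```

The gaps are almost exactly ten seconds each, which looks more like a fixed delay per connection than like traffic-dependent slowness on the hub.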