On Friday, August 29, 2003, at 03:53 PM, Bill Barnard wrote:

> My cluster is working okay. I've tested submitting small jobs via SGE,
> which seems to work fine. I submitted a few small HPL jobs via SGE,
> which worked fine. Large HPL jobs still end up with a zombie process
> using SGE. (Will troubleshoot that later...)

Zombie processes suck because even if you kill the processes on the
frontend, they will still be running on the compute nodes; you have to
kill them individually on each node. Here's an easy way to clean up all
the nodes:

% cluster-fork skill -KILL -u <username>

Do this for any users that have processes. If you do it as yourself (not
root) it will probably give you a disconnection message from each of the
nodes, but don't worry about that. After that, if you run 'ps' you
shouldn't see any user processes out there.

WRT the HPL zombie processes: if the compute nodes are not Pentium 4
processors, then you might see zombie process behavior. The binaries for
HPL were optimized for the Pentium 4 and use instructions (SSE2) that
are not available on the Pentium III or Athlon. The solution is to
recompile the ATLAS library, install it, and rebuild HPL against it.

It is easiest to just download the prebuilt ATLAS libraries from netlib:

http://www.netlib.org/atlas/archives/linux/

But if you want to rebuild ATLAS and HPL from scratch, you should start
by checking out a Rocks CVS source tree (make sure to get the 2_3_2
version and not the HEAD):

# cvs -d:pserver:anonymous@cvs.rocksclusters.org:/home/cvs/CVSROOT/ \
    checkout -r ROCKS_2_3_2_i386 rocks-src

Rebuild and install ATLAS:

# cd rocks/src/contrib/atlas
# make rpm
# rpm -Uvh --force /usr/src/redhat/RPMS/i386/atlas*rpm

Rebuild HPL (no need to install it on the frontend if you don't run HPL
on the frontend):

# cd rocks/src/contrib/hpl
# make rpm

Rebuild your distribution:

# cd /home/install
# rocks-dist dist

Reinstall your compute nodes:

# shoot-node compute-0-0 compute-0-1 ...
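The frontend side of the rebuild above can be strung together as one script. This is only a sketch of the steps from the message, assuming a Rocks 2.3.2 frontend with the rocks-src tree already checked out; by default it just prints each command (set DRYRUN=0 to actually run them):

```shell
# Dry-run sketch of the ATLAS/HPL rebuild steps described above.
# Assumes Rocks 2.3.2 paths; DRYRUN=1 (default) only prints the plan.
DRYRUN=${DRYRUN:-1}
PLAN=""

step() {
    PLAN="$PLAN$*; "
    if [ "$DRYRUN" = "0" ]; then "$@"; else echo "+ $*"; fi
}

step cd rocks/src/contrib/atlas
step make rpm
step rpm -Uvh --force /usr/src/redhat/RPMS/i386/atlas*rpm

step cd rocks/src/contrib/hpl
step make rpm

step cd /home/install
step rocks-dist dist
```

After this, shoot-node the compute nodes as above so they pick up the new packages.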
The new HPL package will be bound into the new distribution (rocks-dist
knows to look in /usr/src/redhat/RPMS for new packages). Then you should
be able to run Linpack on your cluster.

************

Here is what one user did to build RPMs for the Pentium III:

I had the same problem with HPL and Linpack on Rocks 2.3.2. You can get
the source RPMs at this location:

ftp://ftp.harddata.com/pub/rocks/athlon/SRPMS/

These binaries are compiled for the Athlon; they will not work on the
PIII. What you need to do for each source RPM file is:

rpmbuild --rebuild --target=i386 atlas.....
rpmbuild --rebuild --target=i386 hpl.....

(Replace "atlas....." and "hpl....." with the complete source RPM
filenames.)

If I remember correctly, the ATLAS rebuild went into a loop on a
question about a Fortran compiler. If that happens you need to edit the
spec file (specification file). To get to this file you must extract the
files from the source RPM. To achieve this do the following:

1. rpm -ivh atlas.....
2. Change into /usr/src/redhat/SPECS
3. vi the atlas.spec file. The section you want to edit is the
   Pentium III section (shown below):

   #Pentium III
   # export PATH=/opt/gcc32/bin:$PATH
   echo "0
   y
   y
   n
   y
   y          <---- This was the line that gave me trouble; I had
                    to remove this line completely.
   linux
   0
   /opt/gcc32/bin/g77
   -0
   y
   " | make
   else

4. Save the file and exit vi.
5. Do a "rpmbuild -ba atlas.spec" from the SPECS directory. This will
   create a new RPM file.
6. Wait for the compile to complete. (Elevator music playing)
7. Change into the /usr/src/redhat/RPMS/i386 directory and retrieve your
   new RPM file for the PIII.
8. Install the new RPM on the frontend and all compute nodes. You will
   also need to reinstall HPL on all nodes as well.

**********

HTH!

Glen

> Before I open the cluster for use I want to set it up so all jobs are
> submitted via SGE/qsub. I can currently submit mpirun directly, so I
> can clearly bypass SGE. Has anyone done this yet?
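The PIII rebuild steps above can be sketched as a script. The SRPM filenames below are hypothetical placeholders (the message elides them as "atlas....." and "hpl....."), so by default it only prints the plan; set DRYRUN=0 on a real build host with the actual filenames:

```shell
# Dry-run sketch of the Pentium III SRPM rebuild described above.
# ATLAS_SRPM and HPL_SRPM are placeholder names -- substitute the real
# source RPM filenames. DRYRUN=1 (default) only prints the plan.
DRYRUN=${DRYRUN:-1}
PLAN=""

step() {
    PLAN="$PLAN$*; "
    if [ "$DRYRUN" = "0" ]; then "$@"; else echo "+ $*"; fi
}

ATLAS_SRPM="atlas-VERSION.src.rpm"   # placeholder filename
HPL_SRPM="hpl-VERSION.src.rpm"       # placeholder filename

step rpmbuild --rebuild --target=i386 "$ATLAS_SRPM"
step rpmbuild --rebuild --target=i386 "$HPL_SRPM"

# If the ATLAS rebuild loops on the Fortran compiler question, install
# the source RPM and rebuild from an edited spec file instead:
step rpm -ivh "$ATLAS_SRPM"
step cd /usr/src/redhat/SPECS
# ...edit the Pentium III section of atlas.spec as described above...
step rpmbuild -ba atlas.spec
```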
> (Not to say that I'm lazy, but of course I am lazy...)
>
> Thanks,
>
> Bill
> --
> Bill Barnard <bill at barnard-engineering.com>
>
> _______________________________________________
> BioBrew-Users mailing list
> BioBrew-Users at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/BioBrew-Users

Glen Otero, Ph.D.
Linux Prophet
619.917.1772