[Bioclusters] SGE on Mac OS X

Mon, 19 Apr 2004 13:06:27 -0400

Chris Iacovella wrote:

> The instructions are not clear to me as to how to properly install 
> gridengine on other execution hosts, and have those nodes communicate 
> with the headnode.
> 

OK, this is clearer now...

The short answer is this:

You do not have to do anything to execution hosts other than install the 
SGE startup script (and make sure the exec hosts are NFS mounting the 
sge directory).

The medium length answer is this:

1. During the SGE install process on the head node you will have been 
asked for the hostnames of your execution hosts. If you input the 
hostnames there then SGE will automatically preconfigure itself to 
"know" about the compute nodes. It will also create the default queues 
for you from template files if you just answer "Y" to the defaults when 
it asks about this.

2. If the SGE head node is already aware of the exec hosts and the 
queues have all been set up then you only need a few minor things on the 
compute nodes:

   a. An entry for sge_commd in /etc/services (the COMMD_PORT environment 
variable will override this if you don't want to edit services or netinfo)

   b. NFS mount the SGE_ROOT directory

   c. Run the "rcsge" script to start the daemons
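
As a rough illustration of steps a through c, here is a minimal Python 
sketch of a preflight check you could run on a compute node before 
starting the daemons. It is not part of SGE. The function names, the 
cell name "default", and the location of rcsge under 
$SGE_ROOT/<cell>/common are my assumptions, so adjust them to match your 
install. It only reads an /etc/services-style file, so it will not 
notice an environment variable override or a netinfo-only entry.

```python
import os

def find_service_port(services_text, name="sge_commd"):
    """Return the TCP port listed for `name` in /etc/services-style text,
    or None if there is no such entry."""
    for raw in services_text.splitlines():
        # Drop trailing comments, then split "name port/proto [aliases]".
        fields = raw.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == name:
            port, _, proto = fields[1].partition("/")
            if proto == "tcp" and port.isdigit():
                return int(port)
    return None

def preflight(sge_root, cell="default", services_path="/etc/services"):
    """Return a list of problems found; an empty list means the basics
    look OK. `cell` and the rcsge path layout are assumptions."""
    problems = []

    # Step a: is there an sge_commd entry in the services file?
    try:
        with open(services_path) as fh:
            port = find_service_port(fh.read())
    except OSError:
        port = None
    if port is None:
        problems.append("no sge_commd/tcp entry found in " + services_path)

    # Steps b and c: is SGE_ROOT visible, and can we see the startup script?
    rcsge = os.path.join(sge_root, cell, "common", "rcsge")
    if not os.path.isdir(sge_root):
        problems.append(sge_root + " is not a directory (NFS mount missing?)")
    elif not os.path.isfile(rcsge):
        problems.append("startup script not found at " + rcsge)

    return problems
```

Calling preflight() with the path where your nodes mount the SGE 
directory returns a list of strings, one per problem found. An empty 
list means the node is probably ready for the startup script.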

That should be it -- the basic rule of thumb is that you can configure 
the head node to be aware of compute nodes and queues during the install 
process or afterwards (by issuing manual qconf commands). Once this is 
done, the compute nodes just need an NFS mount and a startup script.

All the config work is done on your head node, which you have already 
said is working fine. The clients just need to NFS mount the SGE_ROOT, 
start the daemons and check in with the qmaster process. If this fails 
it is usually due to network routing or DNS issues.

If setting up execution hosts and default queues fails during the 
install script, you can still set them up manually as an SGE admin user 
on your working head node. The docs on this are easy -- just look up 
information on how to "add queues" and "add execution hosts".
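
For what it's worth, the relevant commands are "qconf -ae" (add an 
execution host) and "qconf -aq" (add a queue), both of which open an 
editor on a template, and "qconf -sel", which lists the execution hosts 
the qmaster knows about. Below is a small Python sketch, with helper 
names of my own invention, that compares the "qconf -sel" output against 
the nodes you expect to see. It assumes qconf is on your PATH and that 
the output is one hostname per line.

```python
import subprocess

def registered_exec_hosts():
    """Ask the qmaster for its execution host list via `qconf -sel`.
    Assumes qconf is on PATH and prints one hostname per line."""
    out = subprocess.run(
        ["qconf", "-sel"], capture_output=True, text=True, check=True
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]

def missing_hosts(expected, registered):
    """Return the expected hostnames that the qmaster does not list.
    The comparison is case-insensitive but otherwise exact, so short
    names and fully qualified names are treated as different hosts."""
    known = {h.lower() for h in registered}
    return [h for h in expected if h.lower() not in known]
```

Anything returned by missing_hosts(your_node_list, registered_exec_hosts()) 
still needs to be added by hand on the head node.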

If the exec hosts still fail to check in, the usual causes are:

  o bad hostname resolution or DNS issues within the cluster
  o permission or uid/gid mismatch errors on NFS mount
  o firewalls blocking sge_commd traffic
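
If you want to narrow down which of these it is, something like the 
following Python sketch can be run on both the head node and a compute 
node. The function names are made up for illustration, and the port you 
pass to port_reachable() should be whatever your site uses for 
sge_commd. Treat it as a starting point rather than a complete 
diagnostic.

```python
import os
import socket

def check_name_resolution(hostname):
    """Forward-resolve hostname, then reverse-resolve the address.
    Returns (address, reverse_name); raises OSError if either lookup fails."""
    addr = socket.gethostbyname(hostname)
    reverse_name = socket.gethostbyaddr(addr)[0]
    return addr, reverse_name

def names_match(hostname, reverse_name):
    """Compare two hostnames on their short (first label) form, ignoring
    case, so node01 and node01.cluster.example count as the same host."""
    def short(name):
        return name.split(".")[0].lower()
    return short(hostname) == short(reverse_name)

def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds. A False
    result may point at a firewall or at a daemon that is not running."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def owner_ids(path):
    """Return (uid, gid) of path. Run this against the SGE directory on
    the head node and on a compute node; differing numbers suggest a
    uid/gid mismatch across the NFS mount."""
    st = os.stat(path)
    return st.st_uid, st.st_gid
```

A node whose forward and reverse lookups disagree, or that cannot open a 
TCP connection to the head node on the sge_commd port, is the first 
place to look.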

-Chris