[Bioclusters] SGE on Mac OS X
Chris Dagdigian
bioclusters@bioinformatics.org
Mon, 19 Apr 2004 13:06:27 -0400
Chris Iacovella wrote:
> The instructions are not clear to me as to how to properly install
> gridengine on other execution hosts, and have those nodes communicate
> with the headnode.
>
OK, this is clearer now.
The short answer is this:
You do not have to do anything to execution hosts other than install the
SGE startup script (and make sure the exec hosts are NFS mounting the
sge directory).
The medium length answer is this:
1. During the SGE install process on the head node you will have been
asked for the hostnames of your execution hosts. If you input the
hostnames there then SGE will automatically preconfigure itself to
"know" about the compute nodes. It will also create the default queues
for you from template files if you just hit "Y" to accept the defaults when it
asks about this.
2. If the SGE head node is already aware of the exec hosts and the
queues have all been set up then you only need a few minor things on the
compute nodes:
a. an entry for sge_commd in /etc/services (an environment variable can
override this if you don't want to edit services or NetInfo)
b. NFS mount the SGE_ROOT directory
c. Run the "rcsge" script to start the daemons
That should be it -- the basic rule of thumb is that you can configure
the head node to be aware of compute nodes and queues during the install
process or afterwards (by issuing manual qconf commands). Once this is
done, the compute nodes just need an NFS mount and a startup script.
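
As a rough illustration of items (a) and (b) above, here is a minimal Python sketch that could be run on a compute node before starting the daemons. It is not part of SGE; the function name check_exec_host and the messages it returns are made up for this example, and the path you pass in is whatever your SGE_ROOT happens to be.

```python
import os
import socket


def check_exec_host(sge_root, service="sge_commd"):
    """Return a list of problems found; an empty list means both checks passed.

    Hypothetical helper for illustration only, not an SGE tool.
    """
    problems = []

    # (a) The service name must map to a TCP port, normally via /etc/services.
    try:
        socket.getservbyname(service, "tcp")
    except OSError:
        problems.append("no tcp entry for %s in the services database" % service)

    # (b) The shared SGE directory must be visible (and populated) on this node.
    if not os.path.isdir(sge_root):
        problems.append("%s is not a directory (NFS mount missing?)" % sge_root)
    elif not os.listdir(sge_root):
        problems.append("%s is empty (NFS mount missing?)" % sge_root)

    return problems
```

Calling check_exec_host with your SGE_ROOT path and printing the result gives a quick yes/no before you run the startup script. It does not check step (c), and it only tests the services lookup, not any environment variable override.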
All the config work is done on your head node which you have already
said is working fine. The clients just need to NFS mount the SGE_ROOT,
start the daemons and check in with the qmaster process. If this fails,
it is usually due to network routing or DNS issues.
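
One quick way to rule out the DNS side is to check that each node's name resolves forward and back to something sensible, from both the head node and the compute node. The sketch below uses only the Python standard library; check_name_resolution is a hypothetical helper name, and the hostnames you feed it are your own.

```python
import socket


def check_name_resolution(hostname):
    """Resolve hostname to an address and back; return (address, reverse_name).

    Either element is None if that lookup fails.
    Hypothetical helper for illustration only.
    """
    try:
        address = socket.gethostbyname(hostname)
    except OSError:
        # Forward lookup failed, so there is nothing to reverse.
        return None, None
    try:
        reverse_name = socket.gethostbyaddr(address)[0]
    except OSError:
        reverse_name = None
    return address, reverse_name
```

If the forward lookup fails, or the reverse name differs from the name the head node was configured with, that mismatch is worth fixing before looking anywhere else.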
If setting up execution hosts and default queues is failing during the
install script you can still set them up manually as an SGE admin user
on your working head node. The docs on this are easy -- just look up
information on how to "add queues" and "add execution hosts".
Various problems could be caused by:
o bad hostname resolution or DNS issues within the cluster
o permission or uid/gid mismatch errors on the NFS mount
o firewalls blocking sge_commd traffic
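
For the firewall case, a plain TCP connection test from a compute node to the head node can tell you whether traffic is getting through at all. This is only a sketch under assumptions: can_connect is a made-up helper, and the port to test is whatever your sge_commd entry in /etc/services says, which I am not going to guess here.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration only.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False
```

A False result does not prove a firewall is to blame (the daemon may simply not be running), but a True result means the network path to that port is open.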
-Chris