[Bioclusters] Re: Best ways to tackle migration to dedicated cluster/Farm

Ross Crowhurst bioclusters@bioinformatics.org
Fri, 26 Mar 2004 10:15:32 +1200

>From: "Chris Dwan (CCGB)" <cdwan@mail.ahc.umn.edu>
>Do you have a feeling for where this anti-cycle-stealing attitude comes
>from?  Like Chris Dagdigian said, it sounds like you've got things
>well in hand, and there are a wide variety of examples of folks using
>production workstations to augment their dedicated clusters.  If the
>concerns are reliability, security, performance impact, or other
>things, those can be worked with numbers and tests.

Our system actually uses dual-OS desktops, so it's really an ad-hoc
cluster/farm. The primary OS is Windows XP, which is used during normal
working hours but plays no part in our bioinformatics pipeline, so we
are not actually cycle-stealing, much as I would like to. When staff
leave at night they simply reboot their desktops (Linux - RH8.0 is the
default OS on these machines), so the staff member happily goes home
and their machine boots to Linux, synchronises itself and joins our
pipeline. So it's not really cycle stealing, just utilisation outside
normal hours. Apologies if I did not make this clear initially. The key
issues are therefore OS management issues - TOC (need for a second hard
disk, time to install second OS, how to roll out updates for security
patches etc).

>If, on the other hand, it's concern over trying something new, the
>system recently implemented at Novartis is a decent example of a
>corporation gaining a great deal of horsepower this way.

Can you provide more details on this or point me in the right direction
to get more info please?

>I will almost never get in the way of someone who wants to go to a
>dedicated, cluster system if they need it to get their work done, and
>they have the money to spend.  A well thought out, centralized system
>will almost always be easier (and cheaper) to administer than a cycle
>stealing solution.


>Pull systems (including SETI@home and friends, as well as Platform's
>offering) are best suited for cycle stealing and ad-hoc clustering.  They
>can be really troublesome to debug, since what gets lost in the case of
>errors is usually job state, instead of cycles on nodes.  Instead
>of the cluster operating at less than peak efficiency, it loses jobs.
>This can be frustrating for both users and admins.

Lost jobs are not really an issue for us. We have processes running to
reap jobs that have not completed and reset their availability status.
Most of our runs are short enough that jobs lost on nodes that "leave"
the pipeline can be reaped and reassigned without holding up the overall
pipeline unduly. I have reapers that check the online status of nodes
hourly (could be done more frequently) and reset their status where the
node has left the farm. Additionally there are reapers that run based
on expected times for completion of different job types and on the
number of jobs left to do in that batch (time between checks decreases
as the number of jobs decreases), so we tend to catch lost jobs.
Efficiency in a pull system does however decrease at the end of a batch
run, where chunks of jobs are grabbed by the nodes and you have no
control over which nodes grab the jobs. If the last few chunks are
grabbed by the slowest nodes then the pipeline will have to wait for
these to complete before it moves on. To get around this we run
multiple pipelines so the nodes are pretty much working all the time.
We only really have a "single user" in our pipeline, which is the
automated pipeline control process itself. I appreciate your comments
about frustrations for users and admins with "pull" approaches where
you have multiple users - they definitely could experience delays.
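
As an illustration only, the reaping scheme described above might be
sketched like this in Python. This is not the pipeline's actual code:
the Job record, the function names, the "grace" multiplier on expected
run time and the linear scaling of the check interval are all
assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Job:
    # Hypothetical job record; field names are assumptions for this sketch.
    job_id: int
    job_type: str
    status: str = "available"          # "available", "running" or "done"
    node: Optional[str] = None         # node that grabbed the job, if any
    started_at: Optional[float] = None  # start time in seconds


def reap(jobs: List[Job], online_nodes: Set[str],
         expected_secs: Dict[str, float], now: float,
         grace: float = 2.0) -> List[int]:
    """Reset running jobs whose node has left the farm, or that have run
    longer than grace * expected time for their job type.
    Returns the ids of the jobs made available again."""
    reaped = []
    for job in jobs:
        if job.status != "running":
            continue
        node_gone = job.node not in online_nodes
        overdue = now - job.started_at > grace * expected_secs[job.job_type]
        if node_gone or overdue:
            # Make the job available so another node can pull it.
            job.status, job.node, job.started_at = "available", None, None
            reaped.append(job.job_id)
    return reaped


def check_interval(jobs_left: int, batch_size: int,
                   max_secs: float = 3600.0,
                   min_secs: float = 60.0) -> float:
    """Time to wait before the next reaper pass: hourly for a full batch,
    shrinking linearly (an assumed rule) as fewer jobs remain."""
    if batch_size <= 0:
        return min_secs
    return max(min_secs, max_secs * jobs_left / batch_size)
```

A control process could call reap() in a loop and sleep for
check_interval(jobs_left, batch_size) seconds between passes, so lost
jobs near the end of a batch are picked up sooner than they would be
with a fixed hourly check.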
