[Bioclusters] [long-ish] Advice on getting started with clustering, LSF, Xserve?

Thu, 7 Nov 2002 18:32:58 -0500 (EST)

Make sure you read through the archives of
this list. I think you will find some helpful
suggestions there to make your upcoming decisions easier.

From reading your post it seems that you might be under
some pressure to employ existing, albeit heterogeneous,
machines to solve some problems. There are solutions
which, for example, use Java as a layer to distribute
work across a variety of architectures.

There are also pipelines available which might use things like
SOAP, XML, etc. to abstract the analysis from the researcher
(but not from the system administrator who has to set it
all up and make it work). Check www.bioinformatics.org for
a list of budding projects. These will give you an
idea of how people are designing software.

Just be careful to distinguish solutions whose
underlying technology exists to borrow or steal cycles
from dormant or lightly loaded machines from solutions
which are intended to be used with a known number of machines.

In the cycle stealing/borrowing scenario
it may never be clear just how many systems will be
working on a given job, making it difficult to give
good estimates of completion times. This is a problem
if you are attempting to accommodate the interests of
several researchers, all of whom need data by a grant
deadline.
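
To put rough numbers on that, here is a minimal sketch (Python, purely
for illustration) of how the estimate widens when the pool size is only
known to lie in a range. The job count, per-job time, and machine counts
are made up, and the model assumes equal-sized jobs and a pool that
stays the same size for the whole run, which real cycle stealing will
not give you.

```python
import math

def completion_hours(num_jobs, hours_per_job, workers):
    """Wall-clock hours to finish num_jobs equal-sized jobs on a fixed pool."""
    if workers < 1:
        raise ValueError("need at least one worker")
    # Jobs run in waves of `workers` at a time; the last wave may be partial.
    waves = math.ceil(num_jobs / workers)
    return waves * hours_per_job

def completion_range(num_jobs, hours_per_job, min_workers, max_workers):
    """Best and worst case when the pool size is only known to lie in a range."""
    best = completion_hours(num_jobs, hours_per_job, max_workers)
    worst = completion_hours(num_jobs, hours_per_job, min_workers)
    return best, worst

# Dedicated cluster: 16 nodes, always available (illustrative numbers).
print(completion_hours(400, 0.5, 16))      # 12.5
# Cycle stealing: somewhere between 4 and 40 desktops happen to be idle.
print(completion_range(400, 0.5, 4, 40))   # (5.0, 50.0)
```

With a known pool you can quote one number; with borrowed cycles the
honest answer is a range, and here it spans a factor of ten.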

As for some practical details, I use a modestly sized cluster
of Appro computers with Red Hat 7.3 and Platform LSF 5.0 to run
BLAST against the human genome and to provide a web-based BLAST service
for my users. Right now we back-end it with some Perl and
use LSF to distribute jobs and collect results. All in all
a good setup, though I'm starting to run Grid Engine in parallel
(no pun intended) to test its suitability as a total replacement.
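
For a flavour of what such a back end does, here is a minimal sketch of
building one LSF submission per query chunk. It is written in Python
rather than the Perl mentioned above, and it is not the actual code in
use. The file paths, the database path, and the queue name "normal" are
assumptions; the bsub options (-q, -J, -o) and the blastall options
(-p, -d, -i, -o) are the standard ones. The sketch only builds and
prints the commands, it does not submit anything.

```python
import os
import shlex

def bsub_command(chunk_path, database, queue="normal"):
    """Return an argv list that submits one BLAST run to LSF via bsub."""
    out_path = chunk_path + ".blast"   # BLAST report for this chunk
    log_path = chunk_path + ".log"     # LSF job output (stdout/stderr)
    job_name = "blast_" + os.path.basename(chunk_path)
    blast = ["blastall", "-p", "blastn", "-d", database,
             "-i", chunk_path, "-o", out_path]
    return ["bsub", "-q", queue, "-J", job_name, "-o", log_path] + blast

# Hypothetical query chunks and database location.
chunks = ["chunks/q001.fa", "chunks/q002.fa"]
for chunk in chunks:
    print(shlex.join(bsub_command(chunk, "/data/blast/human")))
```

Collecting results then amounts to waiting for the jobs to finish and
reading the per-chunk .blast files back in.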

My interest in replacing LSF with SGE (Sun Grid Engine) is based on
the price of LSF licenses. However, I notice that attractive
discounts are available when purchasing through particular
resellers. LSF is good, flexible software, no doubt about it.
But we are always under pressure to do more with less money,
hence my interest in SGE.

I've also been using TurboGenomics TurboBlast on the cluster,
which uses Java to distribute the work. (There is also
a Python dependency.) It comes with some wrappers and
suggestions for use with PBS. I'm not restricted to just my cluster
nodes. I could run a "blastworker" on a Windows 2000 box
(after installing Python and Java, of course) and
establish some load thresholds that keep the system
usable for interactive purposes. I do spend a lot of time
adjusting parameters on each client to make sure that
the optimal number of machines is engaged. But I also
spend time tuning the LSF software to make job distribution
easier. Refinement and tuning are ongoing.
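
The load-threshold idea can be sketched in a few lines. This is not
TurboBlast's configuration or API, just an illustration of the general
technique, and the function name and threshold values are hypothetical.
It uses two thresholds (stop above one, resume below a lower one) so a
machine hovering near the limit does not flip in and out of the pool.

```python
def next_state(accepting, load, stop_above=1.0, resume_below=0.5):
    """Decide whether a worker should accept new work given the current load.

    accepting: whether the worker is currently taking jobs
    load: current load average of the machine
    """
    if accepting and load > stop_above:
        return False   # someone is using the machine: back off
    if not accepting and load < resume_below:
        return True    # machine has gone quiet again: rejoin the pool
    return accepting   # in between: keep doing what we were doing

state = True
history = []
for load in [0.2, 0.8, 1.4, 0.9, 0.6, 0.3]:
    state = next_state(state, load)
    history.append(state)
print(history)   # [True, True, False, False, False, True]
```

On a Unix node the load value could come from os.getloadavg(); a
Windows box would need some other source, which is part of why tuning
each client takes time.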

As far as organizing the cluster goes, I argue that the more
assumptions you can make about the architecture, operating
system, and filesystem layout, the less work you have waiting
for you after things are set up. There are some fine tools
which let you provision and image disks so you can manage
various architectures and toss their images onto servers
for quick reinstalls and restores in the event of hardware
failure.

If you don't have much IT support staff or free student labor
then check out the vendors who will sell you a rack-in-a-box.
RLX (their Control Tower software is very nice), Microway,
etc. all have appealing solutions that are prepackaged in a
way that minimizes startup costs and labor.

I tend to toss data out onto the nodes, which means I
might use software RAID to improve performance. I don't
do much over NFS except to serve the LSF tree, but even
that's not a requirement. I like to cache data locally,
which means quicker BLAST runs in my case. I split up
my databases, so when I add new nodes I resplit and
redistribute (programmatically, of course).
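
As an illustration of the resplit step, here is a minimal sketch that
divides FASTA records into one roughly equal-sized piece per node. It
is one possible approach (largest record first, always into the
currently smallest piece), not a description of the scripts actually in
use, and the function names and toy sequences are made up.

```python
import heapq

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line, []
        else:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def split_balanced(records, num_nodes):
    """Assign records to num_nodes bins, largest first, into the emptiest bin."""
    bins = [[] for _ in range(num_nodes)]
    heap = [(0, i) for i in range(num_nodes)]   # (total residues, bin index)
    heapq.heapify(heap)
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        total, i = heapq.heappop(heap)
        bins[i].append((header, seq))
        heapq.heappush(heap, (total + len(seq), i))
    return bins

def bin_sizes(bins):
    """Total sequence length held by each bin."""
    return [sum(len(seq) for _, seq in b) for b in bins]

fasta = [">a", "ACGTACGT", ">b", "ACGT", ">c", "AC", ">d", "ACGTAC"]
records = list(read_fasta(fasta))
print(bin_sizes(split_balanced(records, 2)))   # [10, 10]
# A third node arrives: resplit the same records three ways.
print(bin_sizes(split_balanced(records, 3)))   # [8, 6, 6]
```

Each piece would still need to be written out and formatted for BLAST
(formatdb, for example) on its node before it can be searched.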

Depending on the nature of your development you might
want to check out single system image (SSI) models
(MOSIX, Scyld) which, grossly simplified, let you
treat a group of systems as a single process space.

As far as Xserves go, I've yet to get my hands on an eval unit
so I can't say much about them. If their advertised price is the
actual street price then I can buy two of what I'm already
using (dual Athlon 1800+, 2x40 GB HD, and 2 GB RAM) for the
same money. The Xserves are worth checking out though.

Good Luck

On Wed, 6 Nov 2002, Simon Twigger wrote:

> Hi there,
> 
> I stumbled across this mailing list today searching through some  
> bioperl archives and I'm hoping that someone out there can point me in  
> the right direction to get myself up to speed on bioclusters, both on  
> the hardware and software side. I'm more from the bio-side of
> bioinformatics and I'm trying to understand more of the nitty gritty
> informatics/computer part!
> 
> We've been writing bioinformatics software in perl/java for a while and  
> we've got Oracle and MySQL databases and we run all the usual  
> genome/sequence analysis packages (blast, blat, etc) plus some of our  
> own annotation pipelines. Historically we've been running these on  
> multiple machines but not really in a cluster with robust load  
> management software or any significant modifications to how we write  
> our code to enable it to scale in a multiprocessor environment. I'm  
> trying to find out better ways to use our Sun, Compaq and (probably)  
> MacOS machines, how to get them all working together to handle both  
> genomic and proteomic analyses and how to modify our existing and new  
> code to work in this environment.
> 
> I'd love to find some sort of 'bioclustering for dummies' that outlines  
> the usual solutions and approaches, also on the software side something  
> that describes the fundamentals of writing perl and java to exploit  
> clusters and even some simple examples/test packages that I could play  
> with to get my feet wet.
> 
> A few specific things that I'm thinking about, perhaps people can
> comment on my rationale:
> We have a variety of platforms and it would be great to make them all  
> play together - LSF appears to be a good solution to handle load  
> balancing on a heterogeneous set of servers (we have Sun, Compaq and  
> will probably add Xserves into the mix), from my reading the downside  
> is the price ($400 per server was a price I saw quoted on the list).  
> Ease of administration seems to be another pro for LSF, which is a big
> thing as we just want it to work, we don't really want to babysit this
> stuff - what sort of sysadmin commitment is needed to make this work?
> 
> I'm personally interested in trying the Xserve, the storage capacity,
> speed, price, etc. all make it attractive as an alternative to our
> traditional options. Oracle is coming out for OS X (and the developer
> release is running on my Powerbook as we speak) so that's another good
> thing. I'm doing all my development on a G4 with 10.2 and it's great, any
> thoughts/experiences with using Xserve in the mix with other platforms
> and Xserve vs Intel solutions?
> 
> Many thanks for any help anyone can give a newbie in the field!
> 
> Simon.
> 
> 
> --------------------------------------------------------------------------------------------------
> Simon Twigger, Ph.D.
> Assistant Professor, Bioinformatics Research Center
> 
> Medical College of Wisconsin
> 8701 Watertown Plank Road,
> Milwaukee, WI, 53226
> tel. 414-456-8802, fax 414-456-6595
> 
> _______________________________________________
> Bioclusters maillist  -  Bioclusters@bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters
>