[Bioclusters] Oscar Linux cluster and Local Blast Server

chris dagdigian bioclusters@bioinformatics.org
Thu, 29 Aug 2002 16:13:54 -0400


Lance Davidow wrote:

> At Massachusetts General Hospital, we are looking to set up a local 
> blast server, preferably with the NCBI algorithm, on a computing
> cluster. We have installed the OSCAR clustering package
> http://sourceforge.net/projects/oscar/
> on a small
> test cluster of computers running Red Hat Linux 7.2 on Intel. Does 
> anyone know of an
> open source blast-for-cluster package that would run under this 
> cluster management suite or is there a
> commercial package you are already running that you think is 
> worthwhile??? We are also considering Apple's Xserve with the G4 
> accelerated blast to compare against the above system.


Hi Lance,

We build such systems all the time. The closest one to you is probably 
over at the Harvard Bauer Center for Genomics Research across the river 
from your institution. The setup is pretty vanilla: a 60-CPU compute farm 
with a GigE network core, customized Red Hat 7.2, Platform LSF and about 
3 terabytes of fast network attached storage. Drop me a line directly if 
you'd like me to arrange a tour or meet up somewhere for an informal 
whiteboard/technical chat about how and why it was built as it was. 
George Church has given it his blessing :)

You have lots of options and the people on this mailing list have 
implemented or are actively using almost all of them so this is a good 
place to bounce ideas and experiences off of.

What you should deploy depends mostly on what you want to do with your 
blast service and what your budget is :) . If you are primarily going to 
use it for a high throughput pipeline where you know you are going to be 
doing many thousands of searches against the same database before moving 
on to the next large query set then you may get the best overall deal by 
purchasing one of the specialized hardware acceleration packages from 
companies like TimeLogic and Paracell. Hardware accelerators have their 
drawbacks but they can be very good for high throughput pipelines where 
you just need a fast resource that acts like a "black box" and has a 
very trivial administrative burden which can be very, very important in 
groups with limited IT staff.

If you are not interested in hardware-based solutions then you have two 
other basic options, both of which run on top of your basic cluster and 
in some cases can expand past your cluster to harness CPU cycles from 
other systems and/or desktop boxes.

(1) Add a general-purpose resource management layer ('DRM') like 
Platform LSF or Sun GridEngine to your cluster so you have a 'compute 
farm' capable of running and distributing load from many different 
informatics applications across inexpensive commodity hardware

(2) Purchase application-specific software from companies like 
Blackstone Computing and TurboGenomics both of whom have blast-specific 
product offerings as well as larger frameworks that can handle other 
distributed computational requirements

RLX falls into the (1) category -- they sell Transmeta or PIII based 
blade systems and they struck a deal with Platform so they can bundle 
LSF into their product offerings. This is primarily what allows them to 
offer turnkey blast farms although I bet there is some extra software 
and scripting behind the scenes that goes into the whole package.  I 
personally don't think the current generation of blade servers from any 
vendor is suitable for blast farms because far too many of them rely on 
cheesy 4200 RPM laptop disk drives for their local disk IO, which is far 
too slow for life science where many of our apps are IO bound to begin 
with. Many blades also have significant max memory limits. I 
won't seriously consider blades until the form factors allow for 
full-size ATA or SCSI drives (Compaq just announced this in their new blade 
chassis). This is just my opinion of course and to their credit RLX has 
successfully sold very large blade packages into places like the Sanger 
Centre so they must be doing something really good (although Sanger was 
already an LSF-based shop to begin with).

In option (1), "DRM" == "Distributed Resource Management". It is 
basically a software layer that allows you to link together many loosely 
coupled (potentially heterogeneous) servers. The DRM layer is what 
handles the process of scheduling your job, executing it on the best 
available resource and returning the results back to you.  In a normal 
compute-farm based blast system the DRM is used mainly to schedule and 
farm out your many standalone ncbi-blast or wu-blast jobs across N 
different compute servers. What you end up with is a system that, 
depending on how you tune it, is optimized for either (a) very fast 
turnaround on individual jobs or (b) very high aggregate throughput.
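
To make the "farm out many standalone blast jobs" idea concrete, here is a 
minimal sketch in Python. It is only an illustration, not how any 
particular site does it. It splits a multi-sequence FASTA query file into 
chunks and builds one LSF submission command per chunk. The function names, 
the chunk file naming and the round-robin split are my own assumptions; the 
only external commands referenced are bsub with its -o option and NCBI 
blastall with -p, -d, -i and -o. The sketch only builds the command lines, 
it does not submit anything.

```python
import shlex
from pathlib import Path


def read_fasta_records(text):
    """Split FASTA text into a list of records, each starting with '>'."""
    records, current = [], []
    for line in text.splitlines():
        if line.startswith(">") and current:
            records.append("\n".join(current) + "\n")
            current = []
        if line.strip():
            current.append(line)
    if current:
        records.append("\n".join(current) + "\n")
    return records


def split_round_robin(records, n_chunks):
    """Deal records into at most n_chunks non-empty chunks."""
    chunks = [records[i::n_chunks] for i in range(n_chunks)]
    return [c for c in chunks if c]


def build_jobs(query_fasta, database, workdir, n_chunks, program="blastn"):
    """Write chunk files (hypothetical naming) and return one command per chunk.

    Each command is an argument list: 'bsub -o <log>' wrapped around a
    standalone 'blastall' run against the same database.
    """
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    records = read_fasta_records(Path(query_fasta).read_text())
    commands = []
    for i, chunk in enumerate(split_round_robin(records, n_chunks)):
        chunk_path = workdir / f"chunk_{i:03d}.fa"
        chunk_path.write_text("".join(chunk))
        blast = [
            "blastall", "-p", program, "-d", database,
            "-i", str(chunk_path),
            "-o", str(chunk_path.with_suffix(".out")),
        ]
        log_path = str(chunk_path.with_suffix(".log"))
        commands.append(["bsub", "-o", log_path] + blast)
    return commands


def format_command(command):
    """Render an argument list as a shell-safe command line."""
    return " ".join(shlex.quote(arg) for arg in command)
```

Calling build_jobs("queries.fa", "nt", "chunks", 8) (file and database 
names are placeholders) would write up to eight chunk files and return the 
matching bsub command lines, which you could print with format_command() 
and review before running. The number of chunks is one of the knobs you 
would play with when tuning for turnaround versus throughput. With 
GridEngine the idea is the same, with qsub in place of bsub.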

The nice thing about a DRM-based compute farm is that you are not 
limited to only running Blast queries -- any program that can execute on 
any of your servers can be used in this way so you have a very flexible 
and scalable research computing / informatics platform that can cope 
with changing times and scientific advances. This is why so many people 
are building them.

Platform LSF is a really nice best-of-breed commercial DRM package for 
clusters and compute farms. It is a very good product and it is 
certainly not cheap although I believe that LSF's price is worth it for 
medium-to-large clusters when you factor in the savings you get in 
increased uptime, better robustness and lower administrative burden. 
(The cost of LSF licensing is trivial if it allows you to avoid having 
to hire an additional cluster admin -- people don't think about human 
and administrative costs enough.) Recently, though, I think Platform has 
gotten a bit of an attitude -- they used to be very responsive and eager 
to deal with life science customers but now the sales force is getting 
complacent and arrogant. If they don't shape up and rethink their 
pricing and sales practices they are going to get their asses kicked by 
GridEngine (which is free and improving at an incredible rate) over the 
next few years.

If the DRM / "compute farm" approach is attractive to you and you have a 
limited budget then consider evaluating Sun GridEngine 
(http://gridengine.sunsource.net). It is a very solid product that is 
making huge gains in the life sciences and is maturing at an amazing 
rate. A year ago there were almost no GridEngine-based 'bioclusters' and 
now they are popping up all over the place. There is a very active 
GridEngine mailing list with a great signal-to-noise ratio that you can 
turn to for support and tuning issues.

OK, that was option (1); now on to option (2), the commercial software offerings:

(1) TurboGenomics has a whole framework for distributed informatics 
computation -- TurboBlast is just one of their targeted offerings. I 
can't speak much about their stuff as I'm not that familiar.

(2) Blackstone Computing. Blackstone has a 'big' product called 
PowerCloud that can pretty much encapsulate and expand upon what a 
cluster DRM software layer can do. They also have a targeted suite of 
modular tools including 'PowerBlast' that just sticks to doing one 
thing fast. I used to be a Blackstone employee so I really should know 
this stuff but I've been gone for almost a year and they have changed 
lots of things. Regardless there are a few Blackstone people on this 
list who can correct/expand if necessary.

There are probably others that I've missed. You can probably find them 
in the archives of this list. I think Joe Landman recently posted a 
survey of what is out there and available.

Apple Xserve
============
Coincidentally enough, we (bioteam.net) are building an Xserve blast farm 
with Platform LSF as part of a day-long seminar / hands-on lab that will 
take place on September 10th at Apple's Market Center in Boston. The 
event is free, registration is still open and the full details and 
invite can be found online at 
http://bioteam.net/MacOSX/Xserve-Clustering-Event.html

My usual disclaimer: All of this is just my opinion; don't take anything 
as gospel!

Regards,
Chris

-- 
Chris Dagdigian, <dag@sonsorol.org>
Bioteam.net - Independent Bio-IT & Informatics consulting
Office: 617-666-6454, Mobile: 617-877-5498, Fax: 425-699-0193
PGP KeyID: 83D4310E Yahoo IM: craffi Web: http://bioteam.net