[Bioclusters] Oscar Linux cluster and Local Blast Server
chris dagdigian
bioclusters@bioinformatics.org
Thu, 29 Aug 2002 16:13:54 -0400
Lance Davidow wrote:
> At Massachusetts General Hospital, we are looking to set up a local
> blast server, preferably with the NCBI algorithm, on a computing
> cluster. We have installed the OSCAR clustering package
> http://sourceforge.net/projects/oscar/
> on a small
> test cluster of computers running Red Hat Linux 7.2 on Intel. Does
> anyone know of an
> open source blast-for-cluster package that would run under this
> cluster management suite or is there a
> commercial package you are already running that you think is
> worthwhile??? We are also considering Apple's Xserve with the G4
> accelerated blast to compare against the above system.
Hi Lance,
We build such systems all the time. The closest one to you is probably
over at the Harvard Bauer Center for Genomics Research across the river
from your institution. The setup is pretty vanilla-- 60 CPU compute farm
with a GigE network core, customized Redhat 7.2, Platform LSF and about
3 terabytes of fast network attached storage. Drop me a line directly if
you'd like me to arrange a tour or meet up somewhere for an informal
whiteboard/technical chat about how and why it was built as it was.
George Church has given it his blessing :)
You have lots of options and the people on this mailing list have
implemented or are actively using almost all of them so this is a good
place to bounce ideas and experiences off of.
What you should deploy depends mostly on what you want to do with your
blast service and what your budget is :) . If you are primarily going to
use it for a high throughput pipeline where you know you are going to be
doing many thousands of searches against the same database before moving
on to the next large query set then you may get the best overall deal by
purchasing one of the specialized hardware acceleration packages from
companies like TimeLogic and Paracell. Hardware accelerators have their
drawbacks but they can be very good for high throughput pipelines where
you just need a fast resource that acts like a "black box" and carries a
trivial administrative burden. That low overhead can be very, very
important in groups with limited IT staff.
If you are not interested in hardware based solutions then you have two
other basic options, both of which run on top of your basic cluster and
in some cases can expand past your cluster to harness CPU cycles from
other systems and/or desktop boxes.
(1) Add a general purpose resource management layer ('DRM') like
Platform LSF or Sun GridEngine to your cluster so you have a 'compute
farm' capable of running and distributing load from many different
informatics applications across inexpensive commodity hardware.
(2) Purchase application-specific software from companies like
Blackstone Computing and TurboGenomics, both of which have blast-specific
product offerings as well as larger frameworks that can handle other
distributed computational requirements.
RLX falls into the (1) category -- they sell Transmeta or PIII based
blade systems and they struck a deal with Platform so they can bundle
LSF into their product offerings. This is primarily what allows them to
offer turnkey blast farms although I bet there is some extra software
and scripting behind the scenes that goes into the whole package. I
personally don't think the current generation of blade servers from any
vendor are suitable for blast farms because far too many of them rely on
cheesy 4200 RPM laptop disk drives for their local disk IO which is far
too slow for life science where many of our apps are IO bound to begin
with. Many blades also have significant max memory limits. I
won't seriously consider blades until the form factors allow for full
size ATA or SCSI drives (Compaq just announced this in their new blade
chassis). This is just my opinion of course and to their credit RLX has
successfully sold very large blade packages into places like the Sanger
Centre so they must be doing something really good (although Sanger was
already an LSF-based shop to begin with).
In option (1) a DRM == "Distributed Resource Management". It is
basically a software layer that allows you to link together many loosely
coupled (potentially heterogeneous) servers. The DRM layer is what
handles the process of scheduling your job, executing it on the best
available resource and returning the results back to you. In a normal
compute-farm based blast system the DRM is used mainly to schedule and
farm out your many standalone ncbi-blast or wu-blast jobs across N
different compute servers. What you end up with is a system that,
depending on how you tune it, is optimized for either (a) very fast
turnaround on individual jobs or (b) very high aggregate throughput.
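To make the farm-out step concrete, here is a minimal Python sketch of how a large query set might be split into chunks, with one standalone blast job submitted per chunk. It is only an illustration under stated assumptions: the function names, chunk file names and the "nt" database name are hypothetical, the submission line assumes LSF's bsub with its -o output option, and the blast line assumes the standalone NCBI blastall command with its -p, -d, -i and -o options. A real site would add queue selection, error handling and a step to merge the results.

```python
from typing import Iterator, List


def read_fasta_records(text: str) -> Iterator[str]:
    """Yield each FASTA record (header line plus sequence lines) as one string."""
    record: List[str] = []
    for line in text.splitlines():
        # A new header line closes the record collected so far.
        if line.startswith(">") and record:
            yield "\n".join(record) + "\n"
            record = []
        if line.strip():
            record.append(line)
    if record:
        yield "\n".join(record) + "\n"


def split_into_chunks(records: List[str], n_chunks: int) -> List[List[str]]:
    """Deal records round-robin into at most n_chunks non-empty chunks."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks = [records[i::n_chunks] for i in range(n_chunks)]
    return [chunk for chunk in chunks if chunk]


def blast_job_command(chunk_path: str, database: str,
                      program: str = "blastn") -> List[str]:
    """Build (but do not run) an assumed LSF submission for one blastall job."""
    blast = ["blastall", "-p", program, "-d", database,
             "-i", chunk_path, "-o", chunk_path + ".out"]
    # bsub treats everything after its own options as the command to run.
    return ["bsub", "-o", chunk_path + ".log"] + blast


if __name__ == "__main__":
    queries = ">seq1\nACGTACGT\n>seq2\nGGCC\n>seq3\nTTAA\n"
    records = list(read_fasta_records(queries))
    for index, chunk in enumerate(split_into_chunks(records, 2)):
        path = "chunk_%d.fa" % index  # hypothetical chunk file name
        # In a real pipeline the chunk would be written to shared storage here.
        print(" ".join(blast_job_command(path, "nt")))
```

The number of chunks is one of the tuning knobs: splitting one query set across many nodes shortens the turnaround for that set, while running larger chunks as independent jobs tends to favor aggregate throughput when many users share the farm.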
The nice thing about a DRM-based compute farm is that you are not
limited to only running Blast queries -- any program that can execute on
any of your servers can be used in this way so you have a very flexible
and scalable research computing / informatics platform that can cope
with changing times and scientific advances. This is why so many people
are building them.
Platform LSF is a really nice best-of-breed commercial DRM package for
clusters and compute farms. It is a very good product and it is
certainly not cheap although I believe that LSF's price is worth it for
medium-to-large clusters when you factor in the savings you get in
increased uptime, better robustness and lower administrative burden.
(The cost of LSF licensing is trivial if it allows you to avoid having
to hire an additional cluster admin -- people don't think about human
and administrative costs enough.) Recently though I think Platform has
gotten a bit of an attitude -- they used to be very responsive and eager
to deal with life science customers but now the sales force is getting
complacent and arrogant. If they don't shape up and rethink their
pricing and sales practices they are going to get their asses kicked by
GridEngine (which is free and improving at an incredible rate) over the
next few years.
If the DRM / "compute farm" approach is attractive to you and you have a
limited budget then consider evaluating Sun GridEngine
(http://gridengine.sunsource.net). It is a very solid product that is
making huge gains in the life sciences and is maturing at an amazing
rate. A year ago there were almost no gridengine based 'bioclusters' and
now they are popping up all over the place. There is a very active
gridengine mailing list with a great signal-to-noise ratio that you can
turn to for support and tuning issues.
Ok, that was option (1), now on to option (2), the commercial software offerings:
(1) TurboGenomics has a whole framework for distributed informatics
computation -- TurboBlast is just one of their targeted offerings. I
can't speak much about their stuff as I'm not that familiar.
(2) Blackstone Computing. Blackstone has a 'big' product called
PowerCloud that can pretty much encapsulate and expand upon what a
cluster DRM software layer can do. They also have a targeted suite of
modular tools including 'PowerBlast' that just sticks to doing one
thing fast. I used to be a Blackstone employee so I really should know
this stuff but I've been gone for almost a year and they have changed
lots of things. Regardless there are a few Blackstone people on this
list who can correct/expand if necessary.
There are probably others that I've missed. You can probably find them
in the archives of this list. Joe Landman I think recently posted a
survey of what was out and available.
Apple Xserve
============
Coincidentally enough we (bioteam.net) are building an Xserve blast farm
with Platform LSF as part of a day long seminar / hands on lab that will
be occurring on September 10th at Apple's Market Center in Boston. The
event is free, registration is still open and the full details and
invite can be found online at
http://bioteam.net/MacOSX/Xserve-Clustering-Event.html
My usual disclaimer: All of this is just my opinion; don't take anything
as gospel!
Regards,
Chris
--
Chris Dagdigian, <dag@sonsorol.org>
Bioteam.net - Independent Bio-IT & Informatics consulting
Office: 617-666-6454, Mobile: 617-877-5498, Fax: 425-699-0193
PGP KeyID: 83D4310E Yahoo IM: craffi Web: http://bioteam.net