[Bioclusters] Call for information.
Chris Dagdigian
bioclusters@bioinformatics.org
Tue, 16 Apr 2002 09:13:22 -0400
Some of the justifications will depend on your audience: are they IT
people who control your hardware budget, or are they senior scientists
who need to approve new research computing directives?
Here are some of the justifications I've used for both types of groups:
(1) Bioclusters preserve any current investment you may have in big
expensive unix SMP machines by significantly reducing the computational
load on your legacy hardware. Basically you use the large memory and big
SMP systems for things like EST clustering and data warehouses that need
such environments and you offload everything else you can onto piles of
cheap mass market hardware. I know several companies who were able to
postpone or actually cancel plans to upgrade or replace large Sun, Alpha
and SGI machines because they were able to extend the useful server life
by migrating load to the far cheaper cluster or compute farm. Not having
to replace or upgrade one of those large systems can save hundreds of
thousands or even millions of dollars in capital expense.
(2) Fine grained scaling on demand. In a biocluster it is trivial to add
additional CPUs. As long as your architecture is correct you can
incrementally scale easily and cheaply from tens of CPUs to hundreds or
thousands. Compare and contrast this to the problem of upgrading a large
unix machine. That 64-CPU enterprise unix system may be great but what
happens when you need that 65th CPU? It may require purchase of a whole
new cabinet and expensive interconnects just to get that next processor
fired up. The other nice thing about scaling with bioclusters is that it
is easy to take advantage of newer and faster hardware. Load management
layers like LSF, PBS etc. can trivially handle heterogeneous hardware
environments so it is not a problem to have your cluster composed of
different machine classes. This allows you to efficiently purchase the
fastest available commodity CPU power each year with little waste. Plus
if you work the proper magic with the load management software layer
your end users will never know or have to understand the back end. All
they know is that their jobs get done.
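To illustrate the heterogeneous-hardware point, here is a toy model of a queue handing independent jobs to whichever node frees up first. It is only a sketch and not how LSF or PBS work internally; the function name `dispatch`, the job costs and the node speed factors are all made up for illustration.

```python
import heapq

def dispatch(job_costs, node_speeds):
    """Toy queue: each job goes to the node that becomes free first.

    job_costs   -- work per job, in arbitrary units (hypothetical values)
    node_speeds -- relative speed of each node (hypothetical values)
    Returns (time when the last job finishes, number of jobs run per node).
    """
    # Min-heap of (time the node becomes free, node index).
    free_at = [(0.0, i) for i in range(len(node_speeds))]
    heapq.heapify(free_at)
    jobs_per_node = [0] * len(node_speeds)
    makespan = 0.0
    for cost in job_costs:
        t, i = heapq.heappop(free_at)
        done = t + cost / node_speeds[i]  # faster nodes finish sooner
        jobs_per_node[i] += 1
        makespan = max(makespan, done)
        heapq.heappush(free_at, (done, i))
    return makespan, jobs_per_node

if __name__ == "__main__":
    # One older node (speed 1.0) and one newer node (speed 2.0).
    print(dispatch([1.0] * 6, [1.0, 2.0]))
```

In this model the newer node simply picks up more of the jobs, and the person submitting them never has to know which machine ran what.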
(3) For high throughput embarrassingly parallel situations like massive
BLAST & hmmsearch searches etc. a biocluster will blow away any
enterprise unix system you can think of. As a concrete example of this
when I was at Blackstone Computing we were able to build a
proof-of-concept dedicated Blast farm with $30,000 USD worth of
commodity hardware.
That $30,000 demo blast farm was tested by the customer (a large pharma
company) and was found to be significantly faster than the $300,000 +
unix system they were currently using. The system was so fast
(throughput, not turnaround) the customer was able to perform
calculations and experiments that had not been possible before due to
time and horsepower constraints.
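The price/performance argument behind that example can be written down as simple arithmetic. The two dollar figures below are the ones quoted above; the speedup is NOT a measured number (the only claim is "significantly faster"), so it is left as a parameter, and the function name is hypothetical.

```python
def throughput_per_dollar_advantage(speedup, cluster_cost, unix_cost):
    """How many times more throughput per dollar the cluster delivers.

    speedup      -- cluster throughput / unix system throughput (assumed input)
    cluster_cost -- purchase price of the cluster
    unix_cost    -- purchase price of the unix system
    """
    if speedup <= 0 or cluster_cost <= 0 or unix_cost <= 0:
        raise ValueError("all inputs must be positive")
    return speedup * (unix_cost / cluster_cost)

if __name__ == "__main__":
    # Costs from the example above; speedup=1.0 is the conservative
    # "merely as fast" case, not a reported measurement.
    print(throughput_per_dollar_advantage(1.0, 30_000, 300_000))
```

Even if the farm had only matched the unix system, the cost ratio alone gives a tenfold advantage per dollar; any real speedup multiplies that further.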
This (#3) is the primary reason that I see people building bioclusters.
They know that they have a huge requirement to run lots of conveniently
embarrassingly parallel applications in a high throughput mode. As it
turns out a loosely coupled cluster or compute farm tends to be a really
nice and effective platform for doing this. Many of the first
"bioclusters" were actually dedicated BLAST, genescan, hmm etc.
resources although these days they are being used for more.
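What makes these workloads "embarrassingly parallel" is that each query sequence can be searched independently of the others. A minimal sketch of the usual trick, splitting a multi-sequence FASTA query into chunks that could each be submitted as a separate job, is below. The function names and the round-robin scheme are my own illustration, not taken from any particular BLAST farm.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records, header, seq = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        records.append((header, "".join(seq)))
    return records

def split_round_robin(records, n_chunks):
    """Deal records into at most n_chunks lists; empty chunks are dropped."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks = [[] for _ in range(n_chunks)]
    for i, rec in enumerate(records):
        chunks[i % n_chunks].append(rec)
    return [c for c in chunks if c]

if __name__ == "__main__":
    demo = ">q1\nACGT\n>q2\nGGCC\nTTAA\n>q3\nAAAA\n"
    for n, chunk in enumerate(split_round_robin(read_fasta(demo), 2)):
        print(n, [h for h, _ in chunk])
```

Each chunk would then go to the queue as its own job, and the results are concatenated afterwards; no job needs to talk to any other.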
(4) Linux on commercial mass market hardware is _incredibly_ powerful
from a price/performance standpoint. The Intel/AMD cpus are amazing. If
you have a software application or algorithm that runs well under Linux
and you need to run lots of them then a cluster is a great choice.
(5) What it comes down to is that leveraging piles of inexpensive
commodity hardware is the only cost effective way that life science
researchers can really get the flexible "supercomputer scale" CPU power
they need to perform their work.
(6) A hell of a lot of bioinformatics software is now being developed
primarily for, or ported to, linux-on-i386.
I do have some links that may be useful, particularly Matthew Trunnel's
article in Scientific Computing World, but I don't have the URLs handy
and I need to run out to a meeting :) I'll follow up with URLs when I
get back.
Anyone else with comments?
-Chris
Paul Gardner wrote:
> Hi All,
>
> I have to give a talk on thurs 12pm (NZST) that justifies the expense of
> purchasing 128 PentiumIVs for a BioCluster at our weekly Research Group
> meeting.
>
> I already know a bit about using the MPI compiler and PBS queuing system.
> What I'm really interested in is the solutions BioClusters are currently
> being used for. Any URLs, papers, and/or suggestions would be greatly
> appreciated.
>
--
Chris Dagdigian, <dag@sonsorol.org>
Independent life science IT & research computing consulting
Office: 617-666-6454, Mobile: 617-877-5498, Fax: 425-699-0193
Work: http://BioTeam.net PGP KeyID: 83D4310E Yahoo IM: craffi