[Bioclusters] Clusters for bioinformatics... Some numbers or statistics?
chris dagdigian
dag@sonsorol.org
Thu, 30 Aug 2001 14:21:06 -0400
I concur with what Jim says:
In general I'm not a big fan of the network-of-workstations ("NOW")
approach, despite the fact that most desktop workstations are (a) very
powerful and (b) idle a significant portion of the time. We can largely
thank Microsoft and their bloatware for this gift to the sciences. The
standard corporate desktop machines that folks are rolling out to support
Windows 2000 and XP are amazingly powerful: typically 900 MHz Pentium
III systems (or faster) with 384 MB of memory and a 30 GB IDE drive. When
you take that class of system and multiply it by the hundreds, thousands
or tens of thousands of desktops that a lab or enterprise may have, the
NOW or SETI@home-style distributed computing approach becomes attractive.
My problems with the NOW approach have little to do with power and more
to do with (a) unpredictable available CPU power, (b) a non-trivial
administrative burden, and (c) having to trust & run over a public
network or intranet. It may work nicely in a lab, department or workgroup
but can quickly get hairy in a building, campus or enterprise.
This is why:
o _Many_ life science applications are rate-limited by I/O throughput,
and the way they get their data is via the network. This means that the
performance of your NOW system is going to be dependent on the speed and
uptime of the regular internal intranet. All it takes is your IT group's
backup server kicking off, or a couple of porn-downloadin',
net-radio-listenin' people, to trash your network performance. Bad
network performance can do much more than slow a system down; it can
cause jobs & data to disappear, and other nastiness.
o Workstation owners cannot be trusted [:)] They reboot their machines,
start burning CD-ROMs, install new software, etc. etc. What this means
is that over the long haul the uptime and available CPU cycles for each
machine are pretty unpredictable. You are unable to handle, or even plan
for, peak demand periods.
o Non-trivial administrative burden: you end up having to install and
manage lots of client/server installations on machines that you may not
have control over or even physical access to. With a cluster you can
enforce uniform configuration control and really automate things to the
point where each node is pretty much disposable & interchangeable (see
the sketch right after this list).
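
As a rough illustration of what "uniform configuration control" can
look like in practice, here is a minimal sketch that pushes one master
configuration tree out to every node and re-runs the same idempotent
setup step on each one. The node names, paths and the apply.sh script
are hypothetical placeholders; a real site would lean on its own
imaging or management tooling (kickstart images, rsync-over-ssh cron
jobs, etc.) rather than this exact script.

    #!/usr/bin/env python
    """Push one master config tree to every compute node and re-apply it.

    All hostnames, paths and commands below are hypothetical placeholders.
    """
    import subprocess

    # Hypothetical private-subnet node names; generate these however you like.
    NODES = ["node%02d" % i for i in range(1, 17)]

    MASTER_CONFIG = "/cluster/master-config/"       # canonical copy on the head node
    REMOTE_CONFIG = "/etc/cluster-config/"          # where each node keeps its copy
    POST_SYNC_CMD = "/etc/cluster-config/apply.sh"  # idempotent setup script (hypothetical)

    def push(node):
        """Mirror the master config to one node, then apply it there."""
        # rsync --delete keeps every node byte-identical to the master copy.
        subprocess.check_call(
            ["rsync", "-a", "--delete", MASTER_CONFIG,
             "%s:%s" % (node, REMOTE_CONFIG)])
        subprocess.check_call(["ssh", node, POST_SYNC_CMD])

    if __name__ == "__main__":
        failed = []
        for node in NODES:
            try:
                push(node)
            except subprocess.CalledProcessError:
                failed.append(node)
        # A node that fails is simply re-imaged or swapped out; nothing on it is unique.
        print("synced %d nodes, %d failed: %s"
              % (len(NODES) - len(failed), len(failed), ", ".join(failed)))

The point is less the particular commands than the fact that every node
gets exactly the same treatment, so any one of them can be wiped and
rebuilt without ceremony.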
My personal preference would be to build a cluster of dedicated servers
or workstations that are all subnetted on a fast private network. Your
administrative burden will be lower and you will have a good handle on
system status and overall available CPU horsepower. The money you spend
will be made back in saved time, better performance and less deployment
effort.
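
To give a concrete (if simplified) idea of what "a good handle on system
status and available horsepower" means on a dedicated cluster, here's a
small sketch that polls every node on the private subnet for its load
averages over ssh. Again, the hostnames are hypothetical, and a real
installation would more likely get this from its queuing system or a
monitoring package.

    #!/usr/bin/env python
    """Poll every node on the private cluster subnet for its load averages.

    Hostnames are hypothetical; substitute whatever your private subnet uses.
    """
    import subprocess

    NODES = ["node%02d" % i for i in range(1, 17)]

    def poll(node):
        """Return the node's 1/5/15-minute load averages, or None if unreachable."""
        try:
            out = subprocess.check_output(
                ["ssh", "-o", "ConnectTimeout=5", node, "cat /proc/loadavg"])
            return out.decode().split()[:3]
        except subprocess.CalledProcessError:
            return None

    if __name__ == "__main__":
        answering = 0
        for node in NODES:
            loads = poll(node)
            if loads is None:
                print("%-8s DOWN" % node)
            else:
                answering += 1
                print("%-8s load %s / %s / %s" % (node, loads[0], loads[1], loads[2]))
        print("%d of %d nodes answering" % (answering, len(NODES)))

Because the nodes are dedicated and sit on a private network, numbers
like these actually mean something; on a NOW made of other people's
desktops they change every time someone sits down at their machine.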
(side notes...)
One company I've heard of that is doing the NOW thing in life sciences
is TurboGenomics. They have TurboBlast available and are apparently
porting that system into a more general application framework. I had an
interesting interaction with a TurboGenomics employee at the Drug
Discovery Conference a few weeks ago; his first words were "Blackstone?
I'm not allowed to talk to you." heh. Very nice people, though, despite
being competitors.
If you are interested in the SETI@home / distributed.net approach,
then "peer-2-peer" computing companies are a dime a dozen (well, maybe
cheaper, since most have crashed and burned). Entropia seems to be
sending out some interesting press releases in this area, at least.
-chris