[Bioclusters] Clusters for bioinformatics... Some numbers or statistics?

chris dagdigian dag@sonsorol.org
Thu, 30 Aug 2001 14:21:06 -0400

I concur with what Jim says.

In general I'm not a big fan of the network-of-workstation ("NOW") 
approach despite the fact that most desktop workstations are (a) very 
powerful and (b) idle a significant portion of the time. We can largely 
thank Microsoft and their bloatware for this gift to the sciences. The 
standard corporate desktop machines that folks are rolling out to support 
Windows 2000 and XP are amazingly powerful: typically 900 MHz Pentium 
III systems (or faster) with 384 MB of memory and a 30 GB IDE drive. When 
you take that class of system and multiply it by the hundreds, thousands, 
or tens of thousands of desktops that a lab or enterprise may have, the 
NOW or seti-at-home distributed computing approach becomes attractive.

My problems with the NOW approach have little to do with raw power and 
more to do with (a) unpredictable available CPU power, (b) a non-trivial 
administrative burden, and (c) having to trust & run over a public 
network or intranet. It may work nicely in a lab, department, or 
workgroup but can quickly get hairy in a building, campus, or enterprise.

This is why:

o _many_ life science applications are rate-limited by I/O throughput, 
and the way they get their data is via the network. This means that the 
performance of your NOW system is going to depend on the speed and 
uptime of the regular internal intranet. All it takes is your IT group's 
backup server kicking off, or a couple of porn-downloadin', 
net-radio-listenin' people, to trash your network performance. Bad 
network performance can do much more than slow a system down; it can 
cause jobs & data to disappear, among other nastiness.

o Workstation owners cannot be trusted [:)] They reboot their machines, 
start burning CD-ROMs, install new software, and so on. What this means 
is that over the long haul the uptime and available CPU cycles of each 
machine are pretty unpredictable. You are unable to handle, or even plan 
for, peak demand periods.

o Non-trivial administrative burden: you end up having to install and 
manage lots of client/server installations on machines that you may not 
have control over, or even physical access to. With a cluster you can 
enforce uniform configuration control and automate things to the point 
where each node is pretty much disposable & interchangeable.
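The first two problems above can be put into rough numbers. Here is a 
minimal back-of-envelope sketch in Python; every figure in it (database 
size, link speed, contention share, availability) is a hypothetical 
illustration for scaling purposes, not a measurement:

```python
# Back-of-envelope model of NOW pain points. All numbers below are
# hypothetical illustrations, not benchmarks.

def transfer_time_s(data_mb, link_mbit_s, contention_share=1.0):
    """Seconds to pull `data_mb` of input data over a link that delivers
    `link_mbit_s` megabits/s, of which this job gets `contention_share`."""
    effective_mbit_s = link_mbit_s * contention_share
    return (data_mb * 8) / effective_mbit_s

# Pulling a 500 MB sequence database over a quiet 100 Mbit intranet:
quiet = transfer_time_s(500, 100)                       # 40 seconds
# The same pull while backups and streaming traffic leave the job
# only 10% of the link:
busy = transfer_time_s(500, 100, contention_share=0.1)  # 400 seconds

# Unpredictable uptime: if each desktop is usable for batch work with
# independent probability p, the *expected* usable machine count is
# N * p -- but you cannot plan around *which* machines those will be
# during a peak demand period.
def expected_usable(n_machines, p_usable):
    return n_machines * p_usable

print(round(quiet), round(busy), expected_usable(1000, 0.6))
```

The point of the model is that the I/O term is entirely outside your 
control on a shared intranet, whereas on a dedicated cluster's private 
subnet the contention share stays close to 1.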

My personal preference would be to build a cluster of dedicated servers 
or workstations that are all subnetted onto a fast private network. Your 
administrative burden will be lower, and you will have a good handle on 
system status and overall available CPU horsepower. The money you spend 
will be made back in time, performance, and deployment effort.

(side notes...)

One company doing the NOW thing in life sciences that I've heard of is 
TurboGenomics. They have TurboBlast available and are apparently 
porting that system into a more general application framework. I had an 
interesting interaction with a TurboGenomics employee at the Drug 
Discovery Conference a few weeks ago; his first words were "Blackstone? 
I'm not allowed to talk to you." Heh. Very nice people, though, despite 
being competitors.

If you are interested in the seti-at-home / distributed.net approach, 
then "peer-2-peer" computing companies are a dime a dozen (well, maybe 
cheaper, since most have crashed and burned). Entropia, at least, seems 
to be sending out some interesting press releases in this area.