[Bioclusters] About the parallel Blast on PVM

chris dagdigian bioclusters@bioinformatics.org
Tue, 20 Aug 2002 11:11:08 -0400

Many biologists fear customized code or site-specific Blast solutions 
because they want to be 100% sure that the statistics and alignments 
they get back from a search will be 100% comparable to what a vanilla 
wu-blast or ncbi-blast search would return. Anything that does not 
return exactly the same results, scores, p-values and alignments as a 
standard command-line search will likely cause uneasiness and questions 
about the reproducibility of the work.

The "best" Blast servers for biologists that I have seen do not try to 
reinvent the wheel with whizzy new implementations of standard heuristic 
algorithms, especially when blast is (a) embarrassingly parallel anyway 
and (b) performs amazingly well on dual-CPU AMD or Intel boxes.
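To make the "embarrassingly parallel" point concrete, here is a minimal sketch (file names and chunk count are my assumptions, not from the post): a multi-FASTA query file can be split into chunks, each chunk searched independently with stock wu-blast or ncbi-blast, and the per-chunk reports concatenated afterwards -- no modified blast code involved.

```shell
# Tiny demo input; in practice this would be your real query file.
printf '>q1\nMKV\n>q2\nMAL\n>q3\nMST\n>q4\nMGG\n' > queries.fa

CHUNKS=2
# Round-robin the FASTA records across chunk files; the counter advances
# on each '>' header line so records stay intact.
awk -v n="$CHUNKS" '/^>/{c=(cnt++)%n} {print > ("chunk." c ".fa")}' queries.fa

# Each chunk is then a normal, unmodified blast job, e.g.:
#   blastall -p blastp -d nr -i chunk.0.fa -o chunk.0.out
ls chunk.*.fa
```

Since the scoring code is untouched, each chunk's statistics are exactly what a vanilla command-line search would produce.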

Running blast on a Sun, SGI or HPaq server is a waste of money. It is 
far better to use 'big' machines for jobs that require massive memory or 
SMP while using your 'cheap' linux cluster to soak up the load from 
embarrassingly parallel stuff like blast. Such approaches also 
extend the usable lifespan of your big iron machines -- you don't need 
to replace them as often if you can dump much of your computational load 
on to a compute farm made up of essentially disposable linux boxes. 
People with seven-figure Alpha or SGI machines love to hear news like this.

This is how I would configure a blast service for biologists today:

o dual-CPU machines (Athlon or Pentium III)
o at least 2GB RAM per node, more if price is reasonable
o at least 2 large IDE disks on separate PCI channels for use with linux 
software RAID0
o fastest ethernet topology I could afford
o fastest NAS fileserver I could afford for staging a couple terabytes 
worth of blast databases
o Sun GridEngine or Platform LSF doing the scheduling, job execution &  
resource allocation
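The RAID0 item above would be a per-node build step. A hedged sketch of what that script might look like (device names are assumptions; it is written to a file rather than executed here, since it needs root and real hardware):

```shell
cat > mkscratch.sh <<'EOF'
#!/bin/sh
# Stripe two IDE drives on separate channels (assumed /dev/hda3, /dev/hdc3)
# into a software RAID0 set for local scratch space.
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/hda3 /dev/hdc3
mke2fs /dev/md0
mkdir -p /scratch
mount /dev/md0 /scratch
EOF
chmod +x mkscratch.sh
```

Putting the two drives on separate PCI/ATA channels is what lets the stripe actually double the read bandwidth.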

The nodes would run standard wu-blast or ncbi-blast and large jobs would 
be controlled by a batch-scheduler / distributed resource management 
system such as Platform LSF (commercial & expensive but really good) or 
Sun Gridengine (freely available, solid product).
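As a rough illustration of the scheduler-driven approach, here is what a per-chunk Sun GridEngine job script might look like; database path, chunk naming and task count are my assumptions:

```shell
# Write a job script that picks its input chunk from the array-task ID.
cat > blast_chunk.sh <<'EOF'
#!/bin/sh
#$ -N blast_chunk
#$ -cwd
# SGE_TASK_ID is set by GridEngine when submitted as an array job;
# task 1 gets chunk.0.fa, task 2 gets chunk.1.fa, and so on.
exec blastall -p blastp -d /scratch/blastdb/nr \
     -i chunk.$((SGE_TASK_ID - 1)).fa \
     -o chunk.$((SGE_TASK_ID - 1)).out
EOF
chmod +x blast_chunk.sh

# Submit one task per chunk (here 4) as an array job:
#   qsub -t 1-4 blast_chunk.sh
```

The scheduler handles node selection and failure/retry; the blast binary itself stays completely stock.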

Such a system would be performance-bottlenecked at the I/O level, 
particularly if the blast databases are sitting on the NAS fileserver. 
By using dual ATA drives in your compute nodes with linux software RAID0 
you can (a) cache blast databases to local disk and (b) achieve 
sustained data read rates exceeding 90 MB/second, which is faster than 
what you can typically do with NFS over gigabit ethernet or a direct 
Fibre Channel connection to a SAN volume.
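The caching step itself is simple; a minimal sketch, assuming illustrative paths (the directory names and the `nr.phr` stand-in file are mine, not from the post):

```shell
NAS_DB=./nas_blastdb        # stand-in for the NFS-mounted master copy
LOCAL_DB=./scratch_blastdb  # stand-in for the node's RAID0 scratch area

# Demo stand-in for a formatted blast database file on the NAS:
mkdir -p "$NAS_DB"
printf 'demo' > "$NAS_DB/nr.phr"

# Copy the databases down to local disk; in production you would use
# "rsync -a" so only files changed since the last refresh are transferred.
mkdir -p "$LOCAL_DB"
cp -R "$NAS_DB/." "$LOCAL_DB/"

# blast then reads off the local stripe instead of NFS, e.g.:
#   blastall -p blastp -d "$LOCAL_DB/nr" -i query.fa
```

Run once per database update rather than per job, the cache refresh cost is amortized across every search on the node.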

my $.02


Mario Belluardo wrote:

>Dear Ognen,
>I'm trying to get the best out of our hardware to offer a
>Blast server to biology scientists. It seems we could have 32 CPUs.
>I'm interested in using or testing special code for this purpose; I'm
>also interested in documentation on modifying the NCBI code.
>bioclusters-request@bioinformatics.org wrote:
>>There is a parallel version of Blast based on PVM written by a former
>>colleague of mine. It was written/tested on our 32-node beowulf cluster.
>>Instead of posting his email address online, if interested people can
>>email me and I will make sure they get in touch with him for sharing
>>experiences / results and possibly obtaining the code (I dont know what
>>licensing agreements he and our former employer have in place).

Chris Dagdigian, <dag@sonsorol.org>
Bioteam.net - Independent Bio-IT & Informatics consulting
Office: 617-666-6454, Mobile: 617-877-5498, Fax: 425-699-0193
PGP KeyID: 83D4310E Yahoo IM: craffi Web: http://bioteam.net