[Bioclusters] About the parallel Blast on PVM
chris dagdigian
bioclusters@bioinformatics.org
Tue, 20 Aug 2002 11:11:08 -0400
Many biologists fear customized code or site-specific Blast solutions
because they want to be 100% sure that the statistics and alignments
they get back from a search will be 100% comparable to what a vanilla
wu-blast or ncbi-blast search would return. Anything that does not
return exactly the same results, scores, p-values and alignments as a
standard command-line search will likely cause unease and questions
about the reproducibility of the work.
The "best" Blast servers for biologists that I have seen do not try to
reinvent the wheel with whizzy new implementations of standard heuristic
algorithms, especially when blast is (a) embarrassingly parallel anyway
and (b) performs amazingly well on dual-CPU AMD or Intel machines.
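To make the "embarrassingly parallel" point concrete, here is a minimal
sketch (mine, not from any existing tool) that deals the records of a
multi-FASTA query file into N smaller query files. Each chunk can then be
searched by an unmodified blast binary against the full database. The
function names and the round-robin policy are assumptions chosen for
illustration only.

```python
from typing import Iterator, List, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_queries(text: str, n_chunks: int) -> List[str]:
    """Deal query records round-robin into at most n_chunks FASTA strings."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    buckets: List[List[str]] = [[] for _ in range(n_chunks)]
    for i, (header, seq) in enumerate(read_fasta(text)):
        buckets[i % n_chunks].append(f">{header}\n{seq}\n")
    # Drop empty buckets when there are fewer records than chunks.
    return ["".join(b) for b in buckets if b]
```

Because only the query set is split and every chunk is searched against
the same complete database, each query should get the same report it
would get from a single serial run.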
Running blast on a Sun, SGI or HPaq server is a waste of money. It is
far better to use 'big' machines for jobs that require massive memory or
SMP while using your 'cheap' linux cluster to soak up the load from
embarrassingly parallel stuff like blast etc. Such approaches also
extend the usable lifespan of your big iron machines -- you don't need
to replace them as often if you can dump much of your computational load
on to a compute farm made up of essentially disposable linux boxes.
People with seven-figure Alpha or SGI machines love to hear news like this.
This is how I would configure a blast service for biologists today:
o dual CPU machines (Athlon or Pentium III)
o at least 2GB RAM per node, more if price is reasonable
o at least 2 large IDE disks on separate PCI channels for use with linux
software RAID0
o fastest ethernet topology I could afford
o fastest NAS fileserver I could afford for staging a couple terabytes
worth of blast databases
o Sun Grid Engine or Platform LSF doing the scheduling, job execution &
resource allocation
The nodes would run standard wu-blast or ncbi-blast, and large jobs would
be controlled by a batch scheduler / distributed resource management
system such as Platform LSF (commercial & expensive but really good) or
Sun Grid Engine (freely available, solid product).
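As a rough sketch of what handing the work to the scheduler could look
like, the snippet below builds a Grid Engine array-job script in which
every task runs stock ncbi-blast (blastall) on one query chunk. The paths,
the chunk.N.fa naming scheme, the job name and both function names are
hypothetical; check the qsub options against your own Grid Engine version
before relying on them.

```python
import shlex
from typing import List


def blast_array_script(program: str, database: str,
                       chunk_dir: str, out_dir: str) -> str:
    """Return the text of a Grid Engine array-job script.

    Task N reads chunk_dir/chunk.N.fa and writes out_dir/chunk.N.out
    (a hypothetical naming scheme). $SGE_TASK_ID is filled in by Grid
    Engine for each task of the array job.
    """
    cmd = " ".join([
        "blastall",
        "-p", shlex.quote(program),    # e.g. blastp or blastn
        "-d", shlex.quote(database),   # database path prefix
        "-i", shlex.quote(chunk_dir) + "/chunk.$SGE_TASK_ID.fa",
        "-o", shlex.quote(out_dir) + "/chunk.$SGE_TASK_ID.out",
    ])
    # Lines starting with "#$" are read by qsub as embedded options.
    return "#!/bin/sh\n#$ -cwd\n#$ -N blast_chunks\n" + cmd + "\n"


def qsub_command(script_path: str, n_chunks: int) -> List[str]:
    """Return the argv that would submit the script as tasks 1..n_chunks."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    return ["qsub", "-t", f"1-{n_chunks}", script_path]
```

Nothing here calls the scheduler; the functions only produce the script
text and the argument list, so you can inspect both before submitting.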
Such a system would be performance bottlenecked at the I/O level,
particularly if the blast databases are sitting on the NAS fileserver.
By using dual ATA drives in your compute nodes with linux software RAID0
you can (a) cache blast databases to local disk and (b) achieve
sustained data read rates exceeding 90 MB/second, which is faster than
what you can typically do with NFS over gigabit ethernet or a direct
Fibre Channel connection to a SAN volume.
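The local caching step could be sketched roughly as follows: before a job
runs, copy the database files from the NAS mount to the node's local RAID0
scratch space if they are missing or out of date, then point blast at the
local copy. The function name, the directory layout and the size/mtime
freshness check are my assumptions, not something a particular site is
known to use.

```python
import os
import shutil


def cache_database(nas_dir: str, local_dir: str, db_name: str) -> str:
    """Copy db_name.* from nas_dir to local_dir when missing or stale.

    Returns the local database path prefix to hand to blast.
    """
    os.makedirs(local_dir, exist_ok=True)
    for fname in sorted(os.listdir(nas_dir)):
        if not fname.startswith(db_name + "."):
            continue  # not part of this database
        src = os.path.join(nas_dir, fname)
        dst = os.path.join(local_dir, fname)
        if not os.path.isfile(src):
            continue
        src_stat = os.stat(src)
        # Treat the local copy as fresh if size matches and it is not older.
        fresh = (
            os.path.exists(dst)
            and os.path.getsize(dst) == src_stat.st_size
            and int(os.path.getmtime(dst)) >= int(src_stat.st_mtime)
        )
        if not fresh:
            tmp = dst + ".part"
            shutil.copy2(src, tmp)  # copy2 also preserves the mtime
            os.replace(tmp, dst)    # rename so readers never see a partial file
    return os.path.join(local_dir, db_name)
```

A job wrapper might call cache_database("/nas/blastdb", "/scratch/blastdb",
"nr") (hypothetical paths) and pass the returned prefix to blast as the
database argument.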
my $.02
-chris
Mario Belluardo wrote:
>Dear Ognen,
>I'm trying to get the most out of our available hardware to provide a
>Blast server for our biologists. It seems that we could have a 32-CPU
>cluster.
>I'm interested in using or testing special code for this purpose, and
>I'm also interested in documentation on modifying the NCBI code.
>
>Thanks
>
>bioclusters-request@bioinformatics.org wrote:
>
>
>
>>There is a parallel version of Blast based on PVM written by a former
>>colleague of mine. It was written/tested on our 32-node beowulf cluster.
>>Instead of posting his email address online, if interested people can
>>email me and I will make sure they get in touch with him for sharing
>>experiences / results and possibly obtaining the code (I don't know what
>>licensing agreements he and our former employer have in place).
>>
>>Ognen
--
Chris Dagdigian, <dag@sonsorol.org>
Bioteam.net - Independent Bio-IT & Informatics consulting
Office: 617-666-6454, Mobile: 617-877-5498, Fax: 425-699-0193
PGP KeyID: 83D4310E Yahoo IM: craffi Web: http://bioteam.net