[Bioclusters] Re: Parallel Blast

John Smutko bioclusters@bioinformatics.org
Sun, 22 Sep 2002 14:24:37 -0400



>
>
> Although -- if you put 1 or 2 GB ramdisks in each of your cluster nodes
> and then set up a system for chunking blast databases into
> ramdisk-friendly sizes you could build a really fast blast farm. In
> that context the performance bottleneck would then become the time and
> resources needed to merge the XML output from N queries against split
> databases into a single result file. I've seen such systems in the past
> and merging the results could in some cases take longer than the actual
> search did.
>

Regarding XML output, this is absolutely correct.  The advantage of XML is
that all of the data you could possibly want from your BLAST search is
available, and you can parse out whichever pieces you're after.  The
disadvantage is that XML output is 2-3X larger than pairwise text and over an
order of magnitude larger than tabular (-m 8 in NCBI BLAST).  In a large
search (hundreds to thousands of queries vs. large databases), what are you
really looking for?  Are you going to eyeball all of the alignments?  For your
sake, I hope not.  Or are you just interested in which query hit which target
and how well?  If the latter, run tabular first, figure out which alignments
you're really interested in, and then rerun just those queries individually
when you need to see the alignment.  This eliminates much of the storage and
I/O load, which is what will slow you down.
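
As a rough illustration of the "run tabular first" step, here is a minimal
Python sketch that reads -m 8 output and keeps the best-scoring hit per query
below an E-value cutoff.  The 12-column order is the standard -m 8 layout; the
column labels, the best_hits name, the 1e-10 cutoff and the sample rows are
all made up for the example, so adjust them to your own data.

```python
import io

# Column order of NCBI BLAST tabular output (-m 8). The labels are chosen
# for this sketch; BLAST itself does not print a header line.
COLUMNS = ["query", "subject", "pct_identity", "aln_length", "mismatches",
           "gap_opens", "q_start", "q_end", "s_start", "s_end",
           "evalue", "bit_score"]


def best_hits(lines, max_evalue=1e-10):
    """Return {query: (subject, evalue, bit_score)} for the top hit per query.

    max_evalue is an arbitrary illustrative cutoff, not a recommendation.
    """
    best = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and any comment lines
        hit = dict(zip(COLUMNS, line.split("\t")))
        evalue = float(hit["evalue"])
        score = float(hit["bit_score"])
        if evalue > max_evalue:
            continue  # too weak to be worth a second, full-alignment run
        current = best.get(hit["query"])
        if current is None or score > current[2]:
            best[hit["query"]] = (hit["subject"], evalue, score)
    return best


# Hypothetical sample rows standing in for a real -m 8 result file.
sample = io.StringIO(
    "q1\tgi|111\t98.50\t200\t3\t0\t1\t200\t501\t700\t1e-100\t380\n"
    "q1\tgi|222\t85.00\t120\t18\t0\t10\t129\t40\t159\t2e-20\t95.6\n"
    "q2\tgi|333\t80.00\t50\t10\t0\t1\t50\t1\t50\t0.5\t30.1\n"
)
shortlist = best_hits(sample)
for query, (subject, evalue, score) in shortlist.items():
    print(query, subject, evalue, score)
```

The queries left in the shortlist are the only ones you would then rerun with
full alignment output.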

>Pentium IIIs are "old" if you listen to Intel :) They have a vested
>interest in moving people to the more expensive Pentium IV platform.
>While it is true that Intel will probably end-of-life them sometime
>soon they are still really good when it comes to price/performance
>ratios.
>
>Many of the large, production-grade and 'conservative' clusters and
>farms I've seen are built around PIII CPUs in the compute elements.
>They are rock solid stable and your choice of motherboards and products
>is still huge.  I've never heard of a PIII cluster falling over because
>of heat or flaky hardware or mainboard reliability problems. Your
>particular needs or benchmark results may point you towards a Pentium
>IV or AMD chip though so do your own testing...

A 1.4 GHz PIII processor can crank through 1 sequence vs the nt database
(blastn) in a bit under 8 seconds, if the entire database is already in
memory.  If this kind of performance is good enough, save money on the
processing side and spend it on a networking/software setup that will let you
keep the processors busy, rather than waiting for the data to arrive or the
results to be written.
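
To put that number in context, here is a back-of-the-envelope Python sketch of
the best-case wall-clock time for a batch of queries.  It takes 8 seconds per
query as an upper bound from the figure above and assumes perfect load
balancing with no I/O or network stalls; the function name and the query and
CPU counts are hypothetical.

```python
import math


def wall_clock_seconds(num_queries, num_cpus, seconds_per_query=8.0):
    """Idealized wall time: every CPU stays busy, database already in memory."""
    # Each CPU handles one query at a time, so the batch takes as many
    # rounds as the busiest CPU has queries.
    rounds = math.ceil(num_queries / num_cpus)
    return rounds * seconds_per_query


# Hypothetical farm sizes for a 1000-query job.
for cpus in (1, 16, 64):
    print(cpus, "CPUs:", wall_clock_seconds(1000, cpus), "seconds")
```

Any time the processors spend waiting on data or on writing results comes on
top of this estimate, which is why the money is better spent there.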

John Smutko
smutt235@attbi.com
"Enjoy yourself, it's later than you think..."

