[Bioclusters] Parallel blast

Joe Landman bioclusters@bioinformatics.org
07 Jun 2002 08:58:20 -0400


There are several commercial parallel BLASTs out there:

1) Blackstone's PowerBLAST (part of PowerCloud),

2) TurboGenomics' TurboBLAST (more of a grid-like BLAST than a cluster
BLAST),

3) Paracel's BLAST Machine,

4) (mentioned with trepidation, as some on this list know) my MSC.LIFE
code with BLAST.

I do not know of any non-commercial parallel BLASTs.

Of these codes, the first two return results in XML, which makes parsing
real easy: recombining the results is simply a concatenation or a commit
into an XML database.  PowerCloud uses a policy-centric approach that
lets you allocate cluster resources according to "business logic" (i.e.
rules about who gets what).  If Glen Otero is around, maybe he could
describe this somewhat better.

I do not know precisely what Paracel's code does.

I built number 4.  And yes, it works quite well (with some caveats about
how to get it out of my employer now, due to changes in focus).  It
works by segmenting the input query sequences, optionally segmenting the
databases (this isn't always a performance win, though), distributing
the pieces to compute nodes, running the jobs, collecting the output,
and reassembling it in order.  My acid test has been to compare the MD5
sum of the output of a normal NCBI BLAST run with the MD5 sum of my
code's output.  If the sums are identical, then the files are identical
(down to whitespace).  This was one of my design goals.
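
To make the segment / distribute / reassemble / checksum idea concrete,
here is a minimal Python sketch.  It is not the MSC.LIFE code.  Every
name in it (read_records, split_fasta, fake_search, parallel_search,
md5_hex) is made up for illustration, and fake_search is only a
stand-in that emits one line per query; it does not call BLAST.  Threads
on one machine stand in for compute nodes.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def read_records(fasta_text):
    """Return FASTA records (header line plus sequence lines) in input order."""
    records = []
    for line in fasta_text.splitlines(keepends=True):
        # A line starting with ">" opens a new record.
        if line.startswith(">") or not records:
            records.append("")
        records[-1] += line
    return records


def split_fasta(fasta_text, n_chunks):
    """Split FASTA text into contiguous chunks without breaking any record."""
    records = read_records(fasta_text)
    n_chunks = max(1, min(n_chunks, len(records)))
    size, extra = divmod(len(records), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append("".join(records[start:end]))
        start = end
    return chunks


def fake_search(fasta_chunk):
    """Hypothetical stand-in for one search job: one output line per query."""
    lines = []
    for record in read_records(fasta_chunk):
        header, _, body = record.partition("\n")
        residues = body.replace("\n", "")
        lines.append(f"{header[1:].strip()}\tlength={len(residues)}\n")
    return "".join(lines)


def parallel_search(fasta_text, n_workers):
    """Run fake_search on each chunk concurrently and reassemble in order."""
    chunks = split_fasta(fasta_text, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Executor.map yields results in input order, not completion order.
        parts = list(pool.map(fake_search, chunks))
    return "".join(parts)


def md5_hex(text):
    """MD5 digest of the output, used for the serial vs parallel comparison."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()
```

The acid test then amounts to checking that
md5_hex(parallel_search(queries, 4)) equals md5_hex(fake_search(queries)).
A real BLAST report has more structure than this toy output, so plain
concatenation of per-chunk reports is an assumption of the sketch, not a
claim about how any of the codes above reassemble their results.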

I am getting about 60x speedup on 64 CPUs for 28k tomato clone ESTs
(about 1000 bp average length) for blastx vs nr from July 2001.  My
pathological case (i.e. worst case) was something Ivo Grosse suggested:
Chr21 vs pufferfish, where I was getting about 8x speedup on 16
CPUs.
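
For comparison, those two figures can be turned into parallel
efficiency (speedup divided by CPU count): roughly 0.94 for the EST run
and 0.5 for the worst case.  A small Python sketch of the arithmetic,
with the function name parallel_efficiency chosen here for illustration:

```python
def parallel_efficiency(speedup, cpus):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / cpus


# Figures quoted above; both are approximate ("about").
runs = [
    ("tomato ESTs, blastx vs nr", 60, 64),
    ("Chr21 vs pufferfish", 8, 16),
]
for label, speedup, cpus in runs:
    print(f"{label}: {parallel_efficiency(speedup, cpus):.0%} of linear")
```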

The problem is getting it out of my employer now.  Email me offline if
you want to understand this better.  This is in large part why I am
looking for a new employer.   

Now, MPI or PVM BLAST might not work terribly well, as MPI and PVM were
designed to solve somewhat different problems.  This would be the
subject of a very long post.  Such a solution would not be fault
tolerant (as PVM and MPI are not currently fault tolerant, though good
work has been done at LLNL on making MPI work this way).

If someone knows of other parallel BLASTs, please let me know.  

On Fri, 2002-06-07 at 07:30, Wim Glassee wrote:
> Hi everyone,
> 
> I just went through the archives for this list, very interesting topics!
> 
> One question I've been trying to answer for a long time now is this:
> 
> Is there or is there not a parallel version of blast available
> somewhere?
> 
> I've noticed some people cut their databases and query sequences to
> smaller pieces, with or without overlap, and perform separate blasts.
> But how do you put them back together again? And are the results the
> same?
> 
> Wim
> 
> 