[BiO BB] Can you explain these results?

Thu Apr 26 12:05:09 EDT 2007

> 
> Hi all,
> 
> Please take a look at these tests. They were run on a serial machine,
> but the results matter for clusters too.
> What do you think about them?
> 
> Here are the tests, using BLASTP:
> 
> Query length -- Hits -- Time
> 10000        -- 3    -- 8min51sec
> 10000        -- 500  -- 7min41sec
> ----------------------------------
> Query length -- Hits -- Time
> 9000         -- 2    -- 7min11sec
> 9000         -- 500  -- 6min49sec
> ----------------------------------
> Query length -- Hits -- Time
> 3000         -- 3    -- 2min54sec
> 3000         -- 500  -- 2min52sec
> 
> These times are very strange. You can see that the sequence of 10000
> bases with 3 hits takes more time than another sequence of 10000 bases
> with 500 hits. So I conclude that BLASTP, unlike BLASTN, is not
> sensitive to similarity.
> 
> But most importantly, why did this happen? Can anybody explain these
> results?
> 

Daniel

Without knowing the parameters and data you used, I can only offer a
couple of comments. The BLAST heuristic operates in several steps.
First, the query and the targets are broken up into tokens based on the
word size you select. The default size is 11 for nucleotides and 3 for
proteins (I think--check the man page). The targets are then searched
for tokens that match tokens in the query. The matches are used to seed
alignments, which are then extended with a Smith-Waterman-style dynamic
programming step. The bottom line
is that how long a given run takes is a complicated function of the
parameters and the data. I recall there was a paper in Bioinformatics
that calculated the computational complexity, but I can't lay my hands
on it at the moment. In your numbers the run time is dominated by query
length, i.e., by the time it takes to search for matches. The smaller
differences between runs with the same query length depend on how long
the hits were.
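
To make the tokenizing and seeding steps concrete, here is a minimal
Python sketch. It is only an illustration, not how NCBI BLAST is
implemented: the function names index_words and find_seeds and the toy
sequences are made up for this example, and it looks for exact word
matches only, whereas blastp also seeds on similar "neighborhood" words
that score above a threshold.

```python
from collections import defaultdict


def index_words(seq, w):
    """Map every length-w word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - w + 1):
        index[seq[i:i + w]].append(i)
    return index


def find_seeds(query, target, w=3):
    """Return (query_pos, target_pos) pairs where a word matches exactly.

    Simplification: real blastp also accepts similar (neighborhood) words,
    and the extension step that follows seeding is not shown here.
    """
    query_index = index_words(query, w)
    seeds = []
    # Scan the target one word at a time and look each word up in the
    # index built from the query.
    for j in range(len(target) - w + 1):
        for i in query_index.get(target[j:j + w], ()):
            seeds.append((i, j))
    return seeds


# Toy protein sequences, invented for the example.
query = "MKTAYIAKQR"
target = "GGMKTAYLLAKQRW"
seeds = find_seeds(query, target, w=3)
print(seeds)
```

Each seed is a (query position, target position) pair. A real search
would then try to extend every seed into a longer alignment, which is
where the length of the hits comes into the run time.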

There's a good description of BLAST in the book Bioinformatics by David
Mount. I would also recommend the book BLAST by Korf, Yandell, and
Bedell. It has a whole chapter on setting parameters for various types
of searches. (You should never just use the defaults unless, of course,
they are correct for the search you are doing ;-) ). Also, there are
several different implementations of BLAST (NCBI BLAST and WU-BLAST
being the two most popular, I think) that perform differently under
different circumstances. There is a parallelized version called mpiBLAST
and at least one hardware-accelerated version that I know of.

Good Luck

Mike