[Bioclusters] Re: Help on BLAST

28 Aug 2002 09:05:11 -0400

On Wed, 2002-08-28 at 05:34, Wim Glassee wrote:

> Point taken. I take it you use this kind of merging yourself. Have you
> had a chance to think about my last question? About the differences in
> -actual- results, hits and hsps. I've had differences on several
> occasions.
> E.g. a hit residing in the middle of a 5k subquery of a 100k query,
> where blasting both gives different hsps in the smaller and the bigger
> blast. I'd really like a second opinion on this.

Local versus global alignment.  Smaller sequences might get different
alignments because they have fewer constraints.  In your case, the 5k
subquery lacks the other 95% of the information in the 100k query, so
its alignment is less constrained than that of the larger query.  You
can think of the results of the 5k query as sampling a solution region
of similar high-scoring alignments, or HSPs.  Sort of a Fermi
transition principle for alignment to a "cloud" of
closely-residing-in-parameter-space alignments.  Fewer constraints
allow BLAST to optimize its selection from this cloud in terms of the
supplied parameters.  That optimization is more constrained with the
larger query.

The important question is whether the differences are real and
biologically relevant, or artifacts of the model used for the
alignment.  I am not aware of any guide for deciding this; it is more
a matter of intuition.

> > E-values on individual hits vary more than an order of magnitude when I
> > update BLAST reports that are more than a year out of date against the
> > public repositories.  They also vary slightly depending on the bit-width
> > (32 vs. 64) of the architecture I'm using.
> 
> Very interesting. Could the first case you state have anything to do
> with the fact that the actual database is probably a lot bigger now than
> when you first did your blast, which would naturally give a different
> e-value.
> The last case seems unavoidable.

Yes, E-values are by definition database-size dependent, and are local
scores.  You can account for a change in database size by keeping track
of the database size, or you can look for scale-independent metrics.
There is a discussion of this on the BLAST course/tutorial pages (see
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html).
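
As a rough sketch of the size dependence: in the Karlin-Altschul
statistics that BLAST uses, the E-value for a normalized bit score S'
is approximately E = m * n * 2^(-S'), where m is the query length and n
is the database length.  The function names below are my own, for
illustration only, and the sketch assumes raw lengths; BLAST itself
uses edge-corrected "effective" lengths, so real reports will not match
this exactly.

```python
def evalue_from_bits(bit_score, query_len, db_len):
    """Approximate E-value from a bit score: E = m * n * 2**(-S').

    Assumption: raw lengths are used here, not BLAST's effective lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


def rescale_evalue(evalue, old_db_len, new_db_len):
    """Rescale an old E-value to a database of a different total length.

    For a fixed query and bit score, E is linear in the database length,
    so the correction is just the ratio of the two lengths.
    """
    return evalue * (new_db_len / old_db_len)


# The same hit (same bit score) against a database that has grown 10x:
old_e = evalue_from_bits(60.0, 1000, 1e9)
new_e = evalue_from_bits(60.0, 1000, 1e10)
print(old_e, new_e, rescale_evalue(old_e, 1e9, 1e10))
```

The bit score itself does not change when the database grows, which is
why it is one candidate for a scale-independent metric when comparing
reports generated at different times.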

The bit-width difference is an artifact of numerical roundoff
differences between CPUs, which result from how the compiler orders
floating-point operations.  You can reduce this to a degree by using
compiler options to enforce IEEE 754/854 compliance, at the cost of
some speed.
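
A minimal illustration of why operation order matters: floating-point
addition is not associative, so the same three numbers summed in a
different order can differ in the last bit.  This is a generic Python
example, not BLAST code; a compiler that reorders operations for speed
produces the same kind of discrepancy.

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # add a and b first, then c
right = a + (b + c)   # add b and c first, then a

print(repr(left), repr(right))
print("identical:", left == right)
print("difference:", abs(left - right))
```

The two sums differ by about one unit in the last place.  A difference
that small is harmless on its own, but it can accumulate or be
amplified over a long calculation, which is enough to shift a reported
E-value slightly.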

-- 
Joseph Landman, Ph.D
Scalable Informatics LLC
email: landman@scalableinformatics.com
  web: http://scalableinformatics.com
phone: +1 734 612 4615