[Bioclusters] mpiBLAST statistics
Aaron Darling
darling at cs.wisc.edu
Wed Jan 5 13:16:36 EST 2005
Malay wrote:
> In a recent post I mentioned that "pre-splitting" database screws up
> BLAST statistics. Aaron Darling pointed out the mpiBLAST version 1.3.0
> gets the statistics just right. I apologise for my ignorance. But I am
> curious though how they do it. Can anyone point me to any information?
>
I guess I would be the most qualified person to answer that :)
A BLAST e-value is the number of alignments scoring at least as well as
a given alignment that would be expected by chance between a database
and a query of particular lengths. Rather than use raw sequence lengths,
BLAST calculates effective sequence lengths, which are adjusted to
account for edge effects. Karlin and Altschul have a few PNAS papers
describing the statistics behind edge effects. To calculate accurate
e-values, the effective query length and the effective length of the
whole database need to be used, not the length of a single fragment.
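
To make the role of the effective lengths concrete, here is a rough sketch of the textbook Karlin-Altschul calculation. It is not the NCBI Toolbox code: it uses the simple one-step length correction l = ln(K*m*n)/H rather than whatever refinement the Toolbox applies, and the function names and the K, lambda and H values are assumptions chosen only for illustration.

```python
import math

# Illustrative Karlin-Altschul parameters (assumed values, of the kind
# reported for gapped protein searches; not read from any real run).
K = 0.041
LAMBDA = 0.267
H = 0.14


def effective_lengths(query_len, db_len, num_seqs, k=K, h=H):
    """Return (effective query length, effective database length).

    Uses the simple one-step edge-effect correction
    l = ln(K * m * n) / H, subtracted once from the query and once
    per database sequence. Results are clamped to at least 1.
    """
    ell = math.log(k * query_len * db_len) / h
    m_eff = max(query_len - ell, 1.0)
    n_eff = max(db_len - num_seqs * ell, 1.0)
    return m_eff, n_eff


def evalue(raw_score, m_eff, n_eff, k=K, lam=LAMBDA):
    """Expected number of chance alignments with score >= raw_score."""
    return k * m_eff * n_eff * math.exp(-lam * raw_score)


if __name__ == "__main__":
    # Hypothetical search: 300-residue query against 100M residues
    # spread over 250,000 database sequences.
    m_eff, n_eff = effective_lengths(300, 100_000_000, 250_000)
    print(m_eff, n_eff, evalue(120, m_eff, n_eff))
```

Because the e-value scales with the product of the two effective lengths, any change in the database length that the calculation sees changes every reported e-value.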
Immediately after startup, the rank 0 mpiblast process uses the NCBI
Toolbox code to calculate the effective query and database lengths for
each query. It then tree-broadcasts these values to all other mpiblast
processes. During the search, each worker uses these effective query
and database lengths, rather than lengths derived from its own database
fragment, to calculate the e-values of the hits it reports.
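
The scheme described above can be mimicked in a few lines without any real MPI. The sketch below only simulates the idea: one function plays rank 0 and computes the effective lengths for the whole database, and a "worker" then scores a hit either with those broadcast values or, naively, with lengths from its own fragment. All names, the fragment sizes and the statistical parameters are hypothetical, and the length correction is the simple one-step formula, not the Toolbox's.

```python
import math

# Assumed Karlin-Altschul parameters, for illustration only.
K, LAMBDA, H = 0.041, 0.267, 0.14


def effective_lengths(query_len, db_len, num_seqs):
    """One-step edge-effect correction: l = ln(K * m * n) / H."""
    ell = math.log(K * query_len * db_len) / H
    return max(query_len - ell, 1.0), max(db_len - num_seqs * ell, 1.0)


def evalue(raw_score, m_eff, n_eff):
    """Expected number of chance alignments with score >= raw_score."""
    return K * m_eff * n_eff * math.exp(-LAMBDA * raw_score)


def rank0_effective_lengths(query_len, fragments):
    """What "rank 0" would compute once, before broadcasting.

    fragments is a list of (residues, sequences) pairs, one per
    database fragment; the totals describe the whole database.
    """
    db_len = sum(residues for residues, _ in fragments)
    num_seqs = sum(seqs for _, seqs in fragments)
    return effective_lengths(query_len, db_len, num_seqs)


# Hypothetical database pre-split into four equal fragments.
fragments = [(25_000_000, 62_500)] * 4
query_len = 300
score = 120

# Values every worker would receive from the broadcast.
broadcast = rank0_effective_lengths(query_len, fragments)

# A worker that only looks at its own fragment gets different lengths.
local = effective_lengths(query_len, *fragments[0])

naive_evalue = evalue(score, *local)
corrected_evalue = evalue(score, *broadcast)

if __name__ == "__main__":
    print("fragment-only e-value:", naive_evalue)
    print("whole-database e-value:", corrected_evalue)
```

In this toy setup the fragment-only e-value comes out smaller than the whole-database one, so the same hit looks more significant than it should.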
If you're interested in the gory details of the code I'll refer you to
the small NCBI Toolbox patch included with mpiBLAST. The patch allows
mpiblast to extract the effective query and db lengths and, later, to
set them during the search process. It's called ncbi_Oct2004_evalue.patch
-Aaron