Malay wrote:
> In a recent post I mentioned that "pre-splitting" the database screws up
> BLAST statistics. Aaron Darling pointed out that mpiBLAST version 1.3.0
> gets the statistics just right. I apologise for my ignorance. But I am
> curious how they do it. Can anyone point me to any information?
>

I guess I would be the most qualified person to answer that :)

BLAST e-values estimate how many alignments scoring at least as well as a
given hit you would expect to see by chance between a database and a query
of particular lengths. Rather than use raw sequence lengths, BLAST
calculates effective sequence lengths, which are adjusted to account for
edge effects. Karlin and Altschul have a few PNAS papers describing the
statistics behind edge effects.

To calculate accurate e-values, the effective query and database lengths
for the whole database need to be used, not the lengths of whichever
fragment a worker happens to be searching. Immediately after startup, the
rank 0 mpiblast process uses the NCBI Toolbox code to calculate the
effective query and database lengths for each query. It then
tree-broadcasts these values to all other mpiblast processes. During the
search, the workers use these effective query and database lengths to
calculate the e-values of the hits they report.

If you're interested in the gory details of the code, I'll refer you to the
small NCBI Toolbox patch included with mpiBLAST. The patch allows mpiblast
to extract the effective query and db lengths and, later, to set them
during the search. It's called ncbi_Oct2004_evalue.patch

-Aaron
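
To make the effective-length point concrete, here is a minimal Python sketch. It is not the mpiBLAST or NCBI Toolbox code. It assumes the standard Karlin-Altschul form E = K * m' * n' * exp(-lambda * S) and a simplified one-step length adjustment ln(K * m * n) / H, whereas the real toolbox code is more involved. The function names, the parameter values (K, lambda, H) and the database sizes are all made up for illustration.

```python
import math


def effective_lengths(query_len, db_len, num_seqs, K, H):
    """Return (effective query length, effective database length).

    Assumption: a one-step length adjustment ln(K*m*n)/H, which is a
    simplification of what BLAST implementations actually compute.
    """
    adjustment = math.log(K * query_len * db_len) / H
    eff_query = max(query_len - adjustment, 1.0)
    eff_db = max(db_len - num_seqs * adjustment, 1.0)
    return eff_query, eff_db


def evalue(raw_score, eff_query, eff_db, K, lam):
    """Karlin-Altschul e-value for a raw alignment score."""
    return K * eff_query * eff_db * math.exp(-lam * raw_score)


def demo():
    # Illustrative statistical parameters, not taken from any specific run.
    K, lam, H = 0.041, 0.267, 0.14
    query_len, raw_score = 300, 80

    # Hypothetical whole database, split into four equal fragments.
    db_len, num_seqs, fragments = 1_000_000, 2_500, 4

    # What a "rank 0" process would compute once and send to every worker.
    global_q, global_db = effective_lengths(query_len, db_len, num_seqs, K, H)
    e_global = evalue(raw_score, global_q, global_db, K, lam)

    # What a worker would get if it only knew about its own fragment.
    frag_q, frag_db = effective_lengths(
        query_len, db_len // fragments, num_seqs // fragments, K, H
    )
    e_fragment = evalue(raw_score, frag_q, frag_db, K, lam)
    return e_global, e_fragment


if __name__ == "__main__":
    e_global, e_fragment = demo()
    print("e-value with whole-database lengths:", e_global)
    print("e-value with fragment-only lengths: ", e_fragment)
```

With fragment-only lengths the same hit gets a smaller e-value than it would in an unsplit search, which is the distortion that sharing the whole-database effective lengths avoids.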