[Bioclusters] mpiformatdb problem

Aaron Darling bioclusters@bioinformatics.org
Thu, 4 Mar 2004 15:49:43 -0600 (CST)


On Thu, 4 Mar 2004, Susan Chacko wrote:

> Has anyone successfully built the human genome db with mpiformatdb? Is
> there some special gotcha because there are very few, very large
> sequences (25 sequences in 3 Gb)?

I hadn't tried it until today, but I ran into the same problem.  It
turns out that this is a bug in the NCBI Toolkit:

When mpiformatdb is asked to generate a 25-fragment database, it in turn
asks NCBI formatdb to generate a database in which each fragment is no
larger than 123MB.  The first entry in human_genome is larger than 123MB
(Chromosome 1 is > 200Mbp).  Rather than placing this first sequence in
the first fragment, formatdb immediately creates a new fragment, leaving
the .00 fragment empty.

The fix for this bug is very simple.  Change the line in readdb.c that
says:

      if ((options->bases_in_volume &&
           (fdbp->TotalLen + SequenceLen > options->bases_in_volume)) ||

to read:

      if ((options->bases_in_volume &&
           (fdbp->TotalLen + SequenceLen > options->bases_in_volume) &&
           fdbp->TotalLen > 0) ||


In the Nov. 14 Toolkit release, this line of code was in the function
FDBAddSequence2().  It looks like the latest source has moved it into a
new function called FDBCreateNewVolume().
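
To see the effect of the extra test in isolation, here is a minimal
sketch of the volume-splitting check.  It is not the Toolkit code:
count_empty_volumes(), its parameters, and the apply_fix switch are
hypothetical names made up for illustration, and total_len simply stands
in for fdbp->TotalLen.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the volume-splitting check; NOT the Toolkit code.
 * Walks the sequences in order and returns how many volumes get closed
 * while still empty.  apply_fix selects the patched condition. */
static int count_empty_volumes(const long *seq_len, int nseq,
                               long bases_in_volume, int apply_fix)
{
    long total_len = 0;  /* bases already in the current volume */
    int empty = 0;       /* volumes closed with nothing in them */
    int i;

    assert(seq_len != NULL && nseq >= 0);

    for (i = 0; i < nseq; i++) {
        /* Original test: would this sequence push the volume past its limit? */
        int overflow = bases_in_volume &&
                       (total_len + seq_len[i] > bases_in_volume);

        /* Patched test: only start a new volume if the current one
         * already holds something. */
        if (overflow && (!apply_fix || total_len > 0)) {
            if (total_len == 0)
                empty++;        /* closing a volume that never got a sequence */
            total_len = 0;      /* start the next volume */
        }
        total_len += seq_len[i];
    }
    return empty;
}
```

With made-up lengths of 245, 243, 199, 50 and 60 (think Mbp) and a limit
of 123, the original condition yields one empty volume (the first one),
and the patched condition yields none.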

This bug ought to get fixed in the primary Toolkit codebase.  Is that
something you can take care of?

On an unrelated note, I'll be putting e-value accuracy patches into our
mpiBLAST CVS real soon.  E-values for blastn have always been fairly
accurate; the patches improve accuracy for blastp and translated
searches.

-Aaron