[Bioclusters] mpiformatdb problem
Aaron Darling
bioclusters@bioinformatics.org
Thu, 4 Mar 2004 15:49:43 -0600 (CST)
On Thu, 4 Mar 2004, Susan Chacko wrote:
> Has anyone successfully built the human genome db with mpiformatdb? Is
> there some special gotcha because there are very few, very large
> sequences (25 sequences in 3 Gb)?
I hadn't tried it until today, but I ran into the same problem. It
turns out to be a bug in the NCBI Toolkit:
When mpiformatdb is asked to generate a 25-fragment database, it in turn
asks NCBI formatdb to generate a database in which each fragment is no
larger than 123MB. The first entry in human_genome is larger than 123MB
(Chromosome 1 is > 200Mbp). The size check fires even though the current
fragment is still empty, so rather than placing this first sequence in
the first fragment, formatdb immediately creates a new fragment, resulting
in an empty .00 fragment.
The fix for this bug is very simple. Change the line in readdb.c that
says:
    if ((options->bases_in_volume && (fdbp->TotalLen + SequenceLen >
        options->bases_in_volume)) ||
to read
    if ((options->bases_in_volume && (fdbp->TotalLen + SequenceLen >
        options->bases_in_volume) && fdbp->TotalLen > 0 ) ||
In the Nov. 14 Toolbox release this line of code was in the function
FDBAddSequence2(). It looks like the latest source code has moved this
code into a new function called FDBCreateNewVolume().
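To make the effect of the extra condition concrete, here is a minimal sketch of the split logic as a toy model. It is not the Toolkit's code: the function split_into_volumes(), its parameters, and the counts[] array are hypothetical names made up for illustration, and only the shape of the size test is taken from the readdb.c line quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the volume-split test; NOT the Toolkit's code.
 * seq_len[]       lengths of the sequences, in input order
 * bases_in_volume size limit per volume (0 disables splitting)
 * patched         nonzero adds the "current volume is non-empty" guard
 * counts[]        out: sequences per volume; needs n_seqs + 1 entries
 * Returns the number of volumes produced, counting empty ones. */
size_t split_into_volumes(const long long *seq_len, size_t n_seqs,
                          long long bases_in_volume, int patched,
                          size_t *counts)
{
    size_t vol = 0;          /* index of the volume being filled */
    long long total_len = 0; /* bases already in that volume */
    size_t i;

    assert(seq_len != NULL && counts != NULL);
    for (i = 0; i <= n_seqs; i++)
        counts[i] = 0;

    for (i = 0; i < n_seqs; i++) {
        /* Same shape as the original test: would this sequence
         * push the current volume past the limit? */
        int over = bases_in_volume &&
                   (total_len + seq_len[i] > bases_in_volume);

        /* The proposed guard: never split away from an empty volume. */
        if (patched)
            over = over && total_len > 0;

        if (over) {
            vol++;
            total_len = 0;
        }
        counts[vol]++;
        total_len += seq_len[i];
    }
    return vol + 1;
}
```

With made-up lengths of 200, 150, 100, 50 and 60 (think Mbp) and a limit of 123, the unpatched test yields five volumes with nothing in volume 0, while the patched test yields four volumes with the oversized first sequence sitting in volume 0. An oversized sequence still gets a volume to itself either way; the guard only stops the empty one from being created.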
This bug ought to get fixed in the primary Toolkit codebase. Is that
something you can take care of?
On an unrelated note, I'll be putting e-value accuracy patches in our
mpiBLAST CVS real soon. E-values for blastn have always been fairly
accurate, and the patches improve accuracy for blastp and translated
searches.
-Aaron