On Thu, 4 Mar 2004, Susan Chacko wrote:

> Has anyone successfully built the human genome db with mpiformatdb? Is
> there some special gotcha because there are very few, very large
> sequences (25 sequences in 3 Gb)?

I hadn't tried it until today, but I did run into the same problem. It turns out that this is a bug in the NCBI Toolkit.

When mpiformatdb is asked to generate a 25 fragment database, it in turn asks NCBI formatdb to generate a database where each fragment is no larger than 123MB in size. The first entry in human_genome is larger than 123MB (Chromosome 1 is > 200Mbp). Rather than placing this first sequence in the first fragment, formatdb immediately creates a new fragment, which leaves the .00 fragment empty.

The fix for this bug is very simple. Change the line in readdb.c that says:

    if ((options->bases_in_volume && (fdbp->TotalLen + SequenceLen > options->bases_in_volume)) ||

to read:

    if ((options->bases_in_volume && (fdbp->TotalLen + SequenceLen > options->bases_in_volume) && fdbp->TotalLen > 0 ) ||

With the extra check, a new volume is started only if the current one already holds some sequence data, so an oversized first sequence goes into the .00 fragment.

In the Nov. 14 Toolbox release this line of code was in the function FDBAddSequence2(). It looks like the latest source code has moved this code into a new function called FDBCreateNewVolume().

This bug ought to get fixed in the primary Toolkit codebase. Is that something you can take care of?

On an unrelated note, I'll be putting e-value accuracy patches in our mpiBLAST CVS real soon. E-values for blastn have always been fairly accurate, and the patches improve accuracy for blastp and translated searches.

-Aaron
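
For anyone who wants to see the effect of the extra check without rebuilding the Toolkit, here is a minimal C sketch of the splitting logic described above. It is not the readdb.c code: needs_new_volume() and assign_volumes() are hypothetical names made up for illustration, the volume bookkeeping is reduced to a running total, and the sequence lengths in the note below are made-up numbers rather than real chromosome sizes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the volume-splitting test in readdb.c.
 *   total_len : bases already in the current volume (cf. fdbp->TotalLen)
 *   seq_len   : length of the sequence about to be added (cf. SequenceLen)
 *   max_bases : per-volume limit (cf. options->bases_in_volume), 0 = no limit
 *   patched   : nonzero applies the extra "total_len > 0" guard */
int needs_new_volume(long long total_len, long long seq_len,
                     long long max_bases, int patched)
{
    int over = max_bases && (total_len + seq_len > max_bases);
    if (patched)
        return over && total_len > 0;  /* never split off an empty volume */
    return over;
}

/* Assign each sequence to a volume index, in input order.
 * Writes the index for sequence i to volume_of[i] and returns the
 * number of volumes created, counting any that ended up empty. */
int assign_volumes(const long long *lens, size_t n, long long max_bases,
                   int patched, int *volume_of)
{
    assert(lens != NULL && volume_of != NULL);

    int vol = 0;          /* index of the current volume (.00, .01, ...) */
    long long total = 0;  /* bases written to the current volume so far */

    for (size_t i = 0; i < n; i++) {
        if (needs_new_volume(total, lens[i], max_bases, patched)) {
            vol++;        /* close the current volume, open the next one */
            total = 0;
        }
        volume_of[i] = vol;
        total += lens[i];
    }
    return vol + 1;
}
```

With illustrative lengths {245, 240, 50, 60} and a limit of 123, the unpatched test puts the first sequence in volume 1 and creates four volumes, leaving volume 0 empty. With the guard, the first sequence lands in volume 0 and three volumes are created.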