segmenting blast databases (was Re: [Bioclusters] Details on a local blast cluster question)

Chris Dagdigian
Thu, 30 Jan 2003 11:16:36 -0500

Sergio Ahumada N wrote:
> On Tuesday 28 January 2003 23:22, Tim Harsch wrote:
>>How is it that your results are not right?  Do you mean to say that you
>>have two databases.  A) a single FASTA for nt of roughly 6 gig formatted
>>with formatdb (no -v param) and B) the same fasta file formatted with
>>formatdb -v to split it into several pieces.  And using the same query
>>sequence you get different results with A and B?
> Yes. I don't get the same results. 
> Maybe the problem is on my side, so could you tell me how to join the 
> results of B) when each piece of the split nt database is processed by a 
> separate CPU of the cluster?
> Greetings !
> --SAN


Forgive me in advance if I am making a stupid and obvious point, ok? 
There was not enough detail in your email to know for certain if I'm 
correctly guessing what is going wrong with your searches...

It's actually only semi-complicated to segment blast databases, 
multiplex your query against the segments and get back correct results.

Perhaps you are missing the easy fix: use the "-z" command line switch 
to force ncbi-blast to use the full size of the non-segmented database 
when calculating the scores of a search done against the (smaller) 
segmented database. Your blast search results should then (hopefully) 
come back with the same values as if you had searched the full 
non-segmented database.
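
A rough sketch of what that looks like in practice: the Python below only 
builds the blastall command lines, one per segment, each carrying the same 
"-z" value. The segment names (nt.00, nt.01), the query file name and the 
database length are made-up placeholders; substitute the "total letters" 
figure for your own full database.

```python
def blastall_segment_commands(query, segments, full_db_letters, program="blastn"):
    """Build one blastall command line per database segment.

    Nothing is executed here; hand each list to subprocess.run() (or your
    scheduler) on whichever cluster node holds that segment.
    """
    commands = []
    for segment in segments:
        commands.append([
            "blastall",
            "-p", program,               # blast program to run
            "-d", segment,               # the (smaller) database segment
            "-i", query,                 # query FASTA file
            "-o", f"{segment}.xml",      # one result file per segment
            "-m", "7",                   # XML output, easier to merge later
            "-z", str(full_db_letters),  # score as if searching the full database
        ])
    return commands


# Placeholder value, not the real size of nt.
FULL_DB_LETTERS = 6_000_000_000

for cmd in blastall_segment_commands("query.fa", ["nt.00", "nt.01"], FULL_DB_LETTERS):
    print(" ".join(cmd))
```

The important part is that every segment search gets the identical "-z" 
value; the rest of the options are just one plausible way to set things up.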


Some of the values reported for each HSP by a blast search, most notably 
the E-value, are based on the total effective size of the search space 
(which depends on the size of your target database).

This is why you will get different results/scores if you search a 
sequence against a full database and then repeat the same search against 
a segment or smaller piece of the full database.
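
To make that concrete: in the Karlin-Altschul statistics behind blast, the 
expected number of chance hits is roughly E = K * m * n * exp(-lambda * S), 
with m the query length and n the database length. The sketch below is a 
simplification that ignores the edge-effect corrections blast applies to m 
and n, and the K, lambda, score and length values are illustrative 
placeholders, not the ones your search will use.

```python
import math


def expect_value(raw_score, query_len, db_len, k=0.7, lam=1.3):
    """Simplified Karlin-Altschul E-value: E = K * m * n * exp(-lambda * S).

    k and lam are placeholder parameters chosen for illustration only.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


full_db = 6_000_000_000   # letters in the full database (made up)
segment = full_db // 10   # one of ten equal segments

e_full = expect_value(raw_score=40, query_len=500, db_len=full_db)
e_segment = expect_value(raw_score=40, query_len=500, db_len=segment)

# The same alignment gets a smaller E-value against the segment purely
# because the search space shrank; the ratio tracks the database sizes.
print(e_full / e_segment)
```

Since E scales linearly with n in this model, telling blast the full 
database length is enough to bring the segment E-values back in line.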


Back in the old days of bioinformatics :) getting around this problem 
used to be a giant pain in the ass and involved manually parsing out and 
correcting all of the scores from your segmented blast results. It was 
doable but the process was open to parsing and statistical errors and 
was just Not Fun.

This problem went away about 2 years ago (or more, not sure) when NCBI 
did 2 things to the blastall binary:

  o Added the "-z" option to explicitly override the effective length 
    of the database (added in blast release 2.0.4)

  o Added an XML output option (added in blast release 2.1.2)

The "-z" switch is a big deal -- it enabled queries to be run against 
segmented databases while still returning the results and scores that 
one would expect when doing the same query against the full database.

XML output made the task of merging the multiple result files from a 
search against N segments a bit easier. It was still not trivial, from 
what I can recall: we saw that the merge step sometimes took more time 
and more compute resources (i.e. memory) than the actual database search.
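
For what it's worth, a minimal sketch of such a merge in Python, assuming 
every segment was searched with the same "-z" value so the E-values are 
comparable. It keeps only the best HSP per hit and re-ranks the pooled 
hits. The function names and the tiny fake reports are hypothetical, made 
up for illustration; the element names (Hit, Hit_id, Hsp, Hsp_evalue, 
Hsp_bit-score) are the ones used in NCBI blast XML output.

```python
import xml.etree.ElementTree as ET


def read_hits(xml_text):
    """Return (evalue, bit_score, hit_id) for the best HSP of every hit."""
    hits = []
    for hit in ET.fromstring(xml_text).iter("Hit"):
        hsps = [
            (float(h.findtext("Hsp_evalue")), float(h.findtext("Hsp_bit-score")))
            for h in hit.iter("Hsp")
        ]
        if hsps:
            # Best HSP: lowest E-value, ties broken by highest bit score.
            evalue, bits = min(hsps, key=lambda h: (h[0], -h[1]))
            hits.append((evalue, bits, hit.findtext("Hit_id")))
    return hits


def merge_segment_results(xml_texts, max_hits=500):
    """Pool hits from all segments and rank by E-value, then bit score."""
    pooled = [hit for text in xml_texts for hit in read_hits(text)]
    pooled.sort(key=lambda h: (h[0], -h[1]))
    return pooled[:max_hits]


def fake_report(hits):
    """Tiny stand-in for a blast XML report; real files carry far more fields."""
    body = "".join(
        f"<Hit><Hit_id>{hit_id}</Hit_id><Hit_hsps><Hsp>"
        f"<Hsp_bit-score>{bits}</Hsp_bit-score><Hsp_evalue>{evalue}</Hsp_evalue>"
        f"</Hsp></Hit_hsps></Hit>"
        for hit_id, bits, evalue in hits
    )
    return f"<BlastOutput>{body}</BlastOutput>"


# Invented hits standing in for the output of two segment searches.
seg0 = fake_report([("seqA", 80.5, 1e-20), ("seqB", 40.1, 1e-5)])
seg1 = fake_report([("seqC", 60.2, 1e-12)])

for evalue, bits, hit_id in merge_segment_results([seg0, seg1]):
    print(hit_id, bits, evalue)
```

Note that this reads every report fully into memory, which is exactly the 
cost mentioned above; for large result sets a streaming parser such as 
ET.iterparse would be the first thing to try.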