Sergio Ahumada N wrote:
> On Tuesday 28 January 2003 23:22, Tim Harsch wrote:
>
>>How is it that your results are not right? Do you mean to say that you
>>have two databases. A) a single FASTA for nt of roughly 6 gig formatted
>>with formatdb (no -v param) and B) the same fasta file formatted with
>>formatdb -v to split it into several pieces. And using the same query
>>sequence you get different results with A and B?
>
> Yes. I don't get the same results.
>
> Maybe is my problem .. so you can say me how join the results of B) when each
> output of the nt splitted database is processed by separately CPUs of the
> cluster.
>
> Greetings !
>
> --SAN

Hello,

Forgive me in advance if I am making a stupid and obvious point, ok?
There was not enough detail in your email to know for certain whether
I'm guessing correctly at what is going wrong with your searches...

It's actually only semi-complicated to segment blast databases,
multiplex your query against the segments and get back correct results.

Perhaps you are missing the easy fix?

--> Force ncbi-blast, via the "-z" command line switch, to use the full
size of the non-segmented database when calculating the scores of a
search done against the (smaller) segmented database.

This will cause your blast search results to come back with (hopefully)
the same values as if you had searched the full non-segmented database.

Detail:

Some of the HSP result scores calculated by a blast search are based on
the total effective size of the search space (your target database).
This is why you will get different results/scores if you search a
sequence against a full database and then repeat the same search against
a segment or smaller piece of the full database.

Solution:

Back in the old days of bioinformatics :) getting around this problem
used to be a giant pain in the ass and involved manually parsing out and
correcting all of the scores from your segmented blast results.
It was doable but the process was open to parsing and statistical errors
and was just Not Fun.

This problem went away about 2 years ago (or more, not sure) when NCBI
did 2 things to the blastall binary:

 o Added the "-z" option to explicitly override the effective length of
   the database (added in blast release 2.0.4)

 o Added an XML output option (added in blast release 2.1.2)

The "-z" switch is a big deal -- it enabled queries to be run against
segmented databases while still returning the results and scores that
one would expect when doing the same query against the full database.

XML output made the task of merging the multiple result files from a
search against N segments a bit easier. It was still not trivial, from
what I can recall: we saw that the merge process sometimes took more
time and more compute resources (i.e. memory) than the actual database
search.

--Chris
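
A minimal sketch of the segmented workflow described above, for
illustration only. It assumes the database volumes are named nt.00,
nt.01, ... and that the total length of the full database is already
known; the file names, the length value and the helper names
(segment_commands, merge_hits, Hit) are all hypothetical. Parsing the
XML result files into Hit records is left out. The blastall switches
used are -p, -d, -i, -o, -z and -m 7 (XML output).

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One parsed hit; how it is extracted from the XML is not shown."""
    subject_id: str
    evalue: float
    bit_score: float


def segment_commands(query: str, segments: List[str], full_db_length: int,
                     program: str = "blastn") -> List[List[str]]:
    """Build one blastall argv per segment, all sharing the same -z value."""
    return [
        ["blastall", "-p", program, "-d", seg, "-i", query,
         "-z", str(full_db_length),  # effective length of the FULL database
         "-m", "7",                  # XML output, easier to merge later
         "-o", f"{seg}.xml"]
        for seg in segments
    ]


def merge_hits(per_segment_hits: List[List[Hit]], limit: int = 500) -> List[Hit]:
    """Pool hits from every segment and rank them as a single result list.

    Ranking across segments is only meaningful because every search used
    the same -z value, which puts the E-values on a common scale.
    """
    merged = [hit for hits in per_segment_hits for hit in hits]
    # Best (smallest) E-value first; ties broken by higher bit score.
    merged.sort(key=lambda h: (h.evalue, -h.bit_score))
    return merged[:limit]


if __name__ == "__main__":
    # Hypothetical query file, segment names and database length.
    for argv in segment_commands("query.fa", ["nt.00", "nt.01"], 6_000_000_000):
        print(" ".join(argv))
```

Each printed command can be handed to a separate CPU of the cluster.
Once all segments have finished and their hits have been parsed,
merge_hits gives one ranked list. Holding every hit in memory like this
is the simple approach, and it is probably also why the merge step can
end up costing more memory than the search itself.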