On 23 Oct 2005, at 7:46 pm, Gary Van Domselaar wrote:

> Hey Gang,
>
> I'm designing a system for automatic prokaryotic genome
> annotation. The system will need to annotate (typically several
> thousand) coding regions, in part by BLASTing multiple reference
> databases, like COGs, UNIPROT, NCBI nr, etc. I'm wondering about the
> most efficient way to do this using my Xserve cluster and
> mpi-blast. I'm cool with prestaging the mpi-blast-formatted databases
> onto the compute nodes, and my intuition tells me it would be best
> to blast the set of coding regions against one reference database
> at a time, i.e. blast all coding regions against COGs, then again
> against UNIPROT, etc. That way the reference databases can stay
> resident in RAM for the entire blast run against the genome coding
> regions. Does this sound right? Will this actually happen?

You may find you get better throughput by just running
single-threaded blast jobs, but I don't know. The latter approach is
used by the Ensembl raw compute pipeline (which is open-source
software, so if you want to follow their approach, you can just grab
the source with CVS from cvs.sanger.ac.uk). We don't use MPI blast at
all.

I'd have thought that with your approach, the more input sequences
you use at once, the better, although you may find it requires too
much memory at the result collation stage, since you'll presumably
have a very large number of hits.

Tim

-- 
Dr Tim Cutts
Informatics Systems Group, Wellcome Trust Sanger Institute
GPG: 1024D/E3134233 FE3D 6C73 BBD6 726A A3F5 860B 3CDD 3F56 E313 4233
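
For what it's worth, the "one reference database at a time" ordering discussed in this thread can be sketched roughly as below. This is only an illustration of the job ordering, not anything taken from mpi-blast or the Ensembl pipeline: the function names (`chunked`, `plan_jobs`), the query IDs and the chunk size are all made up for the example, and the code says nothing about whether a database really stays cached in RAM on a given node.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def plan_jobs(databases, query_ids, chunk_size):
    """Return (database, query_batch) jobs in database-major order.

    All batches for the first database come before any batch for the
    second, so consecutive jobs search the same database.
    Hypothetical helper: the names and the chunking are assumptions.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    query_ids = list(query_ids)  # allow generators; we iterate once per database
    jobs = []
    for db in databases:  # outer loop: one database at a time
        for batch in chunked(query_ids, chunk_size):
            jobs.append((db, batch))
    return jobs


if __name__ == "__main__":
    dbs = ["COGs", "UNIPROT", "nr"]
    queries = [f"cds_{i:04d}" for i in range(1, 6)]  # placeholder IDs
    for db, batch in plan_jobs(dbs, queries, chunk_size=2):
        print(db, batch)
```

A larger `chunk_size` means more input sequences per job, which is the trade-off mentioned above: fewer passes over each database, but more hits to hold at the result collation stage.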