[Bioclusters] requesting design advice on grid-optimized genome annotation system

Mon Oct 24 04:59:48 EDT 2005

On 23 Oct 2005, at 7:46 pm, Gary Van Domselaar wrote:

> Hey Gang,
>
> I'm designing a system for automatic prokaryotic genome  
> annotation.  The system will need to annotate (typically several  
> thousand) coding regions, in part by BLASTing multiple reference  
> databases, like COGs, UNIPROT, NCBI nr, etc.  I'm wondering about the
> most efficient way to do this using my Xserve cluster and
> mpi-blast.  I'm cool with prestaging the mpi-blast-formatted databases
> onto the compute nodes, and my intuition tells me it would be best
> to blast the set of coding regions against one reference database
> at a time, i.e. blast all coding regions against COGs, then again
> against UNIPROT, etc.  That way the reference databases can stay  
> resident in RAM for the entire blast run against the genome coding  
> regions.  Does this sound right?  Will this actually happen?

You may find you get better throughput by just running single-threaded
blast jobs, but I don't know.  That single-threaded approach is used
by the Ensembl raw compute pipeline (which is open-source software,
so if you want to follow their approach, you can just grab the source
with CVS from cvs.sanger.ac.uk).  We don't use MPI blast at all.
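
As a rough illustration of the single-threaded route combined with the
one-database-at-a-time ordering from the question, here is a minimal
Python sketch. It is not how the Ensembl pipeline does it. The chunk
file names, the database names "cogs" and "uniprot", and the choice of
blastp (which assumes protein queries) are all placeholders; the
command lines use the legacy NCBI blastall options -p, -d, -i, -o,
-m 8 (tabular output) and -a 1 (one processor).

```python
from itertools import islice


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def chunked(records, size):
    """Group records into lists of at most `size` items."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def plan_jobs(n_chunks, databases):
    """Build blastall command lines in database-major order.

    Every chunk is searched against one database before moving on to the
    next, so a node keeps reading the same database for a whole pass.
    File and database names here are hypothetical.
    """
    jobs = []
    for db in databases:
        for i in range(n_chunks):
            query = f"chunk_{i:04d}.fa"
            out = f"chunk_{i:04d}.{db}.tsv"
            jobs.append(["blastall", "-p", "blastp", "-d", db,
                         "-i", query, "-o", out, "-m", "8", "-a", "1"])
    return jobs


# Toy input standing in for a file of predicted coding regions.
records = list(read_fasta([">orf1", "MKT", "AAL", ">orf2", "MSV", ">orf3", "MGG"]))
chunks = list(chunked(records, 2))
jobs = plan_jobs(len(chunks), ["cogs", "uniprot"])
```

Each entry in jobs is an argument list that could be handed to
whatever batch scheduler runs on the cluster. The chunk size is the
knob that trades per-job startup overhead against output size.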

I'd have thought that with your approach, the more input sequences  
you use at once, the better, although you may find it requires too  
much memory at the result collation stage - you'll presumably have a  
very large number of hits.
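
One way to keep the collation stage from growing with the total number
of hits is to stream the results and retain only the best few hits per
query. The sketch below assumes tab-separated tabular output as written
by blastall -m 8 (query id in column 1, subject id in column 2, bit
score in column 12); the function name top_hits and the keep parameter
are made up for the example.

```python
import heapq
from collections import defaultdict


def top_hits(lines, keep=5):
    """Return the `keep` highest-scoring hits per query from tabular lines.

    Memory use is bounded by keep * (number of queries), not by the total
    number of hits, because lower-scoring hits are discarded as they arrive.
    """
    best = defaultdict(list)  # query id -> min-heap of (bitscore, subject id)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        query, subject, bitscore = cols[0], cols[1], float(cols[11])
        heap = best[query]
        if len(heap) < keep:
            heapq.heappush(heap, (bitscore, subject))
        elif bitscore > heap[0][0]:
            # New hit beats the weakest retained hit, so swap it in.
            heapq.heapreplace(heap, (bitscore, subject))
    # Report each query's hits from best to worst.
    return {q: sorted(h, reverse=True) for q, h in best.items()}


# Toy rows standing in for a real result file.
rows = [
    "orf1\tsubjA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
    "orf1\tsubjB\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-40\t150",
    "orf1\tsubjC\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-60\t250",
    "orf2\tsubjA\t70.0\t100\t30\t0\t1\t100\t1\t100\t1e-20\t90",
]
result = top_hits(rows, keep=2)
```

Passing an open file object as lines works the same way, since the
function only iterates over it once.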

Tim

-- 
Dr Tim Cutts
Informatics Systems Group, Wellcome Trust Sanger Institute
GPG: 1024D/E3134233 FE3D 6C73 BBD6 726A A3F5  860B 3CDD 3F56 E313 4233