[Bioclusters] database segmentation

Karl Podesta bioclusters@bioinformatics.org
Fri, 27 Feb 2004 14:39:57 +0000


On Fri, Feb 27, 2004 at 10:30:39AM +0000, Dan Bolser wrote:
> Hi, 
> 
> Regarding the previous discussion on blast parallelization, I would like
> to know more about segmentation.
> 
> Can anyone give me a reference to this topic? If I split my target
> database over n machines, doesn't that mean I have to run my query n
> times?
> 
> Cheers,
> Dan.

It does, but the n searches run in parallel, so the wall-clock time is
roughly that of a single query against one reduced-size database fragment.
You could also wrap the query in a shell script, so that you only need to
launch a single script rather than n separate queries.
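
As a rough illustration of such a wrapper, here is a minimal Python sketch
(a shell script would do equally well). It assumes the legacy NCBI blastall
binary is installed on every node, that the nodes are reachable over ssh,
and that the query file sits on a shared filesystem. The host names,
fragment paths, and all function names are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def fragment_command(host, fragment, query, program="blastp"):
    """Build the command for one fragment search on one (hypothetical) host."""
    # Legacy blastall flags: -p program, -d database, -i query file,
    # -m 8 tabular output (12 tab-separated columns, bit score last).
    return ["ssh", host, "blastall", "-p", program,
            "-d", fragment, "-i", query, "-m", "8"]


def run_command(cmd):
    """Run one search and return its standard output as text."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def merge_hits(outputs):
    """Combine tabular outputs and sort hits by bit score, best first."""
    hits = [line.split("\t")
            for out in outputs
            for line in out.splitlines() if line.strip()]
    hits.sort(key=lambda fields: float(fields[11]), reverse=True)
    return hits


def search_fragments(query, jobs, run=run_command):
    """Run the same query against every (host, fragment) pair in parallel."""
    cmds = [fragment_command(host, frag, query) for host, frag in jobs]
    with ThreadPoolExecutor(max_workers=max(1, len(cmds))) as pool:
        outputs = list(pool.map(run, cmds))
    return merge_hits(outputs)
```

A call might look like search_fragments("query.fa", [("node1", "/data/db.00"),
("node2", "/data/db.01")]). One caveat: E-values depend on database size, so
hits from separate fragments are only directly comparable if each search is
told the size of the full database (blastall has a -z option for the
effective database length); the sketch above sorts on bit score instead.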

Generally there are 3 levels of segmentation used for BLAST: by the query
itself, by the database, or by the number of queries. Most implementations
use one of these; as far as I know they are rarely, if ever, combined.
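
To make two of these levels concrete, here is a small Python sketch. It
reads "by the number of queries" as dealing a multi-FASTA query file out
into n batches, and "by the query" as cutting one long query sequence into
overlapping pieces; the second reading is an assumption. The function names
and the round-robin and overlap schemes are illustrative only, not taken
from any of the tools cited below. Database segmentation is normally done
with the database formatting tools and is not shown.

```python
def parse_fasta(text):
    """Return FASTA records as strings, each ending in a newline."""
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            records.append([line])        # header starts a new record
        elif line.strip() and records:
            records[-1].append(line)      # sequence line of current record
    return ["\n".join(rec) + "\n" for rec in records]


def split_by_query_count(text, n):
    """Deal the queries round-robin into at most n non-empty batches."""
    batches = [[] for _ in range(n)]
    for i, rec in enumerate(parse_fasta(text)):
        batches[i % n].append(rec)
    return ["".join(batch) for batch in batches if batch]


def segment_query(seq, size, overlap):
    """Cut one sequence into (offset, piece) windows that overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    pieces = []
    for start in range(0, len(seq), step):
        pieces.append((start, seq[start:start + size]))
        if start + size >= len(seq):      # last window reached the end
            break
    return pieces
```

Each batch from split_by_query_count can be written to its own file and
searched against the whole database on a different node. The offsets
returned by segment_query would be needed afterwards to map hit coordinates
back onto the original query.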

General theory:
R.C. Braun, K.T. Pedretti, T.L. Casavant, T.E. Scheetz, C.L. Birkett, C.A.
Roberts. "Parallelization of local BLAST service on workstation clusters".
Future Generation Computer Systems, 2001, vol. 17, pp 745-754. 

Segmenting the number of queries:
R. Clifford and A.J. Mackey. "Disperse: a simple and efficient approach to
parallel database searching". Bioinformatics, 2000, vol. 16, no. 6, pp
564-565. 

Segmenting the database:
D.R. Mathog. "Parallel BLAST on split databases". Bioinformatics, 2003,
vol. 19, no. 14, pp 1865-1866. 

A.E. Darling, L. Carey, W. Feng. "The Design, Implementation, and
Evaluation of mpiBLAST". ClusterWorld Conference & Expo and the 4th
International Conference on Linux Clusters: The HPC Revolution 2003. 

Kp
--
Karl Podesta
Dublin City University, Ireland 
National Institute for Cellular Biotechnology, Ireland