[Bioclusters] database segmentation
Karl Podesta
bioclusters@bioinformatics.org
Fri, 27 Feb 2004 14:39:57 +0000
On Fri, Feb 27, 2004 at 10:30:39AM +0000, Dan Bolser wrote:
> Hi,
>
> Regarding the previous discussion on blast parallelization, I would like
> to know more about segmentation.
>
> Can anyone give me a reference to this topic? If I split my target
> database over n machines, doesn't that mean I have to run my query n
> times?
>
> Cheers,
> Dan.
It does, but those n runs happen in parallel, so the wall-clock time is
roughly that of a single query against one (smaller) database fragment.
You could also wrap the query in a shell script, so that you only need to
launch a single script rather than n separate queries.
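
As a rough illustration of such a wrapper (in Python rather than shell),
the sketch below runs one query against every database fragment at once
and merges the hits by e-value. The assumptions are all mine: the fragment
names are hypothetical, the searches run as local processes (on a real
cluster each command would go to a different node via ssh or a batch
scheduler), and the command line is for the legacy NCBI blastall program
with tabular output (-m 8), so adjust it for your own installation.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(query_path, fragment):
    # Assumed: legacy NCBI blastall, protein search, tabular output (-m 8).
    return ["blastall", "-p", "blastp", "-d", fragment,
            "-i", query_path, "-m", "8"]


def run_fragment(query_path, fragment, runner=subprocess.run):
    # Run the query against one fragment and return its hit lines.
    result = runner(build_command(query_path, fragment),
                    capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


def search_all(query_path, fragments, runner=subprocess.run):
    # One worker per fragment, so all n searches run concurrently.
    with ThreadPoolExecutor(max_workers=len(fragments)) as pool:
        per_fragment = pool.map(
            lambda frag: run_fragment(query_path, frag, runner), fragments)
        hits = [line for lines in per_fragment for line in lines]
    # In -m 8 output the 11th tab-separated column is the e-value.
    return sorted(hits, key=lambda line: float(line.split("\t")[10]))


if __name__ == "__main__":
    # Hypothetical fragment names; substitute your own formatted databases.
    for hit in search_all("query.fa", ["nr.00", "nr.01", "nr.02"]):
        print(hit)
```

One caveat: each fragment search computes its e-values against the
fragment's own size, so the merged ranking is only comparable to a
whole-database search if the effective database size is set to the full
size on every run. I believe blastall's -z option is the usual way to do
that, but check your version's documentation.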
Generally there are 3 ways to segment a BLAST job: split the query
sequence itself, split the database, or split a batch of queries. Most
implementations use one of these; as far as I know they are rarely, if
ever, combined.
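
Of the three, splitting a batch of queries is the easiest to sketch: deal
the sequences in a multi-FASTA file out into n chunks, then search each
chunk against the full database on a different machine. This is only a
minimal illustration in Python; the function names are mine and it is not
taken from any of the papers below.

```python
def read_fasta(lines):
    # Group lines into records, each starting at a '>' header line.
    records, current = [], []
    for line in lines:
        if line.startswith(">") and current:
            records.append("".join(current))
            current = []
        if line.strip():
            current.append(line if line.endswith("\n") else line + "\n")
    if current:
        records.append("".join(current))
    return records


def split_queries(records, n_chunks):
    # Deal records round-robin so the chunks hold similar numbers of queries.
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return chunks


if __name__ == "__main__":
    # Hypothetical file names; each chunk would go to a different node.
    with open("queries.fa") as handle:
        records = read_fasta(handle)
    for i, chunk in enumerate(split_queries(records, 4)):
        with open("queries.%02d.fa" % i, "w") as out:
            out.writelines(chunk)
```

Round-robin keeps the chunk counts even, but not necessarily the run
times, since long sequences take longer to search. Balancing by total
sequence length per chunk would be a reasonable refinement.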
General theory:
R.C. Braun, K.T. Pedretti, T.L. Casavant, T.E. Scheetz, C.L. Birkett, C.A.
Roberts. "Parallelization of local BLAST service on workstation clusters".
Future Generation Computer Systems, 2001, vol. 17, pp 745-754.
Segmenting the number of queries:
R. Clifford and A.J. Mackey. "Disperse: a simple and efficient approach to
parallel database searching". Bioinformatics, 2000, vol. 16, no. 6, pp
564-565.
Segmenting the database:
D.R. Mathog. "Parallel BLAST on split databases". Bioinformatics, 2003,
vol. 19, no. 14, pp 1865-1866.
A.E. Darling, L. Carey, W. Feng. "The Design, Implementation, and
Evaluation of mpiBLAST". ClusterWorld Conference & Expo and the 4th
International Conference on Linux Clusters: The HPC Revolution 2003.
Kp
--
Karl Podesta
Dublin City University, Ireland
National Institute for Cellular Biotechnology, Ireland