[Bioclusters] free parallel versions of BLAST

Micha Bayer bioclusters@bioinformatics.org
27 Feb 2004 09:59:12 +0000


Thanks to Aaron, Jason and Dan for their help; this is very useful.

On a related note: does anyone know how the NCBI BLAST executable deals
with the query and the database in terms of memory? I have had a
discussion with a colleague of mine who claims that BLAST never loads
the database into memory at all but the query does get loaded into
memory. 

Is it possible to tell BLAST what to load into memory (availability
permitting obviously)?

cheers

Micha


On Thu, 2004-02-26 at 20:09, Aaron Darling wrote:
> Because parallel BLAST is such a common problem, numerous free/open-source
> implementations exist.  Obviously mpiBLAST won't work on a unix cluster
> without message passing, and it's unclear to me whether MPI and Condor will
> play nicely with each other on Windows (has anybody had success with this?).
> If there are other reasons mpiBLAST is unsuitable for you, I'd like to hear
> about them: the software is still being actively developed and we are open
> to suggestions for features!
> 
> As part of writing a grant for the mpiBLAST project I did some research on
> other free, open-source parallel BLAST options.  Here's a brief overview
> of what I was able to find.  If I'm missing any significant projects or
> I've got the details wrong please correct me.  Also, this only covers
> parallelizations that use database segmentation.  Query segmentation is
> easy, so many programs have been written to use it exclusively, and I'd be
> hard pressed to list them all.
> 
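
As a rough illustration of why query segmentation is considered easy: the
queries are independent, so a multi-FASTA input can simply be cut into chunks
and each chunk sent to a separate blastall run against the full database. A
minimal Python sketch follows; the function names read_fasta and chunk_queries
are made up for this example and are not taken from any of the packages listed
in this thread.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    seq_parts: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(seq_parts)
            header, seq_parts = line[1:], []
        elif header is not None:
            seq_parts.append(line)
    if header is not None:
        yield header, "".join(seq_parts)


def chunk_queries(records: Iterable[Record], chunk_size: int) -> Iterator[List[Record]]:
    """Group query records into lists of at most chunk_size records."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunk: List[Record] = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # leftover queries that did not fill a whole chunk
        yield chunk
```

With chunk_size set to 1 this gives the "split input files into single query
sequences" behaviour asked about in the original post; each chunk would be
written to its own file and submitted as a separate job.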
> NBLAST
> - designed for NxN comparisons of sequence databases, i.e. every database entry gets BLAST searched against every other database entry
> - stores results in ASN.1 format
> - adjusts e-values using the database length only, providing approximately correct e-values
> - uses MoBiDiCK for job startup on a cluster
> - there is a paper describing it here:  http://www.biomedcentral.com/1471-2105/3/13/
> - uses unmodified NCBI blastall
> 
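
The database-length adjustment mentioned above can be illustrated with a short
sketch. Under the Karlin-Altschul statistics that BLAST uses, the expected
number of chance hits grows in proportion to the size of the search space, so
an e-value reported for a search against one fragment can be scaled up by the
ratio of full database length to fragment length. This is only approximate,
because BLAST works with effective (edge-corrected) lengths rather than raw
ones. The function name rescale_evalue and its arguments are hypothetical;
this is not code from NBLAST or any other package.

```python
def rescale_evalue(fragment_evalue: float,
                   fragment_length: int,
                   full_db_length: int) -> float:
    """Approximate the full-database e-value from a per-fragment e-value.

    Lengths are total residue counts. The e-value is assumed to scale
    linearly with database length, ignoring effective-length corrections.
    """
    if fragment_length <= 0:
        raise ValueError("fragment_length must be positive")
    if full_db_length < fragment_length:
        raise ValueError("full_db_length cannot be smaller than fragment_length")
    return fragment_evalue * (full_db_length / fragment_length)
```

For example, a hit with e-value 1e-5 found in a fragment holding a quarter of
the database would be reported as roughly 4e-5. An alternative that avoids
post-processing is to pass the full database size to each run via blastall's
-z option (effective database length).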
> 
> blast.pm
> - database segmentation
> - part of the mollusc package
> - written in perl, works under unix
> - uses rsh/ssh for job startup (need password-free login to cluster machines)
> - adjusts e-values using a linear-regression model that provides approximate e-value statistics
> - supports text output formats only
> - uses unmodified NCBI blastall
> 
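
Database segmentation, as listed for blast.pm above, means cutting the sequence
database into fragments that are searched separately. One simple way to get
fragments of similar size is a greedy assignment: take sequences longest first
and always add the next one to the fragment with the fewest residues so far.
The sketch below assumes records are (header, sequence) pairs; partition_db is
a hypothetical name, and none of the listed packages is claimed to partition
this way.

```python
import heapq
from typing import Iterable, List, Tuple

Record = Tuple[str, str]  # (header, sequence)


def partition_db(records: Iterable[Record], n_fragments: int) -> List[List[Record]]:
    """Split records into n_fragments lists with roughly equal residue totals."""
    if n_fragments < 1:
        raise ValueError("n_fragments must be at least 1")
    fragments: List[List[Record]] = [[] for _ in range(n_fragments)]
    # Min-heap of (residues assigned so far, fragment index).
    heap = [(0, i) for i in range(n_fragments)]
    heapq.heapify(heap)
    # Longest sequences first, each going to the currently lightest fragment.
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        total, idx = heapq.heappop(heap)
        fragments[idx].append((header, seq))
        heapq.heappush(heap, (total + len(seq), idx))
    return fragments
```

Each fragment would then still need to be written out and indexed with
formatdb before blastall can search it.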
> 
> dBlast
> - database segmentation
> - free only for non-commercial use
> - written in perl, works under unix
> - requires manual database distribution
> - requires OpenPBS for job management
> - e-value adjustments are (purportedly) accurate.  dBlast uses both
>   the effective db length and the effective query length to calculate
>   e-values.  Their clever method for e-value adjustment inspired us to
>   make some changes for the next mpiBLAST release to give accurate e-value
>   statistics.
> - supports text output formats only
> - requires compiling a modified NCBI blastall
> - see http://www.cmbi.kun.nl/software/dBlast/ for more info
> 
> 
> parallelblast by David Mathog
> - database segmentation
> - written in perl/C, works under unix
> - uses PVM, and optionally SGE
> - does approximate e-value adjustment using the effective db length
> - supports text and html output formats
> - requires compiling a modified NCBI blastall
> - http://bioinformatics.oupjournals.org/cgi/content/abstract/19/14/1865?ijkey=13CoOSo3fnITz&keytype=ref
> 
> 
> mpiBLAST
> - database segmentation
> - written in c++, works under unix/windows
> - requires MPI, optionally PBS, SGE, LSF, or Condor
> - e-value adjustments are approximate based on db. length (but as previously mentioned, the next release will include accurate e-value statistics)
> - supports all of the NCBI blastall output formats (text, html, XML, ASN.1)
> - requires compiling the NCBI Toolkit
> - includes code to interface a wwwblast server with mpiBLAST + PBS
> - more info at http://mpiblast.lanl.gov
> 
> 
> Also since you're considering BLAST under Windows, you may want to check
> into what the Cornell Theory Center is using for parallel BLAST on their
> windows cluster.  I don't know whether their software is publicly or
> freely available however.
> 
> None of the freely available options (that I am aware of) currently
> implement combined query and database segmentation.
> 
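
Combined query and database segmentation amounts to running every query chunk
against every database fragment and then collating the hits per query. A
hedged sketch of that bookkeeping is below; task_grid and merge_hits are
invented names, the hit tuples are a simplified stand-in for parsed BLAST
output, and the e-values are assumed to have been adjusted to the full
database size already.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, Iterable, List, Optional, Tuple

Hit = Tuple[str, str, float]  # (query_id, subject_id, adjusted e-value)


def task_grid(n_query_chunks: int, n_db_fragments: int) -> List[Tuple[int, int]]:
    """Return one (query_chunk, db_fragment) pair per independent BLAST job."""
    return list(product(range(n_query_chunks), range(n_db_fragments)))


def merge_hits(per_task_hits: Iterable[List[Hit]],
               top_n: Optional[int] = None) -> Dict[str, List[Tuple[str, float]]]:
    """Collate hits from all jobs into one e-value-sorted list per query."""
    by_query: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for hits in per_task_hits:
        for query_id, subject_id, evalue in hits:
            by_query[query_id].append((evalue, subject_id))
    # Sort ascending by e-value (best hits first) and optionally truncate.
    return {
        query_id: [(subject_id, evalue) for evalue, subject_id in sorted(pairs)[:top_n]]
        for query_id, pairs in by_query.items()
    }
```

The number of jobs is the product of the two segment counts, so the grid
grows quickly; that trade-off between job count and per-job size is the main
design decision in a scheme like this.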
> I've found the lack of a comprehensive resource for information
> on parallel BLAST frustrating.  Hopefully this e-mail will prove to be a
> useful resource for people considering parallel BLAST options.
> 
> 
> -Aaron
> darling(at)cs.wisc.edu
> 
> 
> 
> On Thu, 26 Feb 2004, Micha Bayer wrote:
> 
> > Hi,
> >
> > does anyone know of a non-commercial, open source/free package that
> > provides a parallelisation of BLAST (apart from mpiBLAST which is not
> > suitable for us)?
> >
> > I am interested in something that would split input files into single
> > query sequences, partition the database and collate the results (ideally
> > with an adjustment of the e-values etc).
> >
> > It looks like some of the commercial packages like Paracel do all of the
> > above but I really need an open source version and before I get writing
> > my own I want to make sure I have tried all the available options.
> >
> > I am looking to run a service both on a Windows XP based Condor pool and
> > on a cluster that uses OpenPBS but has no message passing capabilities
> > to speak of.
> >
> > cheers
> >
> > Micha
> >
> >
> > --
> > --------------------------------------------------
> > Dr Micha M Bayer
> > Grid Developer, Bridges Project
> > National eScience Centre, Glasgow Hub
> > 246c Kelvin Building
> > University of Glasgow
> > Glasgow G12 8QQ
> > Scotland, UK
> > Email: michab@dcs.gla.ac.uk
> > Project home page: http://www.brc.dcs.gla.ac.uk/projects/bridges/
> > Personal Homepage: http://www.brc.dcs.gla.ac.uk/~michab/
> > Tel.: +44 (0)141 330 2958
> >
> > _______________________________________________
> > Bioclusters maillist  -  Bioclusters@bioinformatics.org
> > https://bioinformatics.org/mailman/listinfo/bioclusters
> >