[Bioclusters] Details On A Local Blast Cluster

Steve Pittard bioclusters@bioinformatics.org
Mon, 7 Oct 2002 14:32:49 -0400 (EDT)


Well, as has been pointed out, you could use -v on formatdb,
but we basically use a Perl program to take a raw FASTA file
and split it into N pieces, being careful to make sure each
file begins with a FASTA header line. That is, we don't
want to split a given sequence across files. Then we run
formatdb on the splits and push all splits to all nodes.
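
A minimal sketch of that splitter logic (the file names, the
round-robin split, and the formatdb flags are only illustrative;
our real script differs in the details):

  #!/usr/bin/perl -w
  # split_fasta.pl - split a raw FASTA file into N pieces, breaking only
  # at record boundaries (lines starting with ">") so no sequence is cut.
  use strict;

  my ($infile, $n) = @ARGV;                    # e.g.: split_fasta.pl nr 6
  die "usage: split_fasta.pl <fasta file> <N>\n" unless $infile && $n;

  my @out;
  for my $i (1 .. $n) {
      open $out[$i], ">", "$infile.$i" or die "can't write $infile.$i: $!";
  }

  open my $in, "<", $infile or die "can't read $infile: $!";
  my $piece = 0;
  while (<$in>) {
      $piece = ($piece % $n) + 1 if /^>/;      # new record, move to next file
      print { $out[$piece] } $_;
  }
  close $in;
  close $out[$_] for 1 .. $n;

  # afterwards: formatdb each split (e.g. formatdb -i nr.1 -p T for a
  # protein database) and push nr.1 .. nr.N out to every node.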

One could dedicate a given node to a given split, but then
you lose generality in the cluster. It could be argued that
if you stash a piece of, say, NR on node 3 you could
be assured of having that split in memory at all times on
that node, thus facilitating a more rapid result, but I've never taken
the time or energy to experiment with this, especially since
I've been happy with the results thus far.

The flow for the end-user BLAST page is as follows.

1) User loads up a web page and logs in. He/she gets a list
   of the 5 most recently completed BLAST results. We stash
   the result URLs in a MySQL database. This is so users won't
   keep doing the same BLAST over and over. (Some still do anyway.)
   They can click the history link and they have their results.

2) They get a BLAST page which looks very similar to the one at NCBI.

3) They paste in a sequence, select a database and a program, and
   hit submit.

4) The backend Perl program then does a number of things:
 
   *) Scrapes the form for BLAST options

   *) Creates a results directory  based on PID
 
   *) Inspects the cluster for health (does some LSF status commands 
      to make sure the cluster is ready for submissions)

   *) Then calls another Perl program which kicks off six different
      batch jobs (we call bsub since we are using LSF). Another bsub job
      then waits around until all 6 "sub-blasts" are done. Then the
      reports are collected and merged into a single larger report.
      We run this through another Perl program to add in links back to
      NCBI and other types of info. (A rough sketch of the submission
      step follows this list.)
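
In case it helps, here is a stripped-down sketch of what that
submission step looks like under LSF. The queue name, the local
database path, the blastall options, and the merge_reports.pl helper
are all made up for the example; they are not our production code:

  #!/usr/bin/perl -w
  # submit_blast.pl - fan one query out over the 6 database splits,
  # then merge the reports once all of the sub-blasts have finished.
  use strict;

  my ($program, $db, $query, $resdir) = @ARGV;  # e.g. blastn nr query.fa /results/12345
  my $splits = 6;
  my $tag    = "blast$$";                       # PID keeps job names unique

  for my $i (1 .. $splits) {
      # each sub-blast searches one split that already lives on the node
      my $cmd = "blastall -p $program -d /local/db/$db.$i -i $query -o $resdir/part.$i.out";
      system("bsub -q blast -J ${tag}_$i -o $resdir/part.$i.log \"$cmd\"") == 0
          or die "bsub of split $i failed\n";
  }

  # a final job waits until all six are done, then merges the reports
  # (merge_reports.pl is a hypothetical stand-in for our merge/link script)
  my $wait = join(" && ", map { "done(${tag}_$_)" } 1 .. $splits);
  system("bsub -q blast -w \"$wait\" -o $resdir/merge.log " .
         "\"merge_reports.pl $resdir > $resdir/report.html\"") == 0
      or die "bsub of merge job failed\n";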


After the user hits submit in Step #3 we return an
acknowledgement page along with the URL of a results page and
the number of BLAST jobs currently being run. They
are given the choice of returning to the main pages to submit
another BLAST or going to the results URL, which is a self-updating
page.
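
For what it's worth, the self-updating part can be as simple as a
little CGI that refreshes itself until the merged report exists. This
is only an illustration; the path and parameter name are invented:

  #!/usr/bin/perl -w
  # results.cgi - show the merged report if it exists, otherwise tell
  # the browser to reload this page in 15 seconds.
  use strict;
  use CGI qw(param header);

  my $id = param('id') || '';
  $id =~ /^\d+$/ or die "bad job id";          # result dirs are named by PID
  my $report = "/results/$id/report.html";     # illustrative path

  print header(-type => 'text/html');
  if (-s $report) {
      open my $fh, "<", $report or die "can't read $report: $!";
      print while <$fh>;
  } else {
      print "<html><head><meta http-equiv=\"refresh\" content=\"15\"></head>\n",
            "<body>Your BLAST is still running; this page updates itself.</body></html>\n";
  }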
  
On the LSF backend, we reward the more targeted search.
That is to say, we have created different queues based on the database
being searched and the program being used. Some users
*always* do the most general search, and translated ones at that, so to
deal with this we created a "translated" queue which drains more
slowly than the one for people doing, say, blastn against E. coli.
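
The queue choice itself is nothing fancy; conceptually it is just a
lookup on the program and database the user picked. The queue and
database names here are only examples:

  # translated searches drain more slowly; small databases get a fast queue
  sub pick_queue {
      my ($program, $db) = @_;
      return "translated" if $program =~ /^(blastx|tblastn|tblastx)$/;
      return "fast"       if $db =~ /^(microbial|ecoli)$/;   # example names
      return "blast";                                         # general default
  }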

For the smaller databases (microbial, for example) we don't
even split them up. We just format them, push them out, and
use one bsub to handle them. Actually we don't really have to
use the cluster to get these handled, but we do since we have
fast-running queues for these databases.
   
Now, if you wanted to use TurboGenomics' TurboBlast or Paracel's
BLAST then you basically insert it after step #3 above. Both
products require some prep work before you can submit a search
against a database, but it's usually a one-time affair, so after
that you can have your backend Perl program call (in the case
of TurboBlast) "tblastall".

Both products take care of all the database distribution and
report merging so you don't have to do any of that. Your
Perl program to call them is short and sweet. And you
could run them under LSF, SGE, or PBS. (But you don't have to.)

Also keep in mind that both products offer their own Web front
ends, so you don't even have to write your own; you could go that
route if you wanted to avoid the programming. I'm sure that
both companies would provide demos.

I like having the cluster since it solves a number of general
problems for us. The cool thing about having any load management
system is that you can accommodate the casual BLAST user AND the
researcher who wants to blast millions of reads. In the latter
case we have queues set up to hold those jobs until after
5 p.m., at which point they kick in. The Web users take a back
seat to these jobs until 6 a.m., at which point the heavy-duty
jobs take a back seat. We really like being able to maximize
the use of the cluster in this fashion. Whatever load management
software you decide to use, take the time to get into it.
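
In LSF that kind of time-windowed queue is just a stanza in
lsb.queues, something along these lines (the name, priority, and
window are only meant to show the idea):

  Begin Queue
  QUEUE_NAME  = overnight
  PRIORITY    = 20
  RUN_WINDOW  = 17:00-6:00
  DESCRIPTION = heavy-duty batch BLASTs, dispatched only between 5 p.m. and 6 a.m.
  End Queue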


Steve Pittard	 | http://catalina.bimcore.emory.edu (HOME PAGE)
Emory University | wsp@emory.edu, wsp@bimcore.emory.edu  (INTERNET) 
BIMCORE Support	 | 404 727 0038 

On Mon, 7 Oct 2002, Bala wrote:

> Hi Steve,
>         Thanks for the info related to BLAST
> running in a cluster environment. I am running
> NCBI BLAST on a single machine and am planning
> to run in a clustered environment. I installed
> SGE; after that I am wondering how to split the
> database, and how about the BLAST? Do we need to
> have a separate copy on each exec node?
>
>  If you don't mind, give me some more info on this.
> 
> -bala-
> 
> --- Tim White <w.t.white@massey.ac.nz> wrote:
> > Hello Steve,
> > 
> > Just wanted to say thanks for some very useful comments on running
> > BLAST on a cluster for interactive use, which is what we will be
> > doing at the Allan Wilson Centre too.  It was good to hear about
> > specifics like how many ways you had to partition the database to
> > fit it into memory (6), your choice of load management software and
> > niggling things like the alignment graphic, all of which are issues
> > we will be facing shortly.
> > 
> > Thanks!
> > 
> > Tim White
> > 
> > ----- Original Message -----
> > From: "Steve Pittard" <wsp@emory.edu>
> > To: "biocluster" <bioclusters@bioinformatics.org>
> > Sent: Sunday, October 06, 2002 10:08 AM
> > Subject: [Bioclusters] Details On A Local Blast
> > Cluster
> > 
> > 
> > >
> > > Hello/Bonjour,
> > >
> > > I wanted to provide a bit of information about our local blast
> > > server for the benefit of those looking to do the same. A mere 6
> > > months ago when I first went about this I didn't have a solid
> > > grasp of all the issues (not that I do now) but I've certainly
> > > learned a great deal and don't mind passing that on with the
> > > sincere hope that I can help others engaged in similar pursuits.
> > >
> > >
> > > We had two aims:
> > >
> > > 1) Be able to use Blast (NCBI & WU-BLAST) with millions of
> > >    sequence reads against a given genome
> > >
> > > 2) Offer a local, web-based implementation of NCBI Blast for
> > >    those tired of long queue waits at NCBI
> > >
> > > We have been able to achieve both goals using the same cluster
> > > setup although we are finding that we need to expand to
> > > accommodate researchers who have since discovered the existence
> > > of the cluster and wanted to jump on board.
> > >
> > > Our setup is very modest. We have 14 CPUs - 6 Appro 1100
> > > (www.appro.com) with dual AMD Athlon 1600+ processors and 2 GB
> > > RAM each. We have 2 40 GB ATA drives per node running RedHat 7.3.
> > > Our decision to go with Appro was based purely on cost since one
> > > of our sources of funding backed out at the last minute. We were
> > > looking at an RLX solution (see discussion down low) but the
> > > money wasn't there so Appros were selected. We went with fast
> > > ethernet, a cheap switch, and a $400 rack to house it all. We did
> > > have to install a dedicated circuit to accommodate the electrical
> > > load but we house the setup in a standard office. It's a bit
> > > noisy and warm but fine.
> > >
> > > We wanted to be able to house database splits locally on each
> > > node since I did not want to rely on NFS to supply the databases.
> > > This has worked well despite the hassle (a minimal one) of
> > > pushing out data to each node after a new version of a database
> > > comes out. That's soon to be automated - for example, download
> > > the latest version of nr, split it, formatdb each split, and push
> > > out the splits to each cluster node. The script is easily written.
> > >
> > > We purchased Platform LSF 5.0 licenses to manage the cluster and
> > > as a side benefit they had example Perl scripts that provided
> > > working examples of how to split up target databases and
> > > associated queries to take advantage of the cluster, thus
> > > economizing search time. There is nothing particularly magic
> > > about these programs though they do work well. You could
> > > certainly write your own or easily modify theirs to suit your
> > > specific needs. It's also possible to adapt the scripts for use
> > > with GridEngine or PBS.
> > >
> > > I do like LSF a great deal and the support I have received from
> > > Platform has been very good. Despite the appeal of LSF I think
> > > it's becoming clear that Grid Engine could be used to accomplish
> > > many of the same things. I like LSF and if our budget holds out
> > > then I will retain those licenses, but SGE is free and works
> > > pretty well also. Perhaps some SGE zealot could write an LSF to
> > > SGE conversion document?
> > >
> > > With regard to our first aim it turned out that BLAST was not
> > > really the bottleneck but rather the vector screening and
> > > repeatmasking. We did employ the option of RepeatMasker which
> > > selects WU-BLAST as the masking tool instead of the default
> > > cross_match. This sped things up quite a bit. In any case, using
> > > the cluster, we were able to knock out screening and masking in
> > > about 1/30 of the time it used to take before we had the cluster.
> > > A huge win for not a lot of money. Granted, some of the
> > > performance improvement was due to learning how better to employ
> > > various programs in the pipeline, but the cluster was undeniably
> > > the key factor in the performance enhancement.
> > >
> > >
> > > With regard to our second aim we have been able to offer
> > > Web-based NCBI-like services to a select group of people on an
> > > intranet. They load a web page, log in, get a BLAST page, paste
> > > in a sequence, select a target database and program, and submit
> > > the BLAST, which gets distributed to the cluster for processing.
> > > We have the databases split 6 ways, which means the databases can
> > > fit into the memory on a given node. With only 14 CPUs we
> > > certainly aren't setting any speed records, but by limiting the
> > > availability of the service combined with the load balancing we
> > > can return results back to people within a minute or two even for
> > > translated BLASTs against larger databases.
> > >
> > > Obviously this scenario is a queuing problem since we never know
> > > how many simultaneous users are going to be kicking off a job.
> > > Even so we have developed different queues for different users
> > > and the various types of BLASTs in an effort to provide a fair
> > > use policy. The result they get back is a single report merged
> > > from other reports. They get active links back to NCBI.
> > >
> > > We are lacking the alignment graphic which appears with standard
> > > NCBI-issued reports though I would like to be able to provide
> > > that. Thus far I haven't found a quick way
> > 
> === message truncated ===
> 
> 
> _______________________________________________
> Bioclusters maillist  -  Bioclusters@bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters
>