[Bioclusters] Details On A Local Blast Cluster

Sat, 5 Oct 2002 17:08:40 -0400 (EDT)

Hello/Bonjour,

I wanted to provide a bit of information about our local
blast server for the benefit of those looking to do the 
same. A mere 6 months ago when I first went about this I 
didn't have a solid grasp of all the issues (not that I do 
now) but I've certainly learned a great deal and don't 
mind passing that on with the sincere hope that I can help 
others engaged in similar pursuits. 

We had two aims: 

1) Be able to use Blast (NCBI & WU-BLAST) with millions
   of sequence reads against a given genome

2) Offer a local, web-based implementation of NCBI Blast
   for those tired of long queue waits at NCBI

We have been able to achieve both goals using the same
cluster setup, although we are finding that we need to expand
to accommodate researchers who have since discovered
the existence of the cluster and want to jump on board.

Our setup is very modest. We have 14 CPUs - 6 Appro
1100 (www.appro.com) with dual AMD Athlon 1600+ processors
and 2 GB RAM each. Each node has two 40 GB ATA drives and
runs RedHat 7.3. Our decision to go with Appro was based purely
on cost, since one of our sources of funding backed out at
the last minute. We were looking at an RLX solution (see
discussion below) but the money wasn't there, so Appros
were selected. We went with fast ethernet, a cheap switch,
and a $400 rack to house it all. We did have to install a
dedicated circuit to accommodate the electrical load, but we
house the setup in a standard office. It's a bit noisy and
warm but fine.

We wanted to be able to house database splits locally on
each node since I did not want to rely on NFS to supply
the databases. This has worked well despite the
hassle (a minimal one) of pushing out data to each node
after a new version of a database comes out. That's
soon to be automated: for example, download the latest
version of nr, split it, formatdb each split, and push out
the splits to each cluster node. The script is easily written.
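
As a rough illustration of the "split it" step above, here is a
minimal Python sketch. It is not the script used on this cluster;
the function names and the size-balancing rule are assumptions
made for the example. It reads FASTA records and assigns each one
to whichever split currently holds the fewest residues, so the
pieces come out roughly equal in size.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header line including '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_records(records: Iterable[Record], n_splits: int) -> List[List[Record]]:
    """Assign each record to the split that is currently smallest."""
    if n_splits < 1:
        raise ValueError("n_splits must be at least 1")
    splits: List[List[Record]] = [[] for _ in range(n_splits)]
    sizes = [0] * n_splits  # residues assigned to each split so far
    for header, seq in records:
        i = sizes.index(min(sizes))
        splits[i].append((header, seq))
        sizes[i] += len(seq)
    return splits


def write_fasta(records: Iterable[Record], path: str, width: int = 60) -> None:
    """Write records to path, wrapping sequence lines at `width` columns."""
    with open(path, "w") as out:
        for header, seq in records:
            out.write(header + "\n")
            for start in range(0, len(seq), width):
                out.write(seq[start:start + width] + "\n")
```

Each piece written this way would then be run through formatdb
and copied to its node; those steps depend on local paths and
are left out here.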

We purchased Platform LSF 5.0 licenses to manage the 
cluster and as a side benefit they had example Perl scripts 
that provided working examples on how to split up target 
databases and associated queries to take advantage of the 
cluster thus economizing search time. There is nothing 
particularly magic about these programs though they do work 
well. You could certainly write your own or easily modify 
theirs to suit your specific needs. It's also possible to
adapt the scripts for use with GridEngine or PBS.
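
To give a feel for what such a script does, here is a small
Python sketch (not Platform's Perl code). It batches query files
and pairs every query file with every database split, producing
one LSF job per pair. The file names, the queue name "normal",
and the output naming scheme are assumptions for the example;
only the basic bsub options (-q, -o) and blastall options
(-p, -d, -i, -o) are used.

```python
import os
from typing import List, Sequence


def chunk(items: Sequence[str], size: int) -> List[List[str]]:
    """Group items (e.g. query sequences) into batches of `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def blast_jobs(query_files: Sequence[str], db_splits: Sequence[str],
               program: str = "blastn", queue: str = "normal") -> List[List[str]]:
    """Build one bsub command line per (query file, database split) pair."""
    jobs = []
    for query in query_files:
        for db in db_splits:
            # Hypothetical naming scheme: <query>.<split name>.out
            out = "%s.%s.out" % (query, os.path.basename(db))
            jobs.append(["bsub", "-q", queue, "-o", out + ".log",
                         "blastall", "-p", program,
                         "-d", db, "-i", query, "-o", out])
    return jobs
```

Each command list could be handed to subprocess.run() on the
submit host. Moving to SGE or PBS would mostly mean swapping the
bsub prefix for the equivalent submit command.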

I do like LSF a great deal and the support I have received
from Platform has been very good. Despite the appeal of LSF,
I think it's becoming clear that Grid Engine (SGE)
could be used to accomplish many of the same things. If
our budget holds out then I will retain those licenses,
but SGE is free and works pretty well also.
Perhaps some SGE zealot could write an LSF-to-SGE
conversion document?

With regard to our first aim, it turned out that BLAST was
not really the bottleneck but rather the vector screening and
repeat masking. We did employ the RepeatMasker option which
selects WU-BLAST as the masking tool instead of the default
cross_match. This sped things up quite a bit. In any
case, using the cluster, we were able to knock out
screening and masking in about 1/30 of the time it used to
take before we had the cluster. A huge win for not a lot of money.
Granted, some of the performance improvement was due to learning
how better to employ various programs in the pipeline,
but the cluster was undeniably the key factor in performance
enhancement.

With regard to our second aim, we have been able to offer
Web-based NCBI-like services to a select group of people on
an intranet. They load a web page, log in, get a BLAST page,
paste in a sequence, select a target database and program,
and submit the BLAST, which gets distributed to the cluster
for processing. We have the databases split 6 ways,
which means each split can fit into the memory on a
given node. With only 14 CPUs we certainly aren't setting
any speed records, but by limiting the availability of the
service, combined with the load balancing, we can return results
to people within a minute or two, even for translated Blasts
against larger databases.

Obviously this scenario is a queuing problem since we
never know how many simultaneous users are
going to be kicking off a job. Even so, we have developed
different queues for different users and the various types of
BLASTs in an effort to provide a fair use policy. The result they
get back is a single report merged from the reports for
each database split, with active links back to NCBI.
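
The merge step can be sketched in Python as follows. This is
not the code behind our reports: it assumes the searches were
run with tabular output (blastall -m 8, twelve tab-separated
columns with E-value and bit score last) rather than the
standard pairwise report, and that each split was searched with
the same effective database size (blastall's -z option) so that
E-values from different splits are comparable. The function
names and the max_hits cutoff are made up for the example.

```python
from typing import Dict, Iterable, List


def parse_hit(line: str) -> dict:
    """Parse one line of 12-column tabular BLAST output."""
    fields = line.rstrip("\n").split("\t")
    return {
        "query": fields[0],
        "subject": fields[1],
        "evalue": float(fields[10]),
        "bitscore": float(fields[11]),
        "line": line.rstrip("\n"),
    }


def merge_reports(reports: Iterable[Iterable[str]],
                  max_hits: int = 50) -> Dict[str, List[str]]:
    """Merge per-split hit lists into one ranked list per query."""
    by_query: Dict[str, List[dict]] = {}
    for report in reports:
        for line in report:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and comment lines
            hit = parse_hit(line)
            by_query.setdefault(hit["query"], []).append(hit)
    merged = {}
    for query, hits in by_query.items():
        # Best hits first: lowest E-value, ties broken by higher bit score.
        hits.sort(key=lambda h: (h["evalue"], -h["bitscore"]))
        merged[query] = [h["line"] for h in hits[:max_hits]]
    return merged
```

Each element of `reports` can be an open file handle for one
split's output, so the six result files are simply passed in as
a list.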

We are lacking the alignment graphic which appears with
standard NCBI-issued reports, though I would like to be able to
provide that. Thus far I haven't found a quick way
to take my Blast report and run it through a program to
produce that graphic. The NCBI helpdesk referred me to their
toolbox and said I could dig out the code and write my own
version, but I was hoping someone had done this already. We
might write our own but it's not a major issue. The users are
reasonably content with the reports and active links.

Of course, as we all know, there are a number of companies
offering ready-made cluster solutions:

RLX, MicroWay, RackWay, Penguin, and HP ProLiant, with Sun
also getting in on the action. So you might benefit from
a discussion with one or more of these vendors should
you be in the cluster market. I've been talking with
RLX recently and I really like their control tower concept,
which has some very nice software tools to let you
manage and provision their blades. They've done a good
job in that area.

Plus they have a small footprint and low power consumption, which
is ideal for non-data-center clusters. So if you wanted to
set up a cluster in the corner of your laboratory then you could.
They resell LSF so you get that under the hood. I am not
thrilled with their use of laptop-quality drives on the blade,
but depending on your application this might not be such a
big deal. Also I think they have some new stuff coming out so
check with them to get the latest. One thing is clear - a
lot of vendors are out there ready to sell you a cluster.
Take your time.

Relative to software, I have also tested out TurboGenomics'
TurboBlast and have recently been evaluating Paracel's
Blast product. Both of these have their strengths, but
understand that they are packages designed to do high-throughput
blasting, so there isn't really load management built in
to them. They are meant to benefit from a cluster environment,
so you can quite easily use either with a load manager. For
more about my experiences with either of these products, drop
me a line.

Regards,

Steve Pittard	 | http://catalina.bimcore.emory.edu (HOME PAGE)
Emory University | wsp@emory.edu, wsp@bimcore.emory.edu  (INTERNET) 
BIMCORE Support	 | 404 727 0038