[Bioclusters] Daemonizing blast, ie running many sequences through 1 process

Ian Korf bioclusters@bioinformatics.org
Fri, 7 Nov 2003 09:56:05 +0000


If you have space on your nodes then you should definitely store the
BLAST databases locally. Depending on the BLAST parameters, you may
want to split the database into chunks that are cacheable and then
merge the results. BLASTN with highly insensitive parameters on a fast
machine can become I/O-limited (and if you're running such insensitive
parameters, perhaps you ought to use a different algorithm). BLASTP and
the other programs are not going to benefit much from caching, as they
spend quite a bit of time exploring alignments. It sounds like you're
running BLASTN, though.

Now, if you don't have room on your nodes to hold the database (maybe
the nodes are diskless), then you definitely want to split the database
into cacheable chunks, so that each slice only has to be read once over
NFS. If you're a WU-BLAST user, you can do the splitting dynamically
with command-line parameters, so you don't have to physically split the
database. One advantage of keeping the data on an NFS server is that
updates are easier to manage. I think dynamic splitting over NFS is a
very good idea for diskless nodes. The server ought to have loads of
RAM so it can serve from memory rather than disk, and it ought to have
gigabit Ethernet (the nodes can stay at 100 Mbit, since they will only
be I/O-limited when requesting the first slice).

<shameless_plug>Such topics are covered in chapter 12, "Hardware and
Software Optimizations", of the O'Reilly BLAST book.</shameless_plug>
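
To make the split-and-merge idea concrete, here is a minimal sketch in
Python. It assumes the legacy NCBI formatdb/blastall tools are on the
PATH and that the database is a single FASTA file; the chunk count,
file names, and BLASTN parameters are placeholders, not recommendations.

#!/usr/bin/env python
# Sketch: split a FASTA database into cache-sized chunks, format each
# chunk, search every chunk with the same queries, and concatenate the
# tabular output. File names and chunk count are placeholders.
import os

DB_FASTA = "nt.fasta"      # placeholder database (single FASTA file)
QUERY    = "queries.fa"    # placeholder multi-FASTA query file
N_CHUNKS = 8               # pick so each formatted chunk fits in RAM
OUT      = "all_hits.m8"

# 1. Round-robin the database sequences into N chunk files.
chunks = [open("chunk%d.fa" % i, "w") for i in range(N_CHUNKS)]
c = -1
for line in open(DB_FASTA):
    if line.startswith(">"):
        c = (c + 1) % N_CHUNKS
    chunks[c].write(line)
for fh in chunks:
    fh.close()

# 2. Format and search each chunk; -m 8 gives tabular hits that are
#    trivial to concatenate into one merged report.
for i in range(N_CHUNKS):
    os.system("formatdb -i chunk%d.fa -p F" % i)
    os.system("blastall -p blastn -d chunk%d.fa -i %s -m 8 >> %s"
              % (i, QUERY, OUT))

Splitting on sequence boundaries keeps every chunk a valid FASTA file.
The one subtlety when merging is that E-values depend on the database
size, so each chunk search should pin the effective database length
(blastall's -z option) if the merged scores need to be comparable.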

-Ian

On Friday, November 7, 2003, at 06:07 AM, Michael.James@csiro.au wrote:

> We have a problem with 66 nodes becoming NFS bound
>  when blasting many (>10,000) sequences
>  against the same database set.
>
> One approach (which we are trying) is to cache database files locally,
>  so nodes can re-read their files without bottlenecking on NFS.
>
> A totally different approach, with even better performance potential,
>  would be if a blast process could start up, load its database(s)
>  and process multiple queries until told to exit.
>
> This dilutes the startup cost across all the jobs to be run on that 
> node.
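
As a rough sketch of that amortization: the legacy NCBI blastall will
work through every sequence in its -i query file, so batching many
queries into one invocation already spreads the startup and database
load across the whole batch. The Python below, with placeholder batch
size, database name, and file names, just makes the batching explicit
so batches can be farmed out to nodes.

#!/usr/bin/env python
# Sketch: carve a large multi-FASTA query set into batches and run one
# blastall process per batch, so the database is loaded once per batch
# rather than once per query. Batch size and names are placeholders.
import os

QUERIES    = "all_queries.fa"   # placeholder: the >10,000 sequences
DATABASE   = "mydb"             # placeholder formatted BLAST database
BATCH_SIZE = 500                # queries handled by one blastall run

def run_batch(lines, n):
    qfile = "batch%d.fa" % n
    open(qfile, "w").writelines(lines)
    os.system("blastall -p blastn -d %s -i %s -m 8 -o batch%d.m8"
              % (DATABASE, qfile, n))

batch, n_batch, n_seqs = [], 0, 0
for line in open(QUERIES):
    if line.startswith(">"):
        if n_seqs == BATCH_SIZE:            # current batch is full
            run_batch(batch, n_batch)
            batch, n_batch, n_seqs = [], n_batch + 1, 0
        n_seqs += 1
    batch.append(line)
if batch:
    run_batch(batch, n_batch)               # flush the final batch

Each batch file can then be dispatched to a different node through the
queueing system; larger batches amortize the startup cost further, at
the price of coarser scheduling and more work lost if a job dies.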
>
> Does NCBI blast do this?
> Is there a blast that does?
> Anyone interested in writing one?
> What's involved?
>
> Thanks for any pointers,
> michaelj
>
> -- 
> Michael James				michael.james@csiro.au
> System Administrator			voice:	02 6246 5040
> CSIRO Bioinformatics Facility	fax:		02 6246 5166
> _______________________________________________
> Bioclusters maillist  -  Bioclusters@bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters
>