[Bioclusters] BLAST/ PBS / Grid Engine

Ivo Grosse bioclusters@bioinformatics.org
Fri, 17 May 2002 18:56:59 -0400


Hi Steve,

we don't run a Blast web server here, but we have multiple users 
submitting thousands (or tens of thousands) of Blast jobs to the cluster, 
and I think from a computational standpoint that's the same.  So how do 
we distribute those 10^4 or 10^5 Blast jobs from the login node over 
the cluster nodes?  We use SGE to distribute the jobs, and we do not 
partition the database into fragments.  Instead, we put identical 
copies of the database (or databases) on each slave node, and whenever 
a new job arrives, the database is read from the local disk.  Of 
course, that requires that the database fit in RAM, so you should 
plan on at least 1 GB of RAM per processor.
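
As a rough sketch, the submission from the login node might look like 
the following.  The blastall and qsub invocations are standard NCBI 
BLAST and SGE usage, but the database path, program choice, and file 
names here are illustrative placeholders, not details from our setup:

```python
# Build the command lines for one BLAST job per query fragment and its
# SGE submission.  Hypothetical helper names; paths are placeholders.

def blast_command(query_file, db="/local/db/nr", program="blastp"):
    # Each slave node keeps its own copy of the database on local disk,
    # so -d points at the local path rather than a shared filesystem.
    return ["blastall", "-p", program, "-d", db, "-i", query_file,
            "-o", query_file + ".out"]

def qsub_command(job_script):
    # One SGE job per query fragment; SGE spreads them over the nodes.
    return ["qsub", job_script]
```

With one such job script per fragment, SGE's scheduler takes care of 
keeping every node busy without any manual assignment.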

What we partition is the query sequence, typically into fragments of 
101 kb overlapping by 1 kb, or into fragments of 1001 kb overlapping by 
1 kb.  That partitioning will allow you, for example, to Blast the 
human genome against the mouse genome in a few days, so in your case it 
may prevent the situation where many nodes are idle while one node is 
busy for months.  (I guess that if the comparison of mouse to human 
were assigned to just one node, it might indeed run for several months.)
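
The partitioning scheme above can be sketched as follows; the function 
name and the small defaults are illustrative, but the logic is just a 
sliding window with a stride of size minus overlap (100 kb for 101 kb 
fragments overlapping by 1 kb):

```python
# Split a long query sequence into overlapping fragments, as described
# above: fragments of `size` bases, adjacent ones sharing `overlap`
# bases.  Returns (start_offset, subsequence) pairs so hits can be
# mapped back to coordinates in the full query.

def split_query(seq, size=101_000, overlap=1_000):
    step = size - overlap  # 100 kb stride for 101 kb / 1 kb overlap
    fragments = []
    start = 0
    while start < len(seq):
        fragments.append((start, seq[start:start + size]))
        if start + size >= len(seq):
            break  # the last fragment reaches the end of the query
        start += step
    return fragments
```

The 1 kb overlap ensures that an alignment falling on a fragment 
boundary is still found whole in at least one of the two fragments, as 
long as it is shorter than the overlap.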

Best regards, Ivo