[Bioclusters] batching of blast searches

Joseph Landman bioclusters@bioinformatics.org
18 Mar 2003 09:43:20 -0500


Hi Andy:

  It is run and job dependent, but I have found that batches of 7-20
sequences per run give the best throughput.  I have done this study a
number of times, and the optimum definitely changes with each algorithm
and database.

  You are fighting the database load time (actually an mmap), as well as
the queue latency, against the sequence comparison time, which is
dominated by the search portion.  Your wall clock time for 1 sequence per
queued job will be worse than (for an N CPU run) using N bins to
collect the sequences.  N bins is also not optimal (subject to the
fuzziness of the information I have here).
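
  To make the trade-off concrete, here is a back-of-envelope cost model
(not from any real measurements; the overhead and per-sequence search
times below are invented, and the function is just a sketch):

    # Rough model: each queued job pays a fixed overhead (database
    # mmap/load plus queue latency) once, while search time scales
    # with the number of sequences in the batch.  All numbers are
    # made up for illustration.
    import math

    def wall_clock(n_seqs, batch_size, overhead_s=30.0,
                   search_s_per_seq=5.0, n_cpus=16):
        """Estimated wall clock time for n_seqs sequences split into
        batches of batch_size, running n_cpus jobs at a time."""
        n_jobs = math.ceil(n_seqs / batch_size)
        per_job = overhead_s + batch_size * search_s_per_seq
        waves = math.ceil(n_jobs / n_cpus)  # jobs run n_cpus at a time
        return waves * per_job

    for b in (1, 8, 64, 256):
        print(b, wall_clock(1000, b))

Past a point, bigger batches stop helping: once there are fewer jobs
than CPUs, nodes sit idle, which is why the optimum is an interior
point rather than one giant job.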

  What I did to find it was to take a set of jobs and partition them
into chunks of 1,2,4,8,16,32,64,...,2**12 sequences (I had a large
number of tomato ESTs that I had been using for this).  I then measured
the wall clock time to run completion as a function of the chunk size.
Using this, I built a finer grid (e.g. 4,5,6,7,8,9,...) around the
optimum, and reran.  I was able to eyeball the best chunk size from the
chart.
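
  In script form, the sweep looked something like the sketch below.
This is a reconstruction rather than the original script: the file
names, database, and use of legacy blastall are placeholders, and it
times the chunks serially on one node instead of through the queue:

    import subprocess, time

    def fasta_records(path):
        """Yield complete FASTA records (header plus sequence lines)."""
        record = []
        for line in open(path):
            if line.startswith(">") and record:
                yield "".join(record)
                record = []
            record.append(line)
        if record:
            yield "".join(record)

    # placeholder query set; any multi-FASTA file works
    records = list(fasta_records("tomato_ests.fa"))

    for chunk_size in [2**k for k in range(13)]:  # 1, 2, 4, ..., 4096
        t0 = time.time()
        for start in range(0, len(records), chunk_size):
            with open("query.fa", "w") as out:
                out.writelines(records[start:start + chunk_size])
            subprocess.run(["blastall", "-p", "blastn", "-d", "nt",
                            "-i", "query.fa", "-o", "/dev/null"],
                           check=True)
        print(chunk_size, "seqs/job:", round(time.time() - t0, 1), "s")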

  Joe

On Tue, 2003-03-18 at 04:23, andy law (RI) wrote:
> All,
> 
> As we start to use our compute farm for bigger and bigger tasks, I
> have come to realise that the way we are currently thinking about
> submitting our blast jobs is considerably sub-optimal. Obviously 1 run
> of 100 sequences against a database is much more efficient than 100
> separate runs against the same database. Has anyone developed scripts
> to sit inside some part of a queue submission system (in this case
> SGE) to make these things more efficient? I'm thinking along the lines
> of something that monitors the size and number of queries, notes the
> number of available nodes, and batches the jobs up to match one
> against the other?
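
Something like the sketch below would be a starting point for the
wrapper Andy describes.  It is entirely hypothetical, not an existing
tool: run_blast.sh and the clamping heuristic are invented, and only
qsub being on the path is assumed:

    # Hypothetical SGE batcher: group query files into one job per
    # node, clamped to the 7-20 sequences/job range that has worked
    # well empirically.  run_blast.sh is an assumed job script that
    # blasts each file named on its command line.
    import math, subprocess, sys

    def submit_batches(queries, n_nodes, min_batch=7, max_batch=20):
        batch = max(min_batch,
                    min(max_batch, math.ceil(len(queries) / n_nodes)))
        for i in range(0, len(queries), batch):
            subprocess.run(["qsub", "run_blast.sh"]
                           + queries[i:i + batch], check=True)

    if __name__ == "__main__":
        # usage: batcher.py <n_nodes> query1.fa query2.fa ...
        submit_batches(sys.argv[2:], n_nodes=int(sys.argv[1]))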

-- 
Joseph Landman, Ph.D.
Scalable Informatics LLC
email: landman@scalableinformatics.com
  web: http://scalableinformatics.com
phone: +1 734 612 4615