[Bioclusters] Grid BLAST
david coornaert
dcoorna at dbm.ulb.ac.be
Mon Sep 26 05:22:43 EDT 2005
What are you using to merge the outputs? (and to manage the stats...)
===============================================
David Coornaert (dcoorna at dbm.ulb.ac.be)
Belgian Embnet Node (http://www.be.embnet.org)
Université Libre de Bruxelles
Laboratoire de Bioinformatique
12, Rue des Professeurs Jeener & Brachet
6041 Gosselies
BELGIQUE
Tél: +3226509975
Fax: +3226509998
===============================================
Tim Cutts wrote:
>
> On 24 Sep 2005, at 7:40 pm, Warren Gish wrote:
>
>>> Hi, I'm the administrator of the bioinformatics laboratory at Université
>>> du Québec à Montréal. I have a room filled with dual P4 3GHz
>>> workstations. The boxen are dual-booted with Windows and GNU/Linux,
>>> but they spend most of their time on GNU/Linux. Each box has 2 GB of
>>> RAM, so I expected decent performance with local BLAST jobs, but the
>>> sad truth is that my jobs run about 4 times slower with blast2 than
>>> with blastcl3 with the same parameters. The hard drive is IDE, so I
>>> suspect a bottleneck here.
>>>
>> Make sure the IDE drivers are configured to use DMA I/O, but if repeat
>> searches of a database are just as slow as the first time it is
>> searched, then experience indicates the problem is that the amount of
>> free memory available is insufficient to cache the database files.
>> Database file caching is a tremendous benefit for blastn searches. If
>> your jobs use too much heap memory, though, no memory may be available
>> for file caching.
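
(On the DMA point: assuming these are plain IDE disks under a stock Linux
kernel, showing up as /dev/hda or similar, a quick check from memory is
"hdparm -d /dev/hda" to see whether using_dma is on, "hdparm -d1 /dev/hda"
to switch it on, and "hdparm -tT /dev/hda" for a rough read-throughput
figure; adjust the device name for your boxes and double-check the hdparm
man page.)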
>
>
> I often see caching problems if people have written their pipeline
> code incorrectly too; people naturally tend to write things like:
>
> foreach $seq (@sequences) {
>     foreach $db (@databases) {
>         system("blastn ...");
>     }
> }
>
> which is, of course, exactly the wrong way round, and guarantees
> trashing the disk cache every single time.
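
For what it's worth, the cache-friendly version is just the same two loops
swapped, so each database hits the disk once and then stays cached for every
query; a minimal sketch, using the same variables and command line as above:

    foreach $db (@databases) {        # outer loop: each database is read (and cached) once
        foreach $seq (@sequences) {   # inner loop: every query reuses the cached database files
            system("blastn ...");     # same search as before, just in a cache-friendly order
        }
    }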
>
> It's worthwhile to break your databases into chunks which are small
> enough for the entire thing to be cached on your compute nodes; until
> recently, we always broke nucleotide databases into 800 MB chunks.
> Of course, care then needs to be taken to get the statistics right
> when running lots of sequences against the individual chunks. If it
> fits your requirements, the automatic slicing that both blast
> flavours can do might work for you, but we do it manually.
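
On the statistics point, one way to keep the E-values comparable across
chunks is to tell each per-chunk search how big the full database is, e.g.
with blastall's -z (effective database length) or, I think, Z= on the
WU-BLAST command line. A rough sketch, with a made-up $full_db_letters
holding the total number of letters in the complete, unchunked database
(do check the flags against your own blastall documentation):

    foreach $db (@db_chunks) {            # hypothetical list of per-chunk database names
        foreach $seq (@sequences) {
            # -z: effective database length, set to the full database size
            # so E-values are consistent across the individual chunks
            system("blastall -p blastn -d $db -i $seq -z $full_db_letters");
        }
    }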
>
>> Use of more threads requires more working (heap) memory for the search,
>> making less memory available to cache database files. If the
>> database files
>> aren't cached, more threads means more terribly slow disk head
>> seeking as
>> the different threads request different pieces of the database. If
>> heap
>> memory expands beyond the physical memory available, the system will
>> thrash.
>> With WU-BLAST, multiple threads are used by default, but if memory
>> is found
>> to be limiting, the program automatically reduces the number of threads
>> employed, to avoid thrashing.
>
>
> That's sensible - I didn't know it did that.
>
> Tim
>
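
(As an aside on the threads point: as far as I know the NCBI binaries don't
throttle themselves that way, so on memory-tight nodes you would cap the
thread count yourself, e.g. with blastall's -a option for the number of
processors, or cpus=N on the WU-BLAST command line; starting at -a 1 or
-a 2 on 2 GB machines and watching for swap seems a reasonable rule of
thumb.)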