[Bioclusters] Grid BLAST

Tim Cutts tjrc at sanger.ac.uk
Mon Sep 26 04:31:13 EDT 2005


On 24 Sep 2005, at 7:40 pm, Warren Gish wrote:

>> Hi, I'm the administrator of the bioinformatics laboratory at Université
>> du Québec à Montréal.  I have a room filled with dual P4 3GHz
>> workstations.  The boxen are dual booted with Windows and GNU/Linux but
>> they spend most of their time on GNU/Linux.  Each box has 2 GB of RAM
>> so I expected decent performance with local BLAST jobs but the sad
>> truth is that my jobs run about 4 times slower with blast2 than
>> with blastcl3 with the same parameters.  The hard drive is IDE so I
>> suspect a bottleneck here.
>>
> Make sure the IDE drivers are configured to use DMA I/O, but if repeat
> searches of a database are just as slow as the first time it is searched,
> then experience indicates the problem is that the amount of free memory
> available is insufficient to cache the database files.  Database file
> caching is a tremendous benefit for blastn searches.  If your jobs use
> too much heap memory, though, no memory may be available for file caching.

I often see caching problems caused by the way people have written their
pipeline code, too; people naturally tend to write things like:

# The database changes on every pass through the inner loop
foreach my $seq (@sequences) {
     foreach my $db (@databases) {
         system("blastn ...");
     }
}

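which is, of course, exactly the wrong way round: each blastn call reads
a different database from the one before, which guarantees trashing the
disk cache every single time.  The database loop belongs on the outside.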
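
A minimal sketch of the cache-friendly ordering, for contrast: with the
database loop on the outside, each database is read from disk once and
can stay cached while every query runs against it.  The helper name
build_jobs and the file names are made up for illustration, and the
blastn arguments stay elided as in the snippet above; the sketch only
builds and prints the job order rather than calling system().

```perl
use strict;
use warnings;

# Hypothetical helper: return [db, seq] pairs in cache-friendly order.
# All queries for one database are grouped together, so the database
# files need to be pulled off disk once per database, not once per query.
sub build_jobs {
    my ($databases, $sequences) = @_;
    my @jobs;
    foreach my $db (@$databases) {          # outer loop: database
        foreach my $seq (@$sequences) {     # inner loop: query
            push @jobs, [$db, $seq];
        }
    }
    return @jobs;
}

# Placeholder names, not real files.
my @databases = ('chunk1', 'chunk2');
my @sequences = ('q1.fa', 'q2.fa', 'q3.fa');

foreach my $job (build_jobs(\@databases, \@sequences)) {
    my ($db, $seq) = @$job;
    # A real pipeline would run the search here, e.g. system("blastn ...");
    print "search $seq against $db\n";
}
```

Where the BLAST flavour in use accepts a query file holding many
sequences, a single invocation per database goes one step further and
removes the inner loop altogether.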

It's worthwhile to break your databases into chunks which are small
enough for a whole chunk to be cached on your compute nodes; until
recently, we always broke nucleotide databases into 800 MB chunks.
Of course, care then needs to be taken to get the statistics right
when running lots of sequences against the individual chunks, since
the E-values BLAST reports depend on the size of the database searched.
If it fits your requirements, the automatic slicing that both blast
flavours can do might work for you, but we do it manually.
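
One way to handle the statistics after the fact, sketched under a
simplifying assumption: the E-value scales linearly with the length of
the database searched, so a value reported against one chunk can be
multiplied by (total length / chunk length).  The helper name
rescale_evalue and the numbers are made up for illustration, and the
sketch ignores the edge-effect corrections BLAST applies to effective
lengths, so treat the result as an approximation.

```perl
use strict;
use warnings;

# Hypothetical helper: rescale an E-value reported against one chunk
# to the size of the whole database.  Assumes E is proportional to the
# database length in residues; effective-length corrections are ignored.
sub rescale_evalue {
    my ($e_chunk, $chunk_len, $total_len) = @_;
    die "chunk length must be positive\n" unless $chunk_len > 0;
    return $e_chunk * $total_len / $chunk_len;
}

# Made-up numbers: a hit with E = 1e-5 against a chunk of 800 million
# bases, taken from a database of 4000 million bases in total.
my $e_full = rescale_evalue(1e-5, 800e6, 4000e6);
printf "E-value against the full database: %.1e\n", $e_full;
```

The alternative is to tell BLAST the size of the full database up
front, if the flavour in use has an option for that, so that the
reported values need no correction afterwards.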

> Use of more threads requires more working (heap) memory for the search,
> making less memory available to cache database files.  If the database
> files aren't cached, more threads means more terribly slow disk head
> seeking as the different threads request different pieces of the
> database.  If heap memory expands beyond the physical memory available,
> the system will thrash.  With WU-BLAST, multiple threads are used by
> default, but if memory is found to be limiting, the program automatically
> reduces the number of threads employed, to avoid thrashing.

That's sensible - I didn't know it did that.

Tim

-- 
Dr Tim Cutts
Informatics Systems Group, Wellcome Trust Sanger Institute
GPG: 1024D/E3134233 FE3D 6C73 BBD6 726A A3F5  860B 3CDD 3F56 E313 4233


