[Bioclusters] Optimal database fragments in mpiBLAST

Lucas Carey bioclusters@bioinformatics.org
Fri, 14 Nov 2003 11:21:12 -0500


As a baseline, you want to use the same number of fragments as you have cpus, minus 1 for the master. With 12 cpus that means 11 fragments, assuming that each one is < 400MB.
This is because there is a large penalty for each additional fragment in both merging (should be insignificant overall) and searching. This is NCBI-BLAST specific, not just mpiBLAST specific, and you can see it simply by splitting the database up using the standard formatdb and running blast on it on a single cpu with blastall.
However, different fragments can take wildly different times to process due to sequence complexity. Match extension is very expensive, and how often it happens is obviously sequence specific. A larger number of fragments allows mpiBLAST to load balance dynamically: as each worker finishes a fragment, which can happen much sooner for low sequence-complexity fragments than for high ones, mpiBLAST gives it a new fragment. If you were to split your database up into 100 fragments, you would incur a large penalty for splitting up the database (I forget the exact numbers, I can find them if you're interested), but load balancing would be fairly close to perfect. The relative penalty increases as the fragment size decreases, so that's a second reason why you don't want your fragment size too small.
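To make the trade-off concrete, here is a toy simulation in Python. It is only a sketch: the function name, the fragment times and the overhead value are all made up for illustration, and it assumes the simplest possible policy (whichever worker is free takes the next fragment), which may not match mpiBLAST's real scheduler in detail.

```python
import heapq

def simulate_makespan(fragment_times, n_workers, overhead=0.0):
    """Return the time at which the last worker finishes.

    fragment_times: hypothetical search time for each fragment.
    overhead: hypothetical fixed cost added to every fragment.
    Assumed policy: the worker that becomes free first takes the
    next fragment in the list.
    """
    # Min-heap of the times at which each worker becomes free.
    free_at = [0.0] * n_workers
    heapq.heapify(free_at)
    for t in fragment_times:
        start = heapq.heappop(free_at)  # earliest-free worker
        heapq.heappush(free_at, start + t + overhead)
    return max(free_at)

# Same total work (12 units) on 2 workers, split two ways.
coarse = [10, 2]                        # one slow fragment, one fast
fine = [5, 5, 0.5, 0.5, 0.5, 0.5]       # each fragment split further

print(simulate_makespan(coarse, 2))
print(simulate_makespan(fine, 2))
print(simulate_makespan(fine, 2, overhead=0.25))
```

With these made-up numbers the finer split finishes sooner even after paying the per-fragment overhead, but if you raise the overhead or shrink the fragments enough the result flips, which is the balance described above.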
The final answer depends on the size of your queries and on the size and sequence-complexity variation of your database. If all fragments had approximately the same number of hits and extensions, your best bet would probably be 11 fragments, assuming that will fit in memory. For a real-world situation, I'd go with 3xWorkers to start out (33 fragments on your setup). If you run mpiBLAST with --debug you can see what's going on, and you can compare the times at which the workers finish to get an idea of how well they're being load balanced.
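The arithmetic behind those two starting points, as a small Python sketch. The function name and the 3300 MB database size are hypothetical, picked only so the numbers come out round; plug in your own database size.

```python
def suggest_fragments(n_cpus, db_size_mb, per_worker=3):
    """Return (workers, fragments, fragment size in MB).

    One cpu is reserved for the master; per_worker=3 is the
    "3xWorkers" starting point, per_worker=1 the baseline.
    """
    workers = n_cpus - 1
    fragments = per_worker * workers
    return workers, fragments, db_size_mb / fragments

# 12 cpus as reported by SGE; the 3300 MB database size is made up.
for per_worker in (1, 3):
    workers, fragments, size_mb = suggest_fragments(12, 3300, per_worker)
    print(f"{workers} workers, {fragments} fragments, {size_mb:.0f} MB each")
```

Check the resulting fragment size against what fits in memory on your nodes before settling on a count.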
Hope this helps
-Lucas

On Fri, Nov 14, 2003 at 09:30:55PM +0530, Malay Kumar Basu wrote:
> Hello Gurus:
> 
> Master - cpu 2 Xeon with hyperthreading 2 GB RAM
> 4 x nodes each - cpu 1 P4 hyperthreading 1 GB RAM
> 
> SGE recognizes total 12 cpus.
> 
> When hyperthreading on the whole setup can have 12 cpus, otherwise 6. 
> What should be the optimal BLAST database fragments for mpiBLAST?
> 
> Malay
> 