[Bioclusters] Condor cluster and BLAST

Wed Jan 25 15:45:09 EST 2006

>> I am in a similar position with a new cluster in our lab. I am  
>> still very early in the learning curve and your comments for Sandy  
>> are very helpful. We have an all vs all Blast job that we would
>> like to run. I am trying to get a handle on how best to run this

I heartily agree with Dave Adelson's comments.  My only additional  
caveat is to know what you're trying to accomplish biologically  
before throwing computation at the problem.  Are these ESTs that  
you're trying to contig?  A gene set from an organism in which you're  
looking for paralogs?  A chunked-up whole chromosome?

BioPerl is a great set of tools for scripting just about any search
you might want to do.  The real trick is picking the tool that's
appropriate to the biological question at hand.  There are lots of
good tools out there.  For chromosome-scale searches, MegaBlast is an
excellent piece of software.

My advice in terms of BLAST speed is to get an estimate first:   
Format your sequence set as a target (using formatdb) and then run a  
search on a single machine against the first 10 sequences in your  
dataset.  Do the math and figure out how many CPU hours you're up  
against.  If you can get it done with an hour of work using vi,  
followed by an overnight run, there's no reason to spend a week  
writing a comprehensive solution.
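
To make the "do the math" step concrete, here is a rough sketch in
Python.  The function names and the numbers in the example are made up
for illustration.  It assumes you timed the sample search yourself
(with `time` or similar), that the first few sequences are typical of
the rest, and that run time grows roughly linearly with the number of
query sequences while the formatted target stays the same size.

```python
def first_n_fasta_records(lines, n):
    """Yield the lines belonging to the first n FASTA records.

    `lines` is any iterable of text lines, e.g. an open file handle.
    A record starts at a line beginning with '>'.
    """
    seen = 0
    for line in lines:
        if line.startswith(">"):
            seen += 1
            if seen > n:
                break
        if seen > 0:  # skip anything before the first header
            yield line


def estimate_cpu_hours(sample_seconds, sample_size, total_sequences):
    """Extrapolate total CPU hours from a timed sample search.

    Assumes cost is linear in the number of queries, which is only a
    first approximation.
    """
    seconds_per_query = sample_seconds / sample_size
    return seconds_per_query * total_sequences / 3600.0


def wall_clock_hours(cpu_hours, n_cpus):
    """Best-case wall-clock time if the work splits evenly across CPUs."""
    return cpu_hours / n_cpus


if __name__ == "__main__":
    # Hypothetical numbers: 10 queries took 120 seconds, 30000 in total.
    cpu_hours = estimate_cpu_hours(120.0, 10, 30000)
    print("Estimated CPU hours:", cpu_hours)
    print("On 8 CPUs, at best:", wall_clock_hours(cpu_hours, 8), "hours")
```

If the estimate comes out at a few CPU-days, splitting the query file
into a handful of chunks by hand is probably all the infrastructure
you need.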

> We are mainly interested in high homology hits

A nitpicky point:  homology is evolutionary relationship.  It's  
descent from a common ancestor.  Homology is therefore a boolean  
(true or false) sort of property.  A high degree of sequence  
similarity is frequently an indicator of homology, although it's  
neither necessary nor sufficient.

-Chris Dwan