[BiO BB] Find common regions in 3 organisms

Thu Sep 17 07:09:01 EDT 2009

----------------------------------------
> Date: Thu, 17 Sep 2009 11:25:58 +0100
> From:
> To: bbb at bioinformatics.org
> Subject: Re: [BiO BB] Find common regions in 3 organisms
>
> 2009/9/17 Nevan King :
>> Hi,
>>
>> This question has probably been asked, but I'm not sure what search
>> terms to use to find answers. This is a question from one of the
>> researchers in my lab.
>>
>> I want to find common regions of sequences in 3 organisms. The first
>> organism (P. gingivalis) has been fully sequenced and described. It
>> has around 2000 genes. The other two are similar to P. gingivalis.
>>
>> I've set up all three organisms in Blast, but comparing the genes one
>> by one would be a big task. What's the best way to automate this? I
>> understand that you can enter a list of fastas into blast and it will
>> compare each one to all the genes in its database. Is there a way to
>> do this with 3 organisms? Is Blast the best tool to use for this job?
>>
>> Sorry if this is short on details, I don't fully understand the topic.
>
> Often the answer to this sort of question is 'there is more than one
> way to do it', and the way that you use usually depends on what you
> want to see...
>
> I would suggest something like this:
>
> 1) blast all genes of organism A against organism B and vice versa
> (as described above).
>
> 2) Pick 'orthologues' using the 'reciprocal best hits' method (i.e. if
> gene Ax' and gene Bx'' both find each other as the 'top blast hit' in
> the respective organism's gene list, call them an orthologous pair).
>
> 3) Repeat step 1 and 2, but use organism A and C instead of A and B.
>
> 4) Pick 'orthologues' when Ax' and Bx'' are an orthologous pair AND Ax'
> and Cx''' are an orthologous pair.
>
> 5) er... do you need to do B vs. C?
>
>
>> Thanks
>>
>> Nevan.
>
> Once you get the above blast results (A vs. B, A vs. C, B vs. C and
> vice versa) into a database, you will have more than enough data to
> play with. You can then define orthologues however you like.
>
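
As a rough illustration of the reciprocal-best-hits steps quoted above,
here is a minimal Python sketch. It assumes each BLAST run was saved in
the standard 12-column tabular format (query ID in column 1, subject ID
in column 2, bit score in column 12) and that "top hit" means highest
bit score. The function names are made up for this example and are not
part of BLAST; treat it as a starting point, not a finished pipeline.

```python
from typing import Dict, Iterable, List, Tuple


def best_hits(lines: Iterable[str]) -> Dict[str, str]:
    """Map each query ID to the subject ID of its highest-bit-score hit."""
    best: Dict[str, Tuple[float, str]] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, subject)
    return {query: subject for query, (_, subject) in best.items()}


def reciprocal_best_hits(fwd: Dict[str, str], rev: Dict[str, str]) -> Dict[str, str]:
    """Keep pairs where x's best hit is y AND y's best hit is x."""
    return {x: y for x, y in fwd.items() if rev.get(y) == x}


def three_way(ab: Dict[str, str], ba: Dict[str, str],
              ac: Dict[str, str], ca: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Genes of A with a reciprocal best hit in both B and C (step 4)."""
    rbh_ab = reciprocal_best_hits(ab, ba)
    rbh_ac = reciprocal_best_hits(ac, ca)
    return [(a, rbh_ab[a], rbh_ac[a]) for a in sorted(rbh_ab) if a in rbh_ac]
```

To use it, open each of the four result files (A vs. B, B vs. A,
A vs. C, C vs. A), pass the file objects to best_hits(), and hand the
four resulting dictionaries to three_way(). Whether B vs. C should also
be required to agree is the open question in step 5 and is not checked
here.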

I guess if you need speed and have a limited number of comparisons like
this, it may make more sense to index the 3 genomes and work from there.
I don't remember exactly how BLAST works, but indexing like this is a
common approach to speeding things up and gives you a lot of flexibility
for tweaking. If there are interesting exact matches in distant places,
you can find them more easily this way. I think this approach has been
mentioned here before under terms like "string matching" or "exact
strings". I wrote something like this and tested it on strains of, IIRC,
E. coli; while not production quality, it seemed to find regions of
difference and SNPs pretty fast. Once you have string index tables, you
can decide which keys are "low complexity" or otherwise uninteresting.
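
To make the indexing idea concrete, here is a small Python sketch. It is
not the code I tested; it just assumes a fixed word length k, builds a
table of every k-length substring and its start positions for each
genome, and keeps the keys present in all of them. The function names
and the low-complexity rule (two or fewer distinct bases) are
illustrative assumptions, and reverse-complement matches are ignored.

```python
from collections import defaultdict
from typing import Dict, List


def kmer_index(seq: str, k: int) -> Dict[str, List[int]]:
    """Map each length-k substring to the list of positions where it starts."""
    index: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def shared_kmers(genomes: Dict[str, str], k: int) -> Dict[str, Dict[str, List[int]]]:
    """Return k-mers found in every genome, with their positions per genome."""
    indexes = {name: kmer_index(seq, k) for name, seq in genomes.items()}
    common = set.intersection(*(set(ix) for ix in indexes.values()))
    return {kmer: {name: indexes[name][kmer] for name in genomes}
            for kmer in common}


def is_low_complexity(kmer: str, max_distinct: int = 2) -> bool:
    """Crude filter (an assumption): flag keys built from very few distinct bases."""
    return len(set(kmer)) <= max_distinct


if __name__ == "__main__":
    # Toy sequences, made up for the example.
    toy = {"A": "ACGTACGGA", "B": "TTACGGAC", "C": "GGACGGAT"}
    for kmer, positions in sorted(shared_kmers(toy, 4).items()):
        if not is_low_complexity(kmer):
            print(kmer, positions)
```

For real genomes a Python dictionary of every k-mer gets large quickly,
which is where the memory considerations below come in.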

Besides algorithmic order issues (N x M), there are also issues with
memory access patterns. If you know a priori how you will access the
data and have finite memory (even with multi-gigabyte RAM the cache size
can be limited), you stand to make things a lot faster. So, if this
will be a recurring task, you may want to write or get some short C++
string utilities with data structures designed for your needs.

Of course you can put arbitrary sequences into ClustalW too...

Mike Marchywka
586 Saint James Walk
Marietta GA 30067-7165
415-264-8477 (w)<- use this
404-788-1216 (C)<- leave message
989-348-4796 (P)<- emergency only
marchywka at hotmail.com
Note: If I am asking for free stuff, I normally use for hobby/non-profit
information but may use in investment forums, public and private.
Please indicate any concerns if applicable.
Note: hotmail is censoring incoming mail using random criteria beyond my control and often hangs my browser
but all my subscriptions are here..., try also marchywka at yahoo.com
