[BiO BB] ortholog

Mike Marchywka marchywka at hotmail.com
Wed Sep 12 15:49:14 EDT 2007

>I'm trying to find orthologs for thousands of genes from
>ENSEMBL. It seems hard to find a clear and direct way to achieve it. Any
>suggestion is welcome (or you can suggest a better database for orthologs).
I'm not sure if these are competitive yet with the web-based tools, but I'm
writing a bunch of scripts for automated search and analysis. If anyone cares
to comment on strengths or limitations of existing tools, it may help me fill
some voids (and make these things useful to others).
Essentially everything uses the NCBI eutils facilities, supplemented with some
local databases or rules. For example, I have a bunch of scripts to find a
contig in the dog genome and put it into something called ex_fasta. Then I
bury a bunch of blast searches and text processing into a one-liner (the
option names are a bit odd because I make them up out of prior combinations as
needed, in a task-specific way):

$progpath/findhomologues -de_novo_stuff ex_fasta

The above also creates a bmp file with a bunch of annotations and clustalw 
alignments between blast hits to various databases including some local 
repeat and probe collections.
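
For readers who have not used eutils directly, here is a minimal Python sketch
of the kind of request that could fetch a contig slice as FASTA. It is not the
code behind findhomologues. The function name is hypothetical, and the
accession and coordinates are simply copied from the contig header that appears
in the stats output later in this post, on the assumption that they are contig
coordinates. The efetch parameters used (db, id, rettype, retmode, seq_start,
seq_stop) are standard E-utilities parameters.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_fasta_url(accession, start=None, stop=None, db="nuccore"):
    """Build an efetch URL that returns a nucleotide record as FASTA text.

    Hypothetical helper for illustration; start/stop are 1-based coordinates
    and are only sent when both are given.
    """
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    if start is not None and stop is not None:
        params["seq_start"] = start
        params["seq_stop"] = stop
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Accession and range assumed from the header shown later in the post.
    url = efetch_fasta_url("NW_876253.1", 47189155, 47195387)
    print(url)
    # To download, something like urllib.request.urlopen(url).read() could be
    # saved to a file such as ex_fasta (not executed here).
```

Building the URL separately from fetching it makes it easy to log or replay
the exact query when a batch run misbehaves.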
Right now, I'm adding a rule-based alignment and annotation system. I've got a
collection of Perl regex patterns in an XML file, along with biblio info (where
each one came from, etc.), that I can parse into something simple:

./yaxml -parse rule_source.xml -rules > algn_rules

$ cat algn_rules
ATG >rule|1|DNA Start Codon
(?<=TATA.*)(GT.*?AT)(?=.*ATAAA) >rule|4|DNA Composite Introns
ATG(...)*?(TAG|TAA|TGA) >rule|5|DNA Euk ORF
MGSGSSS >rule|9|PEPTIDE N-myristoylation pattern
[CA](AG|GTA|GTG)AGT >rule|10|DNA? splice donor
[CT]+[A-Z][CT]A{0,1}G >rule|11|DNA? splice acceptor
N[^P][ST][^P] >rule|14|PEPTIDE Glycosylation site
[ST].N. >rule|15|PEPTIDE Glycosylation site
Y..[LI].{6,8}Y..[LI] >rule|16|PEPTIDE ITAM,Fc cytoplasmic tail
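
To make the file format concrete, here is a minimal Python sketch that parses
lines of the form "PATTERN >rule|ID|KIND description" and scans a sequence with
them. It is an illustration only, not how rules_annotater works, and all
function names are hypothetical. One caveat: Python's re module rejects
variable-length lookbehind, so rule 4 as written above does not compile there;
the sketch reports such rules as skipped instead of failing.

```python
import re

# A few rules copied verbatim from the algn_rules listing.
RULES_TEXT = """\
ATG >rule|1|DNA Start Codon
(?<=TATA.*)(GT.*?AT)(?=.*ATAAA) >rule|4|DNA Composite Introns
MGSGSSS >rule|9|PEPTIDE N-myristoylation pattern
N[^P][ST][^P] >rule|14|PEPTIDE Glycosylation site
"""


def parse_rules(text):
    """Turn each 'PATTERN >rule|ID|KIND description' line into a dict."""
    rules = []
    for line in text.splitlines():
        pattern, sep, header = line.strip().rpartition(" >rule|")
        if not sep:
            continue  # blank or malformed line
        rule_id, _, rest = header.partition("|")
        kind, _, description = rest.partition(" ")
        rules.append({"id": int(rule_id), "kind": kind,
                      "description": description, "pattern": pattern})
    return rules


def compile_rules(rules):
    """Compile patterns; return (usable, skipped) instead of raising."""
    usable, skipped = [], []
    for rule in rules:
        try:
            usable.append((rule, re.compile(rule["pattern"])))
        except re.error:
            skipped.append(rule)  # e.g. variable-length lookbehind
    return usable, skipped


def find_hits(sequence, compiled, kind_prefix):
    """Return (rule id, start, end) for rules whose kind matches the prefix."""
    hits = []
    for rule, regex in compiled:
        if not rule["kind"].startswith(kind_prefix):
            continue
        for m in regex.finditer(sequence):
            hits.append((rule["id"], m.start(), m.end()))
    return hits


if __name__ == "__main__":
    compiled, skipped = compile_rules(parse_rules(RULES_TEXT))
    print("skipped rules:", [r["id"] for r in skipped])
    print(find_hits("CCATGAAATGA", compiled, "DNA"))
    print(find_hits("MNGSA", compiled, "PEPTIDE"))
```

Filtering on the KIND field keeps peptide patterns from being run against
nucleotide sequence, where short patterns would produce meaningless hits.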

And use them as alignment cues:

$progpath/rules_annotater -clean -which 1 -fastas o2_fasta -rules $progpath/align_rules > r3nunu2

That then output in text or graphical bmp files either alignments of just 

$ $progpath/mm_align_tool -fastas o2_fasta -rules r3nunu -rules r3nunu2 -use_rule 4 -stats -align -output notes
For Rules set 0:>ref|NW_876253.1|Cfa11_WGA39_2:47189155-47195387 Canis familiaris chromosome 11 genomic contig, whole genome shotgun sequence
388        >rule|2|DNA Stop Codon
344        >rule|11|DNA? splice acceptor
189        >rule|4|DNA Composite Introns
128        >rule|1|DNA Start Codon
58         >rule|5|DNA Euk ORF
34         >rule|6|DNA Euk spliced ORF
15         >rule|12|DNA? polyadenlyation signal
6          >rule|3|DNA TATA box
3          >rule|10|DNA? splice donor
For Rules set 1:>gb|AACN010493556.1|:1-1146 Canis familiaris whole genome shotgun sequence
72         >rule|11|DNA? splice acceptor
60         >rule|2|DNA Stop Codon
24         >rule|1|DNA Start Codon
21         >rule|4|DNA Composite Introns
6          >rule|5|DNA Euk ORF
5          >rule|6|DNA Euk spliced ORF
1          >rule|10|DNA? splice donor
1          >rule|12|DNA? polyadenlyation signal
1          >rule|3|DNA TATA box
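
A per-rule tally like the one above is easy to reproduce from a list of hits.
The following Python sketch is only an illustration of that counting step, not
the code in mm_align_tool. The function name is hypothetical, and the
11-character count column is inferred from the output shown.

```python
from collections import Counter


def format_rule_stats(labels):
    """Count hits per rule label and format them, most frequent first.

    `labels` holds one '>rule|ID|KIND description' string per hit. Ties keep
    the order in which the labels were first seen.
    """
    counts = Counter(labels)
    # Left-justify the count in an 11-character column, then the label.
    return [f"{n:<11}{label}" for label, n in counts.most_common()]


if __name__ == "__main__":
    hits = ([">rule|1|DNA Start Codon"] * 3
            + [">rule|3|DNA TATA box"]
            + [">rule|2|DNA Stop Codon"] * 5)
    for line in format_rule_stats(hits):
        print(line)
```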

I'm still debugging this, but the initial alignment with rules was about what
I expected; now I'm working on automating the analysis and interpretation.
I've also got a bunch of test scripts that, for example, grab two random and
distinct pieces of dog or human genome and try to align or otherwise "match"
them, which is handy as a control and for finding sequences that occur a lot.
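
The control idea can be sketched in a few lines of Python. This is a toy under
stated assumptions, not the actual test scripts: it works on an in-memory
string instead of a real genome, and it swaps in a shared k-mer fraction
(Jaccard index) as a crude stand-in for alignment. All names are hypothetical.

```python
import random


def two_distinct_windows(sequence, length, rng):
    """Pick two random, non-overlapping windows of the given length.

    Returns ((start_a, text_a), (start_b, text_b)) with start_b - start_a
    at least `length`, so the windows never overlap.
    """
    n = len(sequence)
    if n < 2 * length:
        raise ValueError("sequence too short for two non-overlapping windows")
    # Draw two offsets in the slack space, then shift the later one by length.
    x, y = sorted(rng.randrange(n - 2 * length + 1) for _ in range(2))
    a, b = x, y + length
    return (a, sequence[a:a + length]), (b, sequence[b:b + length])


def kmer_set(seq, k):
    """All distinct substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fraction(a, b, k=8):
    """Jaccard index of the two k-mer sets; 0.0 if both sets are empty."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


if __name__ == "__main__":
    rng = random.Random()
    # Stand-in for a genome: uniformly random bases.
    genome = "".join(rng.choice("ACGT") for _ in range(10000))
    (a_pos, a), (b_pos, b) = two_distinct_windows(genome, 500, rng)
    print(a_pos, b_pos, shared_kmer_fraction(a, b))
```

Repeating the draw many times gives a background distribution of scores;
window pairs that score well above it point at repeated sequence.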
