[BiO BB] Protein Datatypes for function prediction

Mike Marchywka marchywka at hotmail.com
Sat Jul 28 11:08:04 EDT 2007


>rapidly, don't dismiss even simple text processing.
[...]
>( a few thousand, enough to cluster perhaps)


I actually tried this with osteoglycins. If you download them (there aren't
that many), pick out repeated "words", and cluster by presence or absence of
the most popular words, it turns out to do a decent automated job of
separating by species.
Below are the vectors (presence/absence of each word) along with the members
having that vector (names could be ambiguous, for illustration only). I was
hoping it would separate by type, but that is a problem with using the most
common words to discriminate.
The zero vector amounts to a "miscellaneous" cluster.
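
A minimal sketch of the idea in sh/awk, not the script that produced the
listing in this post. It assumes a "word" is a fixed-length k-mer and that
"popular" means present in the largest number of sequences; the post does not
say how words were defined or ranked. The function names `word_vectors` and
`group_by_vector` and the file name `osteo.fasta` are hypothetical.

```shell
# word_vectors K N: read FASTA on stdin, print "<bit vector> <sequence id>".
# ASSUMPTION: a "word" is any K-letter substring; only words that occur in
# more than one sequence are candidates.
word_vectors() {
  awk -v k="$1" -v n="$2" '
    /^>/ { id = substr($1, 2); order[++nseq] = id; next }
    { seq[id] = seq[id] $0 }
    END {
      # Count, for every word, how many sequences contain it.
      for (i = 1; i <= nseq; i++) {
        s = seq[order[i]]
        for (p = 1; p + k - 1 <= length(s); p++) {
          w = substr(s, p, k)
          if (!((i, w) in has)) { has[i, w] = 1; df[w]++ }
        }
      }
      # Pick the n most widespread words (ties broken alphabetically).
      for (j = 1; j <= n; j++) {
        best = ""
        for (w in df)
          if (df[w] > 1 && !(w in used) &&
              (best == "" || df[w] > df[best] ||
               (df[w] == df[best] && w < best)))
            best = w
        if (best == "") break
        used[best] = 1; words[++nw] = best
      }
      # One bit per chosen word, most widespread word first.
      for (i = 1; i <= nseq; i++) {
        v = ""
        for (j = 1; j <= nw; j++)
          v = v (((i, words[j]) in has) ? "1" : "0")
        print v, order[i]
      }
    }'
}

# group_by_vector: collect the ids that share exactly the same vector.
group_by_vector() {
  awk '{
         if (!($1 in members)) order[++n] = $1
         members[$1] = members[$1] " " $2
       }
       END { for (i = 1; i <= n; i++) print order[i] ":" members[order[i]] }'
}

# Usage (hypothetical file name; 40 matches the vector length shown here):
#   word_vectors 4 40 < osteo.fasta | group_by_vector
```

Grouping by exact match keeps the sketch simple. Sequences that share none of
the chosen words all get the zero vector, which is why that group collects
unrelated entries.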

$ for f in `cat osteo_groups | awk '{print $2}'`; do
    echo $f
    g=`grep $f osteo_vectors | awk '{print $1}' | sed -e 's/>//'`
    echo $g
    h=`echo $g | sed -e 's/\..*//g' | sed -e 's/  */\\\|/g'`
    grep -A 2 "$h" osteo_rdict | grep "DEFINITION" | sed -e 's/DEFINITION//'
  done | unix2dos >/dev/clipboard

1111111111111111111111111011111111110001
CAI16694 AAH95443 AAH37273 NP_148935 NP_054776 ABM85338 ABM82153 EAW62820 
EAW62819 EAW62818 P20774 CAB53706
  osteoglycin [Homo sapiens].
  Osteoglycin [Homo sapiens].
  Osteoglycin [Homo sapiens].
  osteoglycin preproprotein isoform 2 [Homo sapiens].
  osteoglycin preproprotein isoform 2 [Homo sapiens].
  osteoglycin (osteoinductive factor, mimecan) [synthetic construct].
  osteoglycin (osteoinductive factor, mimecan) [synthetic construct].
  osteoglycin (osteoinductive factor, mimecan), isoform CRA_a [Homo
  osteoglycin (osteoinductive factor, mimecan), isoform CRA_a [Homo
  osteoglycin (osteoinductive factor, mimecan), isoform CRA_a [Homo
  Mimecan precursor (Osteoglycin) (Osteoinductive factor) (OIF).
  hypothetical protein [Homo sapiens].
1111111011111000100001011101001000001110
NP_032786 EDL41086 AAH21939 BAA06721 Q62000 BAE35995 BAC35462
  osteoglycin [Mus musculus].
  osteoglycin [Mus musculus].
  Osteoglycin [Mus musculus].
  osteoglycin precursor [Mus musculus].
  Mimecan precursor (Osteoglycin).
  unnamed protein product [Mus musculus].
  unnamed protein product [Mus musculus].
0000000000000000000000000000000000000000
CAK03681 NP_002336 O42235 NP_032464 NP_989507 NP_033885
  novel protein similar to vertebrate osteoglycin (osteoinductive
  lumican precursor [Homo sapiens].
  Keratocan precursor (KTN) (Keratan sulfate proteoglycan keratocan).
  keratocan [Mus musculus].
  keratocan [Gallus gallus].
  bone morphogenetic protein 1 [Mus musculus].
1101111111100000001001011100001000001110
EDL98110 XP_001054654 XP_001054599 XP_001054725 XP_214441
  osteoglycin (predicted) [Rattus norvegicus].
  PREDICTED: similar to Mimecan precursor (Osteoglycin) isoform 2
  PREDICTED: similar to Mimecan precursor (Osteoglycin) isoform 1
  PREDICTED: similar to Mimecan precursor (Osteoglycin) isoform 3
  PREDICTED: similar to Mimecan precursor (Osteoglycin) [Rattus
0000001000100110000100000000000000000000
NP_989540 AAD21085 Q9W6H0 Q9DE65
  osteoglycin [Gallus gallus].
  osteoglycin [Gallus gallus].
  Mimecan precursor (Osteoglycin).
  Mimecan precursor (Osteoglycin).
1111111111111111111111111011011111110001
AAP97142 Q5RBL2 CAH90848
  osteoglycin OG [Homo sapiens].
  Mimecan precursor (Osteoglycin).
  hypothetical protein [Pongo pygmaeus].
1110101111110110010001011001011111111110
NP_001075585 AAM46865 Q8MJF1
  osteoglycin [Oryctolagus cuniculus].
  osteoglycin [Oryctolagus cuniculus].
  Mimecan precursor (Osteoglycin).
1110111111100011110110111111011001100010
ABQ13007 P19879
  osteoglycin preproprotein [Bos taurus].
  Mimecan precursor (Osteoglycin) [Contains: Corneal keratan sulfate
1110011111100011110110111111001001100010
NP_776371 AAB70264
  osteoglycin [Bos taurus].
  mimecan [Bos taurus].
1111111111111111111111011011011111110000
NP_077727
  osteoglycin preproprotein isoform 1 [Homo sapiens].
1111011111101111111110111011001111100001
XP_001103337
  PREDICTED: osteoglycin isoform 2 [Macaca mulatta].
1111011111101111111110011011001111100000
XP_001103195
  PREDICTED: osteoglycin isoform 1 [Macaca mulatta].
1110111111110011010001111110011001010000
ABL96619
  osteoglycin [Capra hircus].
1101011111110010010111011100000001110110
XP_853340
  PREDICTED: similar to Mimecan precursor (Osteoglycin)
1100000111011110110111000011000101110000
CAB61417
  hypothetical protein [Homo sapiens].
1011111000100111011000011000011110000000
CAI16695
  osteoglycin [Homo sapiens].
1011111000100111001000111000011010000001
AAX25979
  SJCHGC07866 protein [Schistosoma japonicum].
0000001000100000000000000000000000000000
NP_001080164
  osteoglycin [Xenopus laevis].
0000000110000000000000000000000000000000
CAJ57655
  osteoglycin [Sus scrofa].
0000000000100100000000000000000000000000
XP_001512743
  PREDICTED: similar to osteoglycin preproprotein [Ornithorhynchus
0000000000000001000000100000000000000001
AAD40453
  mimecan [Homo sapiens].
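
The groups above are exact matches, yet several vectors differ in only a bit
or two: the two Bos taurus vectors differ in 2 of 40 positions, and the
Homo sapiens/Pongo pygmaeus vector differs from the large Homo sapiens one in
1. A small sketch for measuring that, assuming Hamming distance is an
acceptable similarity measure here (the post does not use one); the function
name `hamming` is hypothetical.

```shell
# hamming A B: number of positions at which two equal-length bit strings
# differ. Prints nothing and exits with status 2 if the lengths differ.
hamming() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    if (length(a) != length(b)) exit 2
    for (i = 1; i <= length(a); i++)
      if (substr(a, i, 1) != substr(b, i, 1)) d++
    print d + 0
  }'
}

# The two Bos taurus vectors from the listing:
hamming 1110111111100011110110111111011001100010 \
        1110011111100011110110111111001001100010
```

Merging groups whose distance falls under some small threshold would be one
way to fold such near-duplicates together; what threshold is sensible would
have to be checked against the data.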












