[BiO BB] help with bacterial protein sequence comparisons

Fri May 18 01:50:12 EDT 2012

replying to myself again ... (posts appear with a lag, when the  opinion
may have already changed)

ok, that's basically what blast is doing. I didn't know about blast.

_http://en.wikipedia.org/wiki/BLAST_ (http://en.wikipedia.org/wiki/BLAST) 
most cited paper in the 90s !

But I wanted a small program that I understand and can easily  manipulate
for different formats and additional features and without  
copyright,licence,
so I wrote my own.

For the original problem (as I understood it) - I would send the blast  
output to a file and filter the
output for representatives. There should be a tool for that, (I don't  know)
Mine is again selfwritten. Hmm, sequences that are 99% identical to the  
original can't
be only 90% identical to each other, so maybe I misunderstood.

In a message dated 18.05.2012 04:17:35 Westeuropäische Normalzeit,  
Sterten at aol.com writes:

thinking more about it  ...

task:
---------------------------------------
given  two  strings of amino-acids, S1 and S2.
Typically the lengths are  ~100-~million  bytes for S1 and
~100M-~4B bytes for S2, where e.g. S2  is the list of  all
bacterial amino-acid sequences from genbank,  joined  together.

Given threshold D , find all triples(s1,s2,L) of  lengths L  and
subsequences s1 of length L from S1 and subsequences s2  of 
length L  from S2 such that p-value(|s1-s2|) < D .
Typically  this is done for  several Ds simultaneously,
basically trying to find  the n best subsequences  for some given n.

Solve that task as  quickly as   possible.
-----------------------------------

algorithm   suggestion:
----------------------------------------------------
for  all  length=(4?) subsequences s1 from S1 make a table T[s1] of  
addresses where  that subsequence occurs in S1

walk through  all length=4 subsequences s2  of S2
(load S2 into a cyclic 256  byte-buffer)

whenever s2 is marked in  T (T[s2]>0), check the  extended subsequences 
x1 bytes forward, x2 bytes  backward from s2 ,  
enter it into the best(L) lists for the various lengths   L=x1+x2+4.
Print those that are   <tolerance
---------------------------------------------------

plot   the best #matches(L) (=% similarity) in a chart

this  misses  triples (s1,s2,L) that have no 4-length 100%-match
but are  good in total  length  nevertheless

----------------------------------------------------------------

does  this program exist ? What is "blast" doing ?

amino acid frequencies  in genbank bacterial files:
(I just calculated those - just in case someone  is interested)

L 76 219736963 
A 65 212665642 
G 71 164123234  
V 86 155933244  
E 69 132851172 
I 73 130723255 
S 83  128052651 
R 82 123992485  
D 68 118503645 
T 84 117688310 
K  75 105966220 
P 80  97716246  
F 70  84543510 
N  78  81056701 
Q 81  79888596 
Y  89  64269341 
M  77  51573988 
H 72  45239258 
W 87   26873940 
C  67  20419909 

X 88 175383   
_______________________________________________
BBB mailing  list
BBB at bioinformatics.org
http://www.bioinformatics.org/mailman/listinfo/bbb