[BiO BB] base counting

Thu Mar 16 05:35:57 EST 2006

Corné HW Klaassen wrote:

> Hi Peter,
> 
> Thanks for the quick reply. On paper this is exactly what I'm looking 
> for but ......I gave compseq a try and it doesn't seem to work on 
> features larger than 20 nt whereas I'm particularly interested in 
> features 40-140 nt (I realize that this can be a very computationally 
> intensive job). Any other suggestions? Is there perhaps something 
> similar for protein sequences or on some other arbitrary units?

Depends on what you are looking for.

For very long features you would need a lot of data to identify an unusual 
frequency, because any one particular long word is expected to occur very 
rarely.

Also, compseq needs a table with an entry for every possible n-mer (4^n for 
nucleotides), and that number is rather high by the time you reach 20 bases.
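
To put a number on that, here is a small sketch of the arithmetic. It assumes 
one table entry per possible word over a 4-letter alphabet; the function name 
is made up for illustration and nothing here reflects how compseq actually 
lays out its memory.

```python
def table_entries(word_size, alphabet_size=4):
    """Number of distinct words of a given size over the alphabet.

    Assumption: one table entry per possible word (A, C, G, T only).
    """
    return alphabet_size ** word_size


if __name__ == "__main__":
    # Show how quickly the table grows with word size.
    for n in (4, 10, 20, 40):
        print(f"{n:>2}-mers: {table_entries(n):,} possible words")
```

For proteins, the same function can be called with alphabet_size=20, where 
the growth is faster still.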

You could try a shorter word size and look for overlaps. In the E. coli case, 
CTAG is low, and you can also compare TAGA, TAGC, TAGG and TAGT to see which 
could be the less common 5mers.
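
As a rough illustration of the shorter-word approach, the sketch below counts 
overlapping words and then tabulates the one-base extensions of a chosen word. 
It is plain Python, not EMBOSS; the function names and the toy sequence are 
invented for the example, and extending only on the 3' side is just one way 
to look at overlaps.

```python
from collections import Counter


def count_words(seq, k):
    """Count every overlapping word of length k in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def extensions(seq, word):
    """Counts of word + A/C/G/T, i.e. the words one base longer."""
    counts = count_words(seq, len(word) + 1)
    # Counter returns 0 for words that never occur.
    return {word + base: counts[word + base] for base in "ACGT"}


if __name__ == "__main__":
    toy = "CTAGACTAGACTAGC"  # made-up sequence, not real E. coli data
    print(count_words(toy, 4)["CTAG"])
    print(extensions(toy, "CTAG"))
```

On a real genome you would compare these counts with what the base 
composition predicts, rather than reading the raw numbers directly.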

EMBOSS also has:

wordcount, which reports the most frequent words of a given size. The memory 
used by wordcount depends on the size of the input (it works through all 
n-mers that actually appear, which would be close to 1 per base of input).

polydot, which plots word matches between 1 or more sequences and can report 
their locations. Frequent nmers show up readily off the main diagonal.

Looking at the wordcount output, it would be useful to set a minimum 
occurrence - at present it reports every word, including those that appear 
only once. For 40mers that means the output is about 40 times the original 
input length. I will do this for the next EMBOSS release!
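
In the meantime, the idea of a minimum occurrence can be sketched in a few 
lines of Python. This stores only the words that actually occur, so memory 
grows with the input rather than with 4^n. The function and the min_count 
parameter are hypothetical names for this example and are not wordcount 
options.

```python
from collections import Counter


def frequent_words(seq, word_size, min_count=2):
    """Return (word, count) pairs for words seen at least min_count times.

    Only words present in seq are stored, so at most
    len(seq) - word_size + 1 entries are kept.
    """
    seq = seq.upper()
    counts = Counter(
        seq[i:i + word_size] for i in range(len(seq) - word_size + 1)
    )
    kept = [(w, c) for w, c in counts.items() if c >= min_count]
    # Most frequent first, ties broken alphabetically.
    return sorted(kept, key=lambda wc: (-wc[1], wc[0]))


if __name__ == "__main__":
    toy = "ACGTACGTAC"  # made-up sequence
    for word, count in frequent_words(toy, 4):
        print(word, count)
```

With min_count=1 the same call lists every word once, which is the bulky 
case described above.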

Hope that helps,

Peter