[BiO BB] base counting
Peter Rice
pmr at ebi.ac.uk
Thu Mar 16 05:35:57 EST 2006
Corné HW Klaassen wrote:
> Hi Peter,
>
> Thanks for the quick reply. On paper this is exactly what I'm looking
> for but ......I gave compseq a try and it doesn't seem to work on
> features larger than 20 nt whereas I'm particularly interested in
> features 40-140 nt (I realize that this can be a very computational
> intensive job). Any other suggestions? Is there perhaps something
> similar for protein sequences or on some other arbitrary units?
Depends on what you are looking for.
For very long features it would need a lot of data to identify a strange
frequency.
Also, compseq needs a table entry for every possible n-mer. For DNA that is
4^n entries, which is already about 10^12 by the time you reach 20 bases.
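To put rough numbers on that, here is a small Python sketch of how a table with one slot per possible n-mer grows. It is only arithmetic, not a description of how compseq stores its counts; the 4 bytes per counter and the function name dense_table_entries are assumptions made for illustration.

```python
# Size of a dense table holding one counter per possible DNA n-mer.
# BYTES_PER_COUNT is an assumption for illustration, not a compseq detail.
BYTES_PER_COUNT = 4


def dense_table_entries(n: int, alphabet_size: int = 4) -> int:
    """Number of distinct words of length n over the given alphabet."""
    return alphabet_size ** n


for n in (4, 8, 12, 20, 40):
    entries = dense_table_entries(n)
    gib = entries * BYTES_PER_COUNT / 2**30
    print(f"{n:>2}-mers: {entries:.3e} entries, about {gib:.3e} GiB")
```

For proteins the same function can be called with alphabet_size=20, where the growth is faster still.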
You could try a shorter word size and look for overlaps. In the E. coli case,
CTAG is low, and you can also compare TAGA, TAGC, TAGG and TAGT to see which
could be the less common 5-mers.
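As a minimal sketch of the overlap idea, the Python below counts overlapping 4-mers on one strand and then looks up the four 4-mers that overlap CTAG shifted one base to the right. The helper names count_words and right_overlaps and the toy sequence are made up for illustration; real input would come from your own sequence file.

```python
from collections import Counter


def count_words(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq (one strand, no ambiguity codes)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def right_overlaps(word: str, alphabet: str = "ACGT") -> list:
    """Words of the same length that overlap `word` shifted one base right."""
    return [word[1:] + base for base in alphabet]


# Toy sequence for illustration only, not real E. coli data.
seq = "GATTACACTAGGCTAGATTAGCCTAGT"
counts = count_words(seq, 4)

print("CTAG", counts["CTAG"])
for word in right_overlaps("CTAG"):
    print(word, counts[word])
```

A low count for CTAG together with a low count for one of its overlapping words hints at which 5-mer is the rare one, without ever building a 5-mer table.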
EMBOSS also has:
wordcount, which reports the most frequent words of a given size. The memory
used by wordcount depends on the size of the input (it works through all
n-mers that actually appear, which would be close to 1 per base of input).
polydot, which plots word matches between 1 or more sequences and can report
their locations. Frequent n-mers show up readily off the main diagonal.
Looking at the wordcount output, it would be useful to be able to set a
minimum occurrence, because it reports every word, including those that
appear only once. For 40-mers that means the output is about 40 times the
original input length. I will add this for the next EMBOSS release!
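In the meantime, the same idea is easy to sketch in Python: count only the words that actually occur, then drop those below a threshold. This is an illustration of the approach, not wordcount's implementation, and frequent_words and min_count are names invented here.

```python
from collections import Counter


def frequent_words(seq: str, k: int, min_count: int = 2) -> list:
    """Return (word, count) pairs for k-mers seen at least min_count times,
    most frequent first. Memory grows with the number of distinct k-mers
    present in seq, not with the number of possible k-mers."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [(word, n) for word, n in counts.most_common() if n >= min_count]


# Toy input for illustration only.
for word, n in frequent_words("ACGTACGTAC", 4, min_count=2):
    print(word, n)
```

With k=40 and min_count=2, words that occur only once are never printed, so the output stays small unless the input really contains repeated 40-mers.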
Hope that helps,
Peter