The FINDPATTERN language was developed by the GCG developers to allow
more flexible descriptions of nucleic acid and protein patterns.
Implied Sets and Repeat Counts
Parentheses () enclose one or more symbols that can be repeated some number
of times. Braces {} enclose numbers that tell how many times the symbols
within the preceding parentheses must be found.
Sometimes, you can leave out part of an expression. If braces appear
without preceding parentheses, the numbers in the braces define the number
of repeats for the immediately preceding symbol. One or both of the
numbers within the braces may be missing. For instance, the pattern
GATG{2,}A means GAT, followed by G repeated from 2 to 350,000 times,
followed by A; the pattern GATG{}A means GAT, followed by G repeated from 0
to 350,000 times, followed by A; the pattern GAT(TG){,2}A means GAT,
followed by TG repeated from 0 to 2 times, followed by A. If the pattern
in the parentheses is an OR expression (see OR Matching below), it cannot
be repeated more than 2,000 times.
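
The defaults for missing numbers can be sketched in Python. This is only an
illustration, not the GCG implementation: the helper names repeat_bounds and
regex_quantifier are hypothetical, Python's re module stands in for the
pattern matcher, and treating a lone number such as {3} as an exact count is
an assumption that the text above does not confirm.

```python
import re

MAX_REPEAT = 350_000  # upper limit quoted in the text above

def repeat_bounds(braces):
    """Return (minimum, maximum) for a brace expression such as '{2,}'."""
    inner = braces.strip()[1:-1].strip()
    if inner == "":
        return 0, MAX_REPEAT
    if "," not in inner:
        # Assumption: a lone number means an exact count.
        count = int(inner)
        return count, count
    low, high = (part.strip() for part in inner.split(","))
    return (int(low) if low else 0, int(high) if high else MAX_REPEAT)

def regex_quantifier(braces):
    """Spell out both bounds so Python's re module reads them unambiguously."""
    low, high = repeat_bounds(braces)
    return "{%d,%d}" % (low, high)

# The three patterns from the text, rebuilt as Python regular expressions.
g_two_or_more = re.compile("GATG" + regex_quantifier("{2,}") + "A")
g_any_number = re.compile("GATG" + regex_quantifier("{}") + "A")
tg_up_to_two = re.compile("GAT(?:TG)" + regex_quantifier("{,2}") + "A")
```

Under these assumptions, g_two_or_more accepts GATGGA but rejects GATGA,
and g_any_number accepts GATA because G may be repeated zero times.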
OR Matching
If you are searching nucleic acids, the ambiguity symbols defined in
Appendix III let you define any combination of G, A, T, or C. If you are
searching proteins, you can specify any of several symbol choices by
enclosing the different choices in parentheses and separating the choices
with commas. For instance, RGF(Q,A)S means RGF, followed by either Q or A,
followed by S. The choices need not be the same length, and there can
be up to 31 different choices within each set of parentheses. The pattern
GAT(TG,T,G){1,4}A means GAT, followed by any combination of TG, T, or G
repeated from 1 to 4 times, followed by A. The sequence GATTGGA matches
this pattern (for example as GAT, then TG, then G, then A).
There can be several parentheses in a pattern, but parentheses cannot be
nested.
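
As a rough illustration of OR matching, the two example patterns above can be
rewritten as Python regular expressions. This is a sketch only: the helper
or_group_to_regex is hypothetical, and Python's re module is used as a
stand-in, so its matching behavior may differ from the actual program in
details not covered here.

```python
import re

MAX_CHOICES = 31  # limit quoted in the text above

def or_group_to_regex(group):
    """Turn a group such as '(TG,T,G)' into the regex group '(?:TG|T|G)'."""
    choices = [choice.strip() for choice in group.strip()[1:-1].split(",")]
    if len(choices) > MAX_CHOICES:
        raise ValueError("too many choices in one set of parentheses")
    return "(?:" + "|".join(re.escape(choice) for choice in choices) + ")"

# RGF(Q,A)S and GAT(TG,T,G){1,4}A from the text.
protein = re.compile("RGF" + or_group_to_regex("(Q,A)") + "S")
nucleic = re.compile("GAT" + or_group_to_regex("(TG,T,G)") + "{1,4}A")
```

With this translation, protein accepts RGFQS and RGFAS, and nucleic accepts
GATTGGA, in line with the examples in the text.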
NOT Matching
The pattern GC~CAT means GC, followed by any symbol except C, followed by
AT. The pattern GC~(A,T)CC means GC, followed by any symbol except A or T,
followed by CC.
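
A NOT expression behaves much like a negated character class in a regular
expression. The sketch below makes that comparison concrete in Python. The
helper not_to_regex is hypothetical, it assumes that every excluded choice is
a single symbol, and it treats symbols literally (nucleic acid ambiguity
symbols are not expanded).

```python
import re

def not_to_regex(expr):
    """Turn '~C' or '~(A,T)' into a negated character class."""
    body = expr.strip()[1:]  # drop the leading '~'
    if body.startswith("("):
        symbols = [symbol.strip() for symbol in body[1:-1].split(",")]
    else:
        symbols = [body]
    # Assumption: each excluded choice is exactly one symbol.
    if any(len(symbol) != 1 for symbol in symbols):
        raise ValueError("this sketch only handles single-symbol choices")
    return "[^" + "".join(re.escape(symbol) for symbol in symbols) + "]"

# GC~CAT and GC~(A,T)CC from the text.
not_c = re.compile("GC" + not_to_regex("~C") + "AT")
not_a_or_t = re.compile("GC" + not_to_regex("~(A,T)") + "CC")
```

Here not_c accepts GCGAT but rejects GCCAT, and not_a_or_t accepts GCGCC
but rejects GCACC.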
Begin and End Constraints
The pattern <GACCAT can only be found if it occurs at the beginning of the
sequence range being searched. Likewise, the pattern GACCAT> can only be
found if it occurs at the end of the sequence range.
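
The effect of the begin and end constraints can be sketched in Python by
anchoring a search to a slice of the sequence. The functions anchored_regex
and find_in_range are hypothetical, the range is given as 0-based Python
slice bounds (an assumption made for this sketch), and only literal symbols
are handled.

```python
import re

def anchored_regex(pattern):
    """Translate a leading '<' or trailing '>' into regex anchors."""
    prefix = suffix = ""
    if pattern.startswith("<"):
        prefix, pattern = r"\A", pattern[1:]
    if pattern.endswith(">"):
        suffix, pattern = r"\Z", pattern[:-1]
    return prefix + re.escape(pattern) + suffix

def find_in_range(pattern, sequence, begin, end):
    """Return 0-based start offsets of matches inside sequence[begin:end]."""
    window = sequence[begin:end]  # the sequence range being searched
    regex = anchored_regex(pattern)
    return [begin + match.start() for match in re.finditer(regex, window)]

sequence = "GACCATTTGACCAT"
everywhere = find_in_range("GACCAT", sequence, 0, len(sequence))
at_begin = find_in_range("<GACCAT", sequence, 0, len(sequence))
at_end = find_in_range("GACCAT>", sequence, 0, len(sequence))
shifted = find_in_range("<GACCAT", sequence, 1, len(sequence))
```

In this sketch the anchors apply to the searched range rather than to the
whole sequence, so shifting the start of the range by one position means
<GACCAT no longer finds the first occurrence.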