GP

2000

NAME

gp_acc - Compute the auto-cross correlation values for a sequence

SYNOPSIS

gp_acc [-e] [-l value] [-p value] [-q] [-v] [-d] [-h] [inputfile] [outputfile]

OPTIONS

-e

Only encode the sequence, do not compute the ACC.

-l value

Precede the output with a header containing the descriptions of the variables, assuming a sequence length of value.

-p value

Set the maximal lag to value. If this option is not used, the program sets the maximal lag to 1/3 of the length of the current sequence.

-v

Prints the version information.

-d

Prints lots of debugging information.

-h

Shows usage information.

inputfile

file to process; if not given, will use standard input

outputfile

file to write the data to; if not given, will use standard output

DESCRIPTION

Note: currently only DNA/RNA sequences are supported.

Auto-cross correlation (ACC) is a way of converting a sequence into a set of variables that contain information useful for, for example, statistical analysis. In ACC, the sequence is first encoded into numerical values. Currently, this encoding assigns to each nucleotide three values (-1,-1,1 for A; 1,-1,-1 for C; -1,1,-1 for G; and 1,1,1 for T/U), one for each of the three so-called descriptor variables. Next, the sequence is aligned with itself with a lag (shift) equal to one, and covariance coefficients are computed for each pair of the three descriptor variables, nine coefficients in total. Then the lag is increased by one and the procedure is repeated until the lag reaches the maximal lag value (1/3 of the sequence length by default, or a value chosen by the user with -p).
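The procedure above can be sketched in Python. This is only an illustration, not the gp_acc source: the descriptor values come from the text, but the function names, the mean-centering, the division by the number of overlapping positions, and the choice to run lags from 1 to (maximal lag - 1) are assumptions made to match the value count stated in this description.

```python
# Descriptor values as listed in the description; T and U share one encoding.
ENCODING = {
    "A": (-1, -1, 1),
    "C": (1, -1, -1),
    "G": (-1, 1, -1),
    "T": (1, 1, 1),
    "U": (1, 1, 1),
}
N_DESC = 3  # number of descriptor variables per nucleotide


def encode(sequence):
    """Turn a DNA/RNA string into a list of 3-value descriptor tuples."""
    return [ENCODING[base] for base in sequence.upper()]


def acc(sequence, max_lag=None):
    """Return one row of ACC values for a sequence (hypothetical helper).

    Assumption: the covariance is the mean of centered products over the
    overlapping positions; the real program may normalize differently.
    """
    z = encode(sequence)
    n = len(z)
    if max_lag is None:
        max_lag = n // 3  # assumed rounding for "1/3 of the length"
    means = [sum(row[j] for row in z) / n for j in range(N_DESC)]

    values = []
    for lag in range(1, max_lag):  # assumed: lags 1 .. max_lag - 1
        overlap = n - lag  # positions where both copies of the sequence exist
        for j in range(N_DESC):
            for k in range(N_DESC):
                total = sum(
                    (z[i][j] - means[j]) * (z[i + lag][k] - means[k])
                    for i in range(overlap)
                )
                values.append(total / overlap)
    return values


if __name__ == "__main__":
    row = acc("ACGT" * 15, max_lag=20)
    print(len(row))
```

Each lag contributes nine values, one for every ordered pair of descriptor variables, so the row grows linearly with the maximal lag.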

For each sequence processed, a row of data is produced, containing (maximal lag - 1) * number of descriptor variables * number of descriptor variables values. For example, for a nucleotide sequence with three descriptor variables (as described above) and a maximal lag of 20, you will get 19 * 3 * 3 = 171 values.
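The row length can be checked with a few lines of Python. The function names are hypothetical, and rounding the default lag down with integer division is an assumption; the text only says "1/3 of the length".

```python
def default_max_lag(sequence_length):
    """Default maximal lag: one third of the sequence length.

    Assumption: the fraction is rounded down.
    """
    return sequence_length // 3


def acc_value_count(max_lag, n_descriptors=3):
    """Number of values in one output row, per the formula in the text."""
    return (max_lag - 1) * n_descriptors * n_descriptors


if __name__ == "__main__":
    # Three descriptor variables and a maximal lag of 20, as in the example.
    print(acc_value_count(20))
    # A 60-nucleotide sequence with the default maximal lag.
    print(acc_value_count(default_max_lag(60)))
```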

This probably won't make much sense to you unless you are familiar with terms such as PCA and PLS. You can find more information on this subject in S. Wold and M. Sjöström, 1998, "Chemometrics, present and future success", Chemometrics and Intelligent Laboratory Systems 44:3-14, and, by the same authors, 1985, "A multivariate study of the relationship between the genetic code and the physical-chemical properties of amino-acids", J. Mol. Evol. 22:272-7.

DIAGNOSTICS

All Genpak programs complain in situations where you would complain too, such as when they cannot find a sequence you gave them or when the sequence is not valid.

The Genpak programs do not write over existing files. I have found this feature very useful :-)

BUGS

I'm sure there are plenty left, so please mail me if you find them. I tried to clean up every bug I could find.

AUTHOR

January Weiner III <january@bioinformatics.org>
