GP

April 2000

NAME

GP - utilities to manipulate DNA / RNA / protein sequences

Copyright (C) 2000 January Weiner III <january@bioinformatics.org>

GP homepage:

http://www.bioinformatics.org/genpak/

LICENSE

GP is GPL'ed. Please read the file LICENSE.TXT for details.

DESCRIPTION

GP is a set of small utilities written in ANSI C to manipulate DNA sequences in a Unix fashion, fit for combining within shell and cgi scripts. Some example cgi scripts are provided in the cgi directory. I wrote these utilities for myself and found them very useful for my work; they are fast and quite reliable, and playing with large numbers of sequences is much more convenient than with standard GUI tools. Feel free to mail me bug reports and suggestions.

The sequences are usually in fasta format: the first line is the sequence name, starting with ">", and the sequence itself follows on the next lines. The programs also accept gzipped sequence files (provided zlib support was enabled at compile time, which is the default).
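As an illustration of the fasta layout described above, here is a minimal Python sketch of a reader. It is not the GP code (which is written in C); the function names open_maybe_gzip and read_fasta are made up for this example, and gzip detection by magic number is an assumption about how one might do it.

```python
import gzip
import io

def open_maybe_gzip(path):
    """Open a file as text, transparently handling gzip compression."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == b"\x1f\x8b":            # gzip files start with these two bytes
        return gzip.open(path, "rt")
    return open(path, "r")

def read_fasta(handle):
    """Yield (name, sequence) pairs from an open fasta text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):        # a new record begins
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:          # sequence lines of the current record
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

# Two records in one stream, as in a multi-sequence fasta file.
demo = io.StringIO(">seq1\nACGT\nTTGA\n>seq2\nGGCC\n")
for seq_name, seq in read_fasta(demo):
    print(seq_name, len(seq))
```

The reader is a generator, so sequences are processed one at a time, which matches the stream-oriented way the programs are meant to be used.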

Upon installation, GP creates a directory where it stores all its data. By default, this is the /usr/lib/genpak directory. If one of the programs cannot find a file given as an argument, it looks for it in this directory, and only if it is not there either does it exit with an error. You can put shared files into this directory; note that they will not be erased upon deinstallation or reinstallation. However, in the latter case they might well get overwritten if you replaced the original GP files with your own.
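The lookup order described above can be sketched in Python as follows. This is only an illustration of the behaviour, not the actual C implementation; the function name resolve_input is hypothetical.

```python
import os

DATA_DIR = "/usr/lib/genpak"    # default data directory named in the text

def resolve_input(filename, data_dir=DATA_DIR):
    """Return a usable path for filename, trying the data directory second."""
    if os.path.isfile(filename):                 # 1. the path as given
        return filename
    candidate = os.path.join(data_dir, filename)
    if os.path.isfile(candidate):                # 2. the shared data directory
        return candidate
    # 3. give up with an error
    raise FileNotFoundError(f"{filename} not found here or in {data_dir}")
```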

All programs share some common options:

  • -h prints out a quick summary of options

  • -H output in HTML mode

    Some programs can print nicely formatted tables or produce some other HTML specialized output. All programs collect warning messages and display them as the last thing before exiting. Do not use this option if you intend to feed other programs with the standard output.

  • -v prints version information

  • -q suppresses all error messages ("quiet")

  • -d prints out debugging information

Most programs also accept standard input (that was one of the main reasons I wrote these utilities in the first place) and by default write their results to standard output. This gives you several ways of running the programs:

    cat sequence.fasta | program > program.output

    some_other_program | program | yet_another_program

    program input.file output.file

    program

In the last case, you have to type in or paste whatever data the program expects on standard input, and the program writes the processed data directly to the screen.

In most cases, you can use multiple sequences stored in one file in fasta format. Programs that require a sequence file keep working until every sequence that can be read from the input (a file or standard input) has been processed.

LIST OF PROGRAMS

  • gp_qs
  • gp_getseq
  • gp_gc
  • gp_map
  • gp_tm
  • gp_matrix
  • gp_mkmtx
  • gp_shift
  • gp_randseq
  • gp_cusage
  • gp_seq2prot
  • gp_findorf
  • gp_slen
  • gp_dimer
  • gp_trimer
  • gp_pattern
  • gp_primer
  • gp_acc
  • gp_scan
  • gp_pars
Here are the short program descriptions. Take a look at the respective manual pages or the html documentation for more information.

  • gp_qs

    quickly finds a sequence within a larger sequence and prints out the positions. Sometimes you just don't need blasta -- like when you only want to know where exactly your primer binds in a given sequence. You can either type the sequence directly as a command line argument, like

    gp_qs ACTGACTG [sequence filename]

    or give a filename on the command line instead.
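The kind of search gp_qs performs can be sketched in a few lines of Python. This is a sketch under assumptions: it searches the forward strand only, reports 1-based positions, and allows overlapping matches; whether gp_qs behaves identically in these respects is not stated here. The name find_all is made up.

```python
def find_all(pattern, sequence):
    """Return 1-based start positions of every (possibly overlapping) match."""
    pattern, sequence = pattern.upper(), sequence.upper()
    positions = []
    start = sequence.find(pattern)
    while start != -1:
        positions.append(start + 1)                 # convert to 1-based numbering
        start = sequence.find(pattern, start + 1)   # step by one to allow overlaps
    return positions

print(find_all("ACTG", "ACTGACTGAA"))   # prints [1, 5]
```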

  • gp_getseq

    quickly retrieves a sequence fragment. Usage is simple:

    gp_getseq Position1 Position2 [sequence filename]

    Position1 is the number of the first base to be retrieved, and Position2 is the number of the last. Note that if Position2 < Position1, the retrieved sequence is complementary to the fragment Position2...Position1.
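A minimal Python sketch of this kind of retrieval, assuming 1-based inclusive coordinates and assuming that reversed coordinates (first position larger than the last) request the reverse complement. The name get_fragment is hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def get_fragment(sequence, pos1, pos2):
    """Return bases pos1..pos2 (1-based, inclusive).

    If pos1 > pos2 the fragment is taken from the complementary strand,
    i.e. the reverse complement of pos2..pos1 is returned.
    """
    if pos1 <= pos2:
        return sequence[pos1 - 1:pos2]
    forward = sequence[pos2 - 1:pos1]
    return forward.translate(COMPLEMENT)[::-1]

print(get_fragment("AACCGGTT", 2, 5))   # prints ACCG
print(get_fragment("AACCGGTT", 5, 2))   # prints CGGT
```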

  • gp_gc

    Prints out the GC content of a given sequence or sequences. Can also compute the mean and standard error (SE) for a larger number of sequences.
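For illustration, GC content and the mean with its standard error could be computed like this in Python. The sketch assumes that only A, C, G and T are counted and that SE means the sample standard deviation divided by the square root of n; gp_gc may treat ambiguous bases differently.

```python
import math

def gc_content(sequence):
    """Fraction of G and C among the A, C, G, T bases of a sequence."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return gc / total if total else 0.0

def mean_and_se(values):
    """Mean and standard error of the mean (sample SD / sqrt(n))."""
    n = len(values)
    mean = sum(values) / n
    if n < 2:
        return mean, 0.0
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)

contents = [gc_content(s) for s in ("ATGC", "GGCC", "ATAT")]
print(mean_and_se(contents))
```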

  • gp_map

    gp_map automatically generates graphical gene maps. You provide a simple input -- a list of genes, their positions, maybe some parameters -- and the program outputs a PNG image showing the gene map. If the -H option is specified, an IMAP file is created in addition; this allows clickable graphical maps to be created on the fly.

  • gp_tm

    Prints out the melting temperature (Tm) of a given sequence. Three algorithms can be used: the exact nearest-neighbor algorithm, the approximate GC content algorithm, and the evil and false 4*[GC] + 2*[AT] algorithm.
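Only the simplest of the three rules is spelled out above, so only that one is sketched here, in Python. [GC] and [AT] are taken to mean the number of G/C and A/T bases in the oligonucleotide; the nearest-neighbor method needs a thermodynamic parameter table that is not reproduced in this page. The name tm_wallace is a label chosen for this example.

```python
def tm_wallace(sequence):
    """Rough melting temperature in degrees C: 4 per G/C base plus 2 per A/T base."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return 4 * gc + 2 * at

print(tm_wallace("ACGTACGTAC"))   # 5 G/C and 5 A/T bases: prints 30
```

As the description warns, this rule is a crude estimate and should not be trusted for anything but short oligonucleotides.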

  • gp_matrix

    gp_matrix is a program to look for promoters in a set of sequence files, using the Staden matrix (see: Hertz, G. and Stormo, G.D. 1996. Escherichia coli promoter sequences: analysis and prediction. Meth. Enzym. 273). Basically, you have a matrix file containing scores and penalties for nucleotides at different positions in the supposed -35 and -10 boxes, as well as in the +1 region of a given sequence (see the file "matryca" in the data/ directory, which is the same as the E. coli matrix published in Hertz et al.).

    The program loads sequences from the sequence file and then scans each of them, using all possible combinations of gap lengths between the +1, -10 and -35 boxes and all possible positions in the sequence, so as to find the combination that gives the highest score for the sequence. It then prints a formatted output in the following form:

    #score sequence...[-35 core]...[-10 core]...[start]...

    The '|' characters denote the boundaries of matrix'ed fragments.

    In the "data" directory you will find the original Staden E. coli matrix. The myco.mtx Mycoplasma pneumoniae matrix and the program have been described in Weiner, J. et al. 2000, "Transcription in Mycoplasma pneumoniae".
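The exhaustive scan over positions and gap lengths can be sketched in Python as follows. This is a simplified illustration, not the gp_matrix algorithm itself: it uses only two boxes (the +1 region is left out), the 0/1 "consensus" scores are toy values rather than the published matrix, and the gap range of 15 to 19 bases is an assumption made for the example. All function names are hypothetical.

```python
def consensus_matrix(consensus):
    """Toy matrix: one dict per position, score 1 for the consensus base, else 0."""
    return [{base: (1 if base == c else 0) for base in "ACGT"} for c in consensus]

def box_score(matrix, sequence, start):
    """Sum the per-position scores of the matrix laid over sequence at start."""
    return sum(col.get(sequence[start + i], 0) for i, col in enumerate(matrix))

def best_promoter(sequence, box35, box10, gaps=range(15, 20)):
    """Try every position and gap length; return (score, pos35, gap) of the best hit.

    pos35 is the 0-based start of the -35 box; gap is the spacer length
    between the end of the -35 box and the start of the -10 box.
    """
    best = None
    for pos35 in range(len(sequence)):
        for gap in gaps:
            pos10 = pos35 + len(box35) + gap
            if pos10 + len(box10) > len(sequence):
                continue                        # -10 box would run off the end
            score = (box_score(box35, sequence, pos35)
                     + box_score(box10, sequence, pos10))
            if best is None or score > best[0]:
                best = (score, pos35, gap)
    return best

seq = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GG"
print(best_promoter(seq, consensus_matrix("TTGACA"), consensus_matrix("TATAAT")))
# prints (12, 2, 17)
```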

  • gp_mkmtx

    creates nucleotide frequency matrices, such as those used by the gp_matrix program.
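A nucleotide frequency matrix of this kind can be built from a set of aligned sites as in the following Python sketch. It assumes all sites have the same length and stores plain relative frequencies; the file format and any score transformation used by gp_mkmtx are not covered. The name frequency_matrix is made up.

```python
def frequency_matrix(sites):
    """Per-position relative base frequencies for equal-length aligned sites."""
    length = len(sites[0])
    matrix = []
    for i in range(length):
        column = [site[i].upper() for site in sites]    # i-th base of every site
        matrix.append({base: column.count(base) / len(sites) for base in "ACGT"})
    return matrix

sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
for position, freqs in enumerate(frequency_matrix(sites), start=1):
    print(position, freqs)
```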

  • gp_shift

    Sometimes you have a list of genes:

        100000 101000 gene1
        200000 201000 gene2
        400000 391000 gene3
        ...

    ...and would like to, for example, print out the promoter regions, that is, sequences from -100 to +10 relative to the 5'-end of the genes. gp_shift is useful for this.
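The coordinate arithmetic behind such a shift can be sketched in Python. The sketch makes several assumptions: a gene whose start is larger than its end lies on the complementary strand, offsets are simply added to or subtracted from the start coordinate, and the output keeps the "start end name" layout of the gene list above. The function names are hypothetical.

```python
def promoter_region(start, end, upstream=100, downstream=10):
    """Coordinates from -upstream to +downstream relative to the gene's 5' end.

    start > end marks a gene on the complementary strand, so the
    region is mirrored and returned with first > last as well.
    """
    if start <= end:
        return start - upstream, start + downstream
    return start + upstream, start - downstream

def shift_line(line, upstream=100, downstream=10):
    """Rewrite one 'start end name' line to describe the promoter region."""
    start, end, name = line.split()[:3]
    first, last = promoter_region(int(start), int(end), upstream, downstream)
    return f"{first} {last} {name}"

print(shift_line("100000 101000 gene1"))   # prints 99900 100010 gene1
print(shift_line("400000 391000 gene3"))   # prints 400100 399990 gene3
```

Because the shifted coordinates keep the same orientation convention, the result could be fed to a fragment-retrieval step that understands reversed positions.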

  • gp_randseq

    Unless the option -r is set, it prints out random fragments from a sequence file. The default fragment length is 100, and you can change it with the option -l length. If you set -r, however, completely random sequences are generated. You can set their GC content with the option -g value. There is also an option -m, which stands for "Markov chains", but all it does is make the probability of selecting a nucleotide depend on the previous nucleotide; these probabilities are also taken from a sequence file.
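The two ways of generating random sequences mentioned above can be sketched in Python. This is an illustration only: it assumes G and C are equally likely within the GC fraction (likewise A and T), and that the "Markov" mode is a first-order chain whose transition counts come from a template sequence. The function names and the rng parameter are made up for the example.

```python
import random

def random_sequence(length, gc=0.5, rng=random):
    """Random sequence in which each base is G or C with probability gc."""
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]     # A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=length))

def transition_counts(sequence):
    """Count how often each base follows each other base in sequence."""
    counts = {a: {b: 0 for b in "ACGT"} for a in "ACGT"}
    for prev, nxt in zip(sequence, sequence[1:]):
        if prev in counts and nxt in counts:
            counts[prev][nxt] += 1
    return counts

def markov_sequence(length, template, rng=random):
    """First-order Markov chain: the next base depends on the previous one."""
    if length <= 0:
        return ""
    counts = transition_counts(template.upper())
    current = rng.choice("ACGT")
    out = [current]
    while len(out) < length:
        weights = [counts[current][b] for b in "ACGT"]
        if sum(weights) == 0:            # base never seen in the template
            weights = [1, 1, 1, 1]
        current = rng.choices("ACGT", weights=weights)[0]
        out.append(current)
    return "".join(out)

print(random_sequence(40, gc=0.3))
print(markov_sequence(40, "ACACACGTGTGT"))
```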

  • gp_seq2prot

    Converts a nucleotide sequence to a protein sequence. The sequence must begin with a start codon: this is mandatory. A missing stop codon or a premature end of the input sequence (like in the middle of a codon) results only in a warning message.

    You can provide your own codon tables; for the format of the codon_file look at data/standard.cdn and data/myco.cdn. Basically, you do not need to provide the whole table; it is enough to point out the differences. To see a codon file, type gp_seq2prot -p.
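The idea of a standard table plus a short list of differences can be sketched in Python. Assumptions in this sketch: only ATG is accepted as a start codon, unknown codons are written as X, and a plain dictionary stands in for the codon file (the real file format is not shown here). The TGA to W override reflects the well-known Mycoplasma reading of TGA as tryptophan. The name translate is hypothetical.

```python
from itertools import product

# Standard genetic code in TCAG order (first base varies slowest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(sequence, overrides=None):
    """Translate up to the first stop codon; return (protein, warnings)."""
    table = dict(STANDARD)
    table.update(overrides or {})        # only the differences need to be given
    seq = sequence.upper().replace("U", "T")
    if seq[:3] != "ATG":                 # assumption: ATG is the only start codon
        raise ValueError("sequence does not start with a start codon")
    protein, warnings = [], []
    for i in range(0, len(seq) - 2, 3):
        amino = table.get(seq[i:i + 3], "X")
        if amino == "*":
            break
        protein.append(amino)
    else:                                # loop ended without meeting a stop codon
        warnings.append("no stop codon found")
        if len(seq) % 3:
            warnings.append("input ends in the middle of a codon")
    return "".join(protein), warnings

print(translate("ATGAAATGATAA"))                     # prints ('MK', [])
print(translate("ATGAAATGATAA", {"TGA": "W"}))       # prints ('MKW', [])
```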

  • gp_findorf

    Prints out all ORFs that are contained in a sequence. gp_findorf always looks for the longest ORF within the given limit. See also the notes for gp_seq2prot.

  • gp_cusage

    Prints out the codon usage of one or more sequences. The options are the same as for gp_seq2prot; actually -- this *is* nearly the same program. I just like to have them separate.
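Counting codon usage amounts to tallying non-overlapping triplets, as in this minimal Python sketch. It assumes the sequence is read in frame from its first base and that an incomplete trailing codon is ignored; the name codon_usage is made up.

```python
from collections import Counter

def codon_usage(sequence):
    """Count in-frame codons, ignoring an incomplete codon at the end."""
    seq = sequence.upper()
    return Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))

usage = codon_usage("ATGAAAAAATGA")
for codon, count in sorted(usage.items()):
    print(codon, count)
```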

  • gp_slen

    Sequence length. Sometimes useful. Can also compute the mean and SE for a set of sequences.

  • gp_dimer

    Records frequencies of nucleotide pairs: AA, AC, AG...TT. This is sometimes useful for characterizing a sequence. You can also record frequencies of nucleotide pairs separated by a given number of nucleotides, to check, for example, how often an 'A' comes five nucleotides downstream of a 'T'. Believe it or not, it is useful.
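A Python sketch of pair counting with an adjustable spacing. Here the spacing is expressed as a distance (1 means adjacent bases, 5 means the second base lies five positions downstream of the first); how gp_dimer itself counts the separation is an assumption. The name pair_frequencies is hypothetical.

```python
from collections import Counter

def pair_frequencies(sequence, distance=1):
    """Relative frequencies of pairs XY where Y lies `distance` bases after X."""
    seq = sequence.upper()
    counts = Counter(seq[i] + seq[i + distance]
                     for i in range(len(seq) - distance))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}

print(pair_frequencies("TAAAAAT", distance=5))   # prints {'TA': 0.5, 'AT': 0.5}
```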

  • gp_trimer

    record frequencies of nucleotide trimers: AAA, AAC, AAG...TTT.

  • gp_pattern

    Records frequencies of patterns of a given length. Note that the number of possible patterns grows exponentially with pattern length: for a tetramer, for example, there are 4^4 = 256 possible patterns.
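Counting patterns of length k with a sliding window, and the 4^k growth mentioned above, can be illustrated in Python. The sketch counts overlapping windows on the forward strand only, which is an assumption; the function names are made up.

```python
from collections import Counter

def pattern_counts(sequence, k):
    """Count every overlapping window of length k in the sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def possible_patterns(k):
    """Number of distinct DNA patterns of length k."""
    return 4 ** k

print(possible_patterns(4))                   # prints 256
print(pattern_counts("ACGTACGT", 4)["ACGT"])  # prints 2
```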

  • gp_primer

    Calculates oligonucleotide stem/loop and dimer structural parameters. This is what most of the web pages and programs like "Oligo" do. The set of thermodynamic parameters used here comes from a paper by SantaLucia et al.

  • gp_acc

    This program can be used to convert a sequence into a set of so-called auto-cross-correlation coefficients, which can be further analyzed by, for example, principal component analysis (PCA). If you want to learn more about it, read Jonsson et al., 1991, "A multivariate representation and analysis of DNA sequence data".

  • gp_scan

    gp_scan is used to further analyse the auto-cross-correlation terms to find out more about patterns or regularities in a sequence.

  • gp_pars

    This program shows that I'm hopeless and don't know anything about Un*x tools. All pars does is to change the "%0D%0A" string into a newline character, because I couldn't find a way around that using sed(1).
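For comparison, the same substitution is a one-liner in Python. This is just an equivalent sketch, not part of GP; the name decode_newlines is made up.

```python
def decode_newlines(text):
    """Replace each URL-encoded CR+LF ("%0D%0A") with a newline character."""
    return text.replace("%0D%0A", "\n")

print(decode_newlines(">seq1%0D%0AACGT%0D%0A"), end="")
```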

THANKS

Many thanks go to all the good souls from comp.lang.c, whose advice was necessary to write all these programs, and to Hinrich W. H. Göhlmann and Steve Brewer for encouraging me in my work.

NOTE FROM AUTHOR

I'm not a programmer, and GP is amateur work. Everything started because I found myself constantly writing small utilities which could do batch jobs for me, instead of using packages like DNA Star. A graphical user interface is OK, as long as you don't have to process, say, 677 sequences -- and 677 is a number which occurs often in my work, because it is the number of genes in the Mycoplasma pneumoniae genome I am working on. There are also many Unix tools, but they are either hard to use, or hard to install, or do not even compile on my Linux boxes.

Originally, the package name was Genpak, but there is a company with that name, so I changed most of the names to GP.

The programs, I'm sure, have lots of bugs and poor code. For example, I never got the Makefile to work properly. So if you can help me make GP a little better, do so -- and mail me.

SEE ALSO

gp_acc(1) gp_cusage(1) gp_digest(1) gp_dimer(1) gp_findorf(1) gp_gc(1) gp_getseq(1) gp_map(1) gp_matrix(1) gp_mkmtx(1) gp_pattern(1) gp_primer(1) gp_qs(1) gp_randseq(1) gp_seq2prot(1) gp_slen(1) gp_tm(1) gp_trimer(1)

DIAGNOSTICS

All Genpak programs complain in situations where you would complain too, such as when they cannot find a sequence you gave them or the sequence is not valid.

The Genpak programs do not write over existing files. I have found this feature very useful :-)

BUGS

I'm sure there are plenty left, so please mail me if you find them. I tried to clean up every bug I could find.

AUTHOR

January Weiner III <january@bioinformatics.org>