Alignments in HTML from the command line

Interactive example of overlapping residue annotations. The first sequence has two residue selections indicated by cyan and red background. The second sequence exhibits two residue selections which are shown as red and green underlining. The text information pops up when the mouse is moved.

This page describes generation of alignment documents with commands from the UNIX shell. Please note that there is an interactive web service with an API and that the graphical Java program Strap also provides HTML export.

Reproducing the example

Download:

      BASE=http://www.bioinformatics.org/strap
      FILE=$BASE/strap.jar
      wget -N $FILE || curl -O $FILE
      FILE=$BASE/scripts/toHTML1.txt
      wget -N $FILE || curl -O $FILE
      FILE=$BASE/toHTML/data/fly_temp.gif
      wget -N $FILE || curl -O $FILE
      FILE=$BASE/aa/alignment2html.jar
      wget -N $FILE || curl -O $FILE
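
The repeated wget/curl pairs above can be folded into a small helper. The following is only a sketch: the function names `fetch` and `strap_urls` are hypothetical and not part of Strap, and the URLs are the four listed above.

```shell
#!/bin/sh
# Hypothetical helper: download one URL with wget, falling back to curl.
fetch() {
    wget -N "$1" || curl -O "$1"
}

BASE=http://www.bioinformatics.org/strap

# Print the four download URLs from the step above, one per line.
strap_urls() {
    for path in strap.jar scripts/toHTML1.txt toHTML/data/fly_temp.gif aa/alignment2html.jar; do
        printf '%s/%s\n' "$BASE" "$path"
    done
}
```

With network access, the files could then be fetched with `strap_urls | while read -r url; do fetch "$url"; done`.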

If the internet is accessed via a web proxy:

        export JavaProxy=' -DproxyHost=proxy.institution.org -DproxyPort=8080 -Dhttp.nonProxyHosts="" '

Test the proxy settings. You should see Google's HTML code.

        java $JavaProxy -jar strap.jar -testWeb http://www.google.com
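
If the proxy host and port differ between machines, the option string can be assembled from two arguments. This is a minimal sketch; the function name `java_proxy_opts` is hypothetical, and the host and port are the placeholder values used above.

```shell
#!/bin/sh
# Hypothetical helper: print the Java proxy options for host $1 and port $2.
# The string has the same form as the JavaProxy value shown above.
java_proxy_opts() {
    printf ' -DproxyHost=%s -DproxyPort=%s -Dhttp.nonProxyHosts="" ' "$1" "$2"
}

JavaProxy="$(java_proxy_opts proxy.institution.org 8080)"
export JavaProxy
```

The variable is then used unquoted, as in the commands on this page, so that the shell splits it into separate options.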

Create the HTML alignment in the figure:

 java  $JavaProxy  -jar strap.jar -script=toHTML1.txt  -toHTML=myOutput.html

The output file myOutput.html is ready to be displayed in a web browser.

alignment2html.jar is faster than strap.jar

The program strap.jar is both a command line tool and a desktop application. In contrast, alignment2html.jar, which is built from the same source, is lighter and faster because it does not include the GUI classes. Furthermore, MS-Windows support and software installation at run-time are deactivated. Another difference is that it uses a local Blat for similarity search, which is much faster than Blast, with the consequence that the databases need to be installed locally. The manual is displayed with the option -help:

      java -jar alignment2html.jar -help

The script file

The lines in the script file toHTML1.txt are executed sequentially. Lines that start with a hash character are ignored. An alphabetic list of all commands is printed with the command line option -help=script. Also see Scripting language.

The following three lines specify the number of characters per line, the minimal conservation of a residue position to be emphasized in bold face, and the residue color mode.

      set_characters_per_line 24
      set_conservation_threshold 70
      set_color_mode chemical

Three sequences are created: Canus, Xenopus and Drosophila. Dashes denote alignment gaps. The dashes would not be required if the alignment were computed with the command align *.

      aa_sequence MVLSAADKGNVKAAWGKVGGHAAEYGAEALERMFLSFPTTKTYFP, Canus
      aa_sequence -VLSAAERAQVKAAWGKI--QAGAHGAEALERMFLGFPTTKTYPF, Xenopus
      aa_sequence MILSAAERAQIKAAWGKVG-NAGAHGAEALD--FLGYPTTKSYPY, Drosophila

Assigning the cleaved initial methionine the index 1, the Xenopus sequence starts with amino acid number 2:

 set_residue_index_offset 1, Xenopus

The protein image icons are shown in the alignment row header. For the first two icons a URL is given. The last icon is loaded from a local file. Because a local file is not accessible from other computers, its image data is included in the HTML file:

      icon  http://www.goldenweb.it/software/immagini/icone/animals/water_animals/Frog.gif, Xenopus
      icon  http://www.goldenweb.it/software/immagini/icone/animals/misc_animals/dog1.gif,  Canus
      icon fly_temp.gif, Drosophila

If the database accession ID is given, a blue asterisk after the sequence name acts as a hyper-link:

      accession_id  UNIPROT:P0A7B8 , Canus

Residue selections are created with the command new_selection. Two display styles are supported: STYLE_BACKGROUND and STYLE_UNDERLINE. Color, display style, balloon-text and web-links are specified with add_annotation or set_annotation.

      new_selection  1-4,                                               Canus/N-terminus 
      set_annotation Hyperrefs=http://en.wikipedia.org/wiki/N-terminus, Canus/N-terminus 
      set_annotation Style=STYLE_BACKGROUND,                            Canus/N-terminus 
      set_annotation Color=#00ffFF,                                     Canus/N-terminus 
      add_annotation Balloon=Balloon text blablabla,                    Canus/N-terminus

The command set_annotation overrides any previous value, whereas add_annotation keeps already existing lines.
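
As a sketch of the second display style, the shell snippet below writes a script fragment that underlines a selection and gives it a two-line balloon text. It uses only the commands described above. The file name `underline_example.txt`, the selection name `Xenopus/region`, the residue range and the color are arbitrary assumptions, and the fragment presumes that the sequence Xenopus has already been defined.

```shell
#!/bin/sh
# Write a hypothetical script fragment to a file.
# The quoted 'EOF' delimiter makes the shell copy the lines verbatim.
cat > underline_example.txt <<'EOF'
new_selection  5-9,                         Xenopus/region
set_annotation Style=STYLE_UNDERLINE,       Xenopus/region
set_annotation Color=#ff0000,               Xenopus/region
add_annotation Balloon=First balloon line,  Xenopus/region
add_annotation Balloon=Second balloon line, Xenopus/region
EOF
```

Because add_annotation keeps existing lines, the second Balloon line should be appended to the first, whereas a second set_annotation with the key Color would replace the earlier color.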

A description of all commands is obtained with the program parameter -help=script.

Splice variants of Hexokinase. The size of the alignment exceeds the window size and therefore it can be scrolled. These sequences are loaded from nucleotide sequence files. Therefore the coding triplet and the exon number of the amino acid under the mouse pointer are shown.

3D-Visualization

Java applets (OpenAstex) for 3D visualization are included automatically if 3D-coordinates are provided. Both views are linked: clicking an amino acid in the alignment or the 3D-view will highlight the respective residue in the other view. If proteins are loaded from files in PDB format, the 3D-coordinates are taken directly from that file. Otherwise, a PDB model of an identical or at least homologous protein can be associated manually or automatically. The command project_coordinates takes either the PDB identifier in the form PDB:1sbc or PDB:1sbc_A (chain A), a URL of a (compressed or uncompressed) protein file, or the keyword AUTO. Residue mismatches between the sequence and the 3D-model are optionally shown. The following command will automatically identify homologous structures for all proteins (asterisk) using BLAST.

project_coordinates AUTO, *

3D-styles

Rendering styles of atoms in the 3D-visualization can be changed in two alternative ways:

3D-commands can be added to residue selections using the key "3D_view".
```
add_annotation 3D_view=3D_spheres, Canus/N-terminus
```
All annotations with the key 3D_view are sequentially executed.
3D views can be created for one or more proteins. These views can be referenced by their name. First, the view and specific peptide chains of the view are activated. Next, the amino acids and atoms are selected, and finally a 3D-style is applied.
```
open_3D NameOfView , List of proteins  
```
If the 3D view contains more than one peptide chain, one of them can be specified with select_3D.
```
select_3D NameOfView , one_protein  
```
All 3D commands have the prefix "3D_" and often resemble Rasmol and Jmol. The command 3D_select selects amino acids or atoms. It should not be mixed up with the command select_3D.
```
3D_select 20-100 
```
Optionally, atom types such as carbon alpha and carbon beta can be appended.
```
3D_select 20-100.CA.CB 
```
The style of these atoms can be changed with commands starting with the prefix "3D_".
```
3D_spheres on
```
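
The individual steps above can be put together in one script fragment. The snippet below is a sketch under assumptions: the file name `view_example.txt` and the view name `View1` are arbitrary, and it is assumed that a protein loaded with a custom name (here `My_name`) can be referenced by that name in open_3D and select_3D.

```shell
#!/bin/sh
# Write a hypothetical script fragment: load a PDB chain, open a view,
# activate the chain, select CA and CB atoms of residues 20-100, show spheres.
cat > view_example.txt <<'EOF'
load PDB:1ryp_C|My_name
open_3D View1 , My_name
select_3D View1 , My_name
3D_select 20-100.CA.CB
3D_spheres on
EOF
```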

These 3D-commands are independent of the 3D software, currently OpenAstex. They will remain valid even if another 3D viewer is supported in the future.

Specifying sequences

In the example, sequences are defined with the command aa_sequence, which accepts amino acid sequences with or without gaps. Alternatively, a local or remote file or database entry can be loaded.
Example with database references:

load UNIPROT:P49722 UNIPROT:P0A272 PDB:1ryp_C

Example with URLs:

load http://www.bioinformatics.org/strap/dataFiles/hs_HelicobacterPylori.swiss http://www.bioinformatics.org/strap/dataFiles/hs_SalmonellaTyphi.swiss

A subsequence rather than the entire amino acid sequence can be requested. The residue index interval is appended after an exclamation mark. Either one of the two interval boundaries can be omitted. Example:

load UNIPROT:P49722!30-60

Optionally, a protein name can be given after a vertical bar. Example:

load PDB:1ryp_C|My_name

The name can contain the following variables: $ORGANISM, $ORGANISM_SCIENTIFIC, $ORGANISM5 (e.g. "DroMe" for Drosophila melanogaster), $NAME (the original name), $PDB, $SP (Swissprot name like "hslv_ecoli") and $SP1 (first part of the Swissprot name like "hslv").
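
When a script file is written from the shell, name variables such as $ORGANISM5 must be protected from shell expansion. The sketch below uses a quoted here-document for that purpose. The file name `load_example.txt` is hypothetical, and using a variable as the entire name after the vertical bar is an assumption based on the description above.

```shell
#!/bin/sh
# The quoted 'EOF' delimiter stops the shell from expanding $ORGANISM5,
# so the variable reaches the script file literally.
cat > load_example.txt <<'EOF'
load UNIPROT:P49722!30-60
load PDB:1ryp_C|$ORGANISM5
EOF
```

With an unquoted delimiter the shell would replace $ORGANISM5 by the value of a shell variable of that name, which is usually empty.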

Nucleotide sequences are translated to amino acid sequences using the strand orientation and exon boundaries. This information is either contained in EMBL or Genbank formatted nucleotide files or is given with the command cds. This is demonstrated below and explained in Scripting language.

Alignment computation

In the above example, a precomputed multiple sequence alignment is directly defined with aa_sequence. Alternatively, the alignment can be computed with the align command:

align *

The wildcard "*" or ".*" means all sequences. Alternatively, a space-separated list of protein names, database IDs and regular expressions matching protein names is accepted. By default ClustalW (precompiled binary for Intel) and CE/CL (Java) are used. The 3D-alignment program TM-align (Fortran) is faster than CE/CL. You can install TM-align with the package manager of your computer. Under Debian:

      apt-get install clustalw tm-align

Alternatively, install a Fortran compiler. Then add the program option -a3d=tm_align.

There are a few alternatives to ClustalW, some of which produce more accurate results but require more time. They are expected in the /usr/bin/ directory, for example /usr/bin/t_coffee. They can also be downloaded and installed automatically. The unattended software installation from source code requires the software installation tools make and a C++ compiler.
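
Before running an alignment, it may be worth checking whether an alternative aligner is present where it is expected. The following is a minimal sketch; the function name `has_program` is hypothetical, and t_coffee is the only program path named in the text.

```shell
#!/bin/sh
# Hypothetical helper: succeed if an executable named $1 exists in
# directory $2 (default: /usr/bin, where the aligners are expected).
has_program() {
    [ -x "${2:-/usr/bin}/$1" ]
}

if has_program t_coffee; then
    echo "t_coffee found in /usr/bin"
else
    echo "t_coffee not found in /usr/bin"
fi
```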

BioDAS annotations

Annotations are loaded for all sequences (asterisk) or for a list of sequences with a command like

DAS_features CSA%20-%20extended uniprot cbs_total netphos netoglyc , *

and the GFF features from the Expasy server are loaded with

GFF_expasy_features *

The "%20" in the feature name is the percent-encoded form of the space character (hexadecimal code 20). After loading the data from the remote servers, the annotated sequence positions are underlined in the alignment. The DAS annotation providers are listed in the standard BioDAS registry file or in supplementary registry files given at the command line. Underlining these sequence annotations is time consuming. At least the identification of the UNIPROT identifier can be accelerated by a local BLAST database and a local Uniprot as described below.
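
Feature names that contain spaces can be encoded on the shell before they are written into a script. This is a small sketch; the function name `encode_spaces` is hypothetical, and it only replaces spaces, not other special characters.

```shell
#!/bin/sh
# Hypothetical helper: replace every space in the argument with %20.
encode_spaces() {
    printf '%s\n' "$1" | sed 's/ /%20/g'
}

encode_spaces "CSA - extended"   # prints CSA%20-%20extended
```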

Program features by examples

Loading / creating sequences

Sequences can be loaded by Uniprot-ID, PDB-ID or URL.
Alignment Script
Sequence names can be specified in the load command.
Alignment Script
The alignment can be directly specified by sequences with gaps.
Alignment Script
The alignment can be read from an alignment file. Here the PFAM entry ... is loaded
Alignment Script

3D

3D-views are automatically included in the HTML output for all sequences with 3D-coordinates. Additional 3D views can be defined with the command open_3D.

For sequences loaded directly from PDB files, a 3D-Java applet can be opened.
Alignment Script
A homologous PDB structure can be associated with a sequence. Amino acids without 3D-coordinates are written in lower case. Residue mismatches of the 3D-model are underlined with a check-box.
Alignment Script
The optimal (most similar) PDB entry can be detected automatically.
Alignment Script
3D-structures of homologous proteins can be superimposed.
Alignment Script
See the text label "HB-Beta-chain" in the 3D-view. 3D-labels require specification of the atom type; otherwise all atoms of the amino acid would be labeled.
Alignment Script
3D-views are distinguished by their name, here "View1" and "View2".
Alignment Script
3D-commands can be attached to sequence features loaded from DAS servers.
Alignment Script

Annotated residue selections

The amino acid positions can be specified with different numbering systems:
- Natural numbering 1,2,3, ... of amino acids
- PDB residue numbering
- Nucleotide numbers (if the amino acid sequence is translated from a nucleotide sequence)
Alignment Script
PDB insertion codes are capital letters occasionally attached to PDB residue numbers. In this example there are amino acids with the same PDB residue number 187. They are distinguished by different insertion codes.
Alignment Script
Residue selections can have atom selector lines which specify the atoms for all subsequent style commands.
Alignment Script

Nucleotide sequences

The CDS field in EMBL or Genbank files specifies the nucleotide reading orientation and exon boundaries. The sequence coding can also be specified in the script.
Alignment Script
The nucleotide sequence and the CDS expression can be provided in the script.
Alignment Script
Several splice variants of one nucleotide sequence can be displayed simultaneously.
Alignment Script
A positive or negative nucleotide index offset affects underlined nucleotide selections.
Alignment Script
Clipping terminal amino acids does not affect underlined nucleotide selections.
Alignment Script

Unless the amino acid sequence is explicitly provided, either with the command aa_sequence or in the field "/translation=" of a Genbank or EMBL formatted file, the amino acid sequence is predicted using the default genetic code. In rare cases the prediction will be wrong due to a different genetic code (stop codon instead of tryptophan) or mRNA editing.

Annotation services

Sequence features are a certain type of residue selection. In the HTML output the respective sequence positions are underlined with a color specific to the feature name. They can be shown and hidden with check-boxes. Sequence features are loaded from external services or created explicitly in the script file.

www.expasy.org provides annotations for Uniprot entries in General Feature Format (GFF).
Alignment Script
Uniprot IDs are required to load the information from annotation services. This example determines the IDs automatically and underlines the annotations.
Alignment Script
Underlining annotations from different DAS servers
Alignment Script
Removing underlinings like "Non_cytoplasmic_region".
Alignment Script
If a homologous 3D-structure exists, the selected residues can be highlighted in 3D.
Alignment Script
A sequence feature can be explicitly defined in the script file without web services as follows: a residue annotation is created with the command new_annotation, but instead of setting a color directly, a color is assigned to the name of the selection with feature_colors.
Alignment Script

Sequence groups

Sequence groups are named sets of sequences. Each sequence group has a button to select or deselect the respective sequences.

The sequences are grouped into alpha and beta hemoglobin chains.
Alignment Script

Generating all examples

If strap.jar is downloaded and the web proxy is written to the variable JavaProxy then all examples in this page can be generated. The program keeps data in $HOME/.StrapAlign and will therefore run much faster next time.

      for i in load_ID_URL load_set_name aa_sequence load_pfam \
               3d_loading_PDB 3d_infer_specific 3d_infer_auto 3d_superimpose 3d_label 3d_two_views 3d_das \
               anno_numbering anno_insertionCode anno_atomType \
               nuc_GI_CDS nuc_EMBL_CDS nuc_spliceVariantsHK nuc_offset nuc_cutNterm \
               DAS_expasy DAS_expasy_blast DAS_das DAS_delete_long DAS_own groups_HB ; do
          FILE=http://www.bioinformatics.org/strap/toHTML/scripts/$i.txt
          wget -N $FILE || curl -O $FILE
          java $JavaProxy -jar strap.jar -script=$i.txt -toHTML=$i.html || break
      done

Contact

christophgil at googlemail dot com