Alignments in HTML from the command line
This page describes generation of alignment documents with commands from the UNIX shell.
Please note that there is an interactive web service with an API and that the graphical Java program Strap also provides HTML export.
Reproducing the example
Download:
BASE=http://www.bioinformatics.org/strap
FILE=$BASE/strap.jar
wget -N $FILE || curl -O $FILE
FILE=$BASE/scripts/toHTML1.txt
wget -N $FILE || curl -O $FILE
FILE=$BASE/toHTML/data/fly_temp.gif
wget -N $FILE || curl -O $FILE
FILE=$BASE/aa/alignment2html.jar
wget -N $FILE || curl -O $FILE
export JavaProxy=' -DproxyHost=proxy.institution.org -DproxyPort=8080 -Dhttp.nonProxyHosts="" '
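The download commands above repeat the same wget/curl fallback for each file. A minimal sketch of how this could be wrapped in a function is given here. The names fetch, fetch_all, FILES and DRY_RUN are illustrative assumptions and are not part of Strap; the URLs are the ones used above.

```shell
#!/bin/sh
BASE=http://www.bioinformatics.org/strap

# Relative paths taken from the download commands above.
FILES="strap.jar scripts/toHTML1.txt toHTML/data/fly_temp.gif aa/alignment2html.jar"

# fetch URL: with DRY_RUN=1 only print the URL, otherwise download it
# with wget and fall back to curl (same fallback as above).
# DRY_RUN is a hypothetical switch introduced for this sketch.
fetch() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$1"
    else
        wget -N "$1" || curl -O "$1"
    fi
}

# fetch_all: fetch every file in FILES, stopping at the first failure.
fetch_all() {
    for f in $FILES; do
        fetch "$BASE/$f" || return 1
    done
}
```

Calling DRY_RUN=1 fetch_all lists the four URLs without touching the network; calling fetch_all downloads them into the current directory.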
Test the proxy settings. You should see Google's HTML code.
java $JavaProxy -jar strap.jar -testWeb http://www.google.com
Create the HTML alignment in the figure:
java $JavaProxy -jar strap.jar -script=toHTML1.txt -toHTML=myOutput.html
The output file myOutput.html is ready to be displayed in a web browser.
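The conversion step can be wrapped in a small shell function so that the output name is derived from the script name. This is only a sketch: the function name to_html and the DRY_RUN switch are assumptions made for illustration, while the java options are the ones shown above.

```shell
#!/bin/sh
# to_html SCRIPT [OUTPUT]: run a Strap script and write the HTML alignment.
# If OUTPUT is omitted, the ".txt" suffix of SCRIPT is replaced by ".html".
# With DRY_RUN=1 (hypothetical switch) the command is printed, not executed.
to_html() {
    script=$1
    out=${2:-${script%.txt}.html}
    # $JavaProxy is deliberately unquoted so that its options are split.
    set -- java ${JavaProxy:-} -jar strap.jar "-script=$script" "-toHTML=$out"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

For example, DRY_RUN=1 to_html toHTML1.txt myOutput.html prints the command used above without starting Java.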
alignment2html.jar is faster than strap.jar
The program strap.jar is both a command line tool and a desktop application.
In contrast, alignment2html.jar, which is built from the same source, is lighter and faster because it does not include the GUI classes.
Furthermore, MS-Windows support and software installation at run-time are deactivated.
Another difference is that it uses local Blat for similarity search. Blat is much faster than Blast, with the
consequence that the databases need to be installed locally.
The manual is displayed with the option -help:
java -jar alignment2html.jar -help
The script file
The lines in the script file toHTML1.txt are sequentially executed.
Lines that start with a hash character are ignored. An alphabetic list of all commands is printed with the command line option -help=script. Also see Scripting language.
The following three lines specify the number of characters per line, the minimum conservation required for a residue position to be emphasized in bold face, and the residue color mode.
set_characters_per_line 24
set_conservation_threshold 70
set_color_mode chemical
Three sequences are created: Canus, Xenopus and Drosophila. Dashes denote alignment gaps. They would not be required if the alignment were computed with the command align *.
aa_sequence MVLSAADKGNVKAAWGKVGGHAAEYGAEALERMFLSFPTTKTYFP, Canus
aa_sequence -VLSAAERAQVKAAWGKI--QAGAHGAEALERMFLGFPTTKTYPF, Xenopus
aa_sequence MILSAAERAQIKAAWGKVG-NAGAHGAEALD--FLGYPTTKSYPY, Drosophila
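Because the rows of a precomputed alignment are typed by hand, it can be useful to verify that all of them have the same length, gaps included, before running Strap. The following sketch does this with awk. The function name check_rows is an assumption and the check itself is not a Strap feature.

```shell
#!/bin/sh
# check_rows FILE: print the length of every aa_sequence row in a script
# file and return a non-zero status if the rows differ in length.
check_rows() {
    awk '
        BEGIN { bad = 0; seen = 0 }
        $1 == "aa_sequence" {
            seq = $2
            sub(/,$/, "", seq)            # drop the comma before the name
            n = length(seq)
            print $NF ": " n              # last field is the sequence name
            if (seen == 0) { first = n; seen = 1 } else if (n != first) { bad = 1 }
        }
        END { exit bad }
    ' "$1"
}

# Demonstration with the three rows from the example script.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
aa_sequence MVLSAADKGNVKAAWGKVGGHAAEYGAEALERMFLSFPTTKTYFP, Canus
aa_sequence -VLSAAERAQVKAAWGKI--QAGAHGAEALERMFLGFPTTKTYPF, Xenopus
aa_sequence MILSAAERAQIKAAWGKVG-NAGAHGAEALD--FLGYPTTKSYPY, Drosophila
EOF
check_rows "$tmp" && echo "all rows have the same length"
rm -f "$tmp"
```

All three rows of the example are 45 characters long, so the check passes.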
If the cleaved initial methionine is assigned the index 1, the Xenopus sequence starts with amino acid number 2:
set_residue_index_offset 1, Xenopus
The protein image icons are shown in the alignment row header. For the first two icons the URL is given.
The last icon is loaded from a local file. Because this file is not accessible from other computers, the image data is embedded in the HTML file:
icon http://www.goldenweb.it/software/immagini/icone/animals/water_animals/Frog.gif, Xenopus
icon http://www.goldenweb.it/software/immagini/icone/animals/misc_animals/dog1.gif, Canus
icon fly_temp.gif, Drosophila
If the database accession ID is given, a blue asterisk after the sequence name acts as a hyperlink:
accession_id UNIPROT:P0A7B8 , Canus
Residue selections are created with the command new_selection.
Two display styles are supported: STYLE_BACKGROUND and STYLE_UNDERLINE.
Color, display style, balloon text and web links are specified with add_annotation or set_annotation.
new_selection 1-4, Canus/N-terminus
set_annotation Hyperrefs=http://en.wikipedia.org/wiki/N-terminus, Canus/N-terminus
set_annotation Style=STYLE_BACKGROUND, Canus/N-terminus
set_annotation Color=#00ffFF, Canus/N-terminus
add_annotation Balloon=Balloon text blablabla, Canus/N-terminus
The command set_annotation overrides any previous value, whereas add_annotation keeps already existing lines.
A description of all commands is obtained with the program parameter -help=script.
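When many selections need the same kind of annotation, the script lines can be generated from the shell. The sketch below prints the new_selection and set_annotation lines shown above for arbitrary arguments. The function name annotate_selection and its argument order are assumptions chosen for illustration.

```shell
#!/bin/sh
# annotate_selection RANGE PROTEIN NAME COLOR URL
# Prints script lines that create the selection PROTEIN/NAME and give it
# a background color and a web link, following the example above.
annotate_selection() {
    sel="$2/$3"
    printf 'new_selection %s, %s\n' "$1" "$sel"
    printf 'set_annotation Hyperrefs=%s, %s\n' "$5" "$sel"
    printf 'set_annotation Style=STYLE_BACKGROUND, %s\n' "$sel"
    printf 'set_annotation Color=%s, %s\n' "$4" "$sel"
}

# Reproduces the N-terminus selection of the example on standard output.
annotate_selection 1-4 Canus N-terminus '#00ffFF' \
    http://en.wikipedia.org/wiki/N-terminus
```

Redirecting the output with >> appends the lines to a script file.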
3D-Visualization
Java-applets (OpenAstex) for 3D visualization are included automatically if 3D-coordinates are provided.
Both views are linked: clicking an amino acid in the alignment or in the 3D-view highlights the corresponding residue in the other view.
If proteins are loaded from files in PDB format then 3D-coordinates are taken directly from that file.
Otherwise, a PDB model of an identical or at least homologous protein can be manually or automatically associated.
The command project_coordinates takes either a PDB identifier in the form PDB:1sbc or PDB:1sbc_A (chain A), the URL of a (compressed or uncompressed) protein file, or the keyword AUTO. Residue mismatches between the sequence and the 3D-model are optionally shown.
The following command will automatically identify homologous structures for all proteins (asterisk) using BLAST.
project_coordinates AUTO, *
3D-styles
Rendering styles of atoms in the 3D-visualization can be changed in two alternative ways:
- 3D-commands can be added to residue selections using the key "3D_view".
add_annotation 3D_view=3D_spheres, Canus/N-terminus
All annotations with the key 3D_view are sequentially executed.
- 3D-views can be created for one or more proteins. These views can be referenced by their name.
First the view and specific peptide chains of the view can be activated.
Next the amino acids and atoms can be selected and finally a 3D-style can be applied.
open_3D NameOfView , List of proteins
If the 3D view contains more than one peptide chain, one of them can be specified with select_3D.
select_3D NameOfView , one_protein
All 3D commands have the prefix "3D_" and often resemble Rasmol and Jmol.
The command 3D_select selects amino acids or atoms. It should not be mixed up with the command select_3D.
3D_select 20-100
Optionally, atom types such as carbon alpha and carbon beta can be appended.
3D_select 20-100.CA.CB
The style of these atoms can be changed with commands starting with the prefix "3D_".
3D_spheres on
These 3D-commands are independent of the 3D software, currently OpenAstex. They will remain valid even if another 3D viewer is supported in the future.
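The second way can be summarized as a short command sequence: open a view, select residues and atoms, apply a style. The sketch below prints such a sequence from the shell. The function name spheres_view, the view name View1 and the protein name my_protein are hypothetical; the protein must have 3D-coordinates for the view to be created.

```shell
#!/bin/sh
# spheres_view VIEW PROTEIN RANGE [ATOMS]
# Prints script lines that open a named 3D view, select a residue range
# (optionally restricted to atom types such as CA.CB) and show spheres.
spheres_view() {
    printf 'open_3D %s , %s\n' "$1" "$2"
    printf '3D_select %s%s\n' "$3" "${4:+.$4}"
    printf '3D_spheres on\n'
}

# View1 and my_protein are placeholder names for this illustration.
spheres_view View1 my_protein 20-100 CA.CB
```

If the view contains more than one peptide chain, a select_3D line as described above would be needed before 3D_select.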
Specifying sequences
In the example, sequences are defined with the command aa_sequence, which accepts amino acid sequences with or without gaps.
Alternatively, a local or remote file or database entry can be loaded.
Example with database references:
load UNIPROT:P49722 UNIPROT:P0A272 PDB:1ryp_C
Example with URLs:
load http://www.bioinformatics.org/strap/dataFiles/hs_HelicobacterPylori.swiss http://www.bioinformatics.org/strap/dataFiles/hs_SalmonellaTyphi.swiss
A subsequence rather than the entire amino acid sequence can be requested. The residue index interval is appended after an exclamation mark.
Either of the two interval boundaries can be omitted. Example:
load UNIPROT:P49722!30-60
Optionally, a protein name can be given after a vertical bar. Example:
load PDB:1ryp_C|My_name
The name can contain the following variables: $ORGANISM, $ORGANISM_SCIENTIFIC, $ORGANISM5 (e.g. "DroMe" for Drosophila melanogaster), $NAME (the original name), $PDB, $SP (Swissprot name like "hslv_ecoli") and $SP1 (first part of the Swissprot name like "hslv").
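The reference syntax described above (identifier, optional residue interval after "!", optional name after "|") can be assembled with a small helper. This is a sketch under assumptions: the function name load_ref is invented here, and combining an interval and a name in the same reference is not shown in the examples above, so that combination should be treated as untested.

```shell
#!/bin/sh
# load_ref ID [RANGE] [NAME]: print one reference for the load command.
# RANGE is appended after "!" and NAME after "|"; empty values are skipped.
# Assumption: using RANGE and NAME together is not confirmed by the text.
load_ref() {
    ref=$1
    # '!' is kept in single quotes to avoid history expansion in
    # interactive bash sessions.
    [ -n "${2:-}" ] && ref=$ref'!'$2
    [ -n "${3:-}" ] && ref=$ref'|'$3
    printf '%s\n' "$ref"
}

# The two documented forms:
echo "load $(load_ref UNIPROT:P49722 30-60)"
echo "load $(load_ref PDB:1ryp_C '' My_name)"
```

Note that a name containing one of the variables, such as $SP1, must be passed in single quotes so that the shell does not expand it.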
Nucleotide sequences are translated to amino acid sequences using the strand orientation and exon boundaries.
This information is either contained in EMBL or Genbank formatted nucleotide files or is given with the command cds.
This is demonstrated below and explained in Scripting language.
Alignment computation
In the above example, a precomputed multiple sequence alignment is directly defined with aa_sequence.
Alternatively, the alignment can be computed with the align command:
align *
The wildcard "*" or ".*" means all sequences. Alternatively, a space separated list of protein names, database IDs and regular expressions matching protein names will be accepted.
By default ClustalW (precompiled binary for Intel) and CE/CL (Java) will be used.
The 3D-alignment program TM-align (Fortran) is faster than CE/CL.
You can install TM-align with the package manager of your computer.
Under Debian:
apt-get install clustalw tm-align
Alternatively, install a Fortran compiler.
Then add the program option -a3d=tm_align.
There are a few alternatives to ClustalW, some of which produce more accurate results but require more time. They are expected in the /usr/bin/ directory, for example /usr/bin/t_coffee. They can also be downloaded and installed automatically. The unattended software installation from source code requires the build tools make and a C++ compiler.
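Before starting a long run it can help to check which external alignment programs are already installed. The following sketch looks for them on the PATH and in /usr/bin. The function name have_program is an assumption, and so is the executable name TMalign for the TM-align package; adjust it to whatever name your installation provides.

```shell
#!/bin/sh
# have_program NAME: succeed if NAME is found on the PATH or in /usr/bin.
have_program() {
    command -v "$1" >/dev/null 2>&1 || [ -x "/usr/bin/$1" ]
}

# clustalw and t_coffee are named in the text; TMalign is an assumed
# executable name for TM-align.
for prog in clustalw t_coffee TMalign; do
    if have_program "$prog"; then
        echo "$prog: found"
    else
        echo "$prog: missing"
    fi
done
```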
BioDAS annotations
Annotations are loaded for all sequences (Asterisk) or for a list of sequences with a command like
DAS_features CSA%20-%20extended uniprot cbs_total netphos netoglyc , *
and the GFF features from the Expasy server are loaded with GFF_expasy_features *
The "%20" in the feature name is the hexadecimal character code for white space.
After loading the data from the remote servers, the sequence positions are underlined in the alignment.
The DAS-annotation providers are listed in the standard BioDAS registry file
or in supplementary registry files given at the command line.
Underlining these sequence annotations is time consuming. At least the identification of the UNIPROT identifier can be accelerated with a local BLAST database and a local Uniprot copy as described below.
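Feature names that contain white space have to be written with "%20", as in the DAS_features example above. A minimal sketch of a helper that performs this replacement is shown here; the function name encode_spaces is an assumption, and only spaces are encoded, no other special characters.

```shell
#!/bin/sh
# encode_spaces TEXT: replace every space in TEXT with %20.
encode_spaces() {
    printf '%s\n' "$1" | sed 's/ /%20/g'
}

# Builds the first feature name of the DAS_features example.
echo "DAS_features $(encode_spaces 'CSA - extended') uniprot , *"
```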
Program features by examples
Loading / creating sequences
- Sequences can be loaded by Uniprot-ID, PDB-ID or URL.
- Sequence names can be specified in the load command.
- The alignment can be directly specified by sequences with gaps.
- The alignment can be read from an alignment file. Here the PFAM entry ... is loaded
3D
3D-views are automatically included in the HTML output for all sequences with 3D-coordinates.
Additional 3D views can be defined with the command open_3D.
- For sequences loaded directly from PDB files, a 3D-Java applet can be opened.
- A homologous PDB structure can be associated with a sequence.
Amino acids without 3D-coordinates are written in lower case. Residue mismatches of the 3D-model are underlined with a check-box.
- The optimal (most similar) PDB entry can be detected automatically.
- 3D-structures of homologous proteins can be superposed.
- See the text label in the 3D-view "HB-Beta-chain". 3D-Labels require specification of the atom type to avoid that all atoms of the amino acid are labeled.
- 3D-views are distinguished by their name, here "View1" and "View2".
- 3D-commands can be attached to sequence features loaded from DAS servers.
Annotated residue selections
- The amino acid positions can be specified with different numbering systems:
- Natural numbering 1,2,3, ... of amino acids
- PDB residue numbering
- Nucleotide numbers (if the amino acid sequence is translated from a nucleotide sequence).
- PDB insertion codes are capital letters occasionally attached to PDB residue numbers.
In this example there are amino acids with the same PDB residue number 187. They are distinguished by different insertion codes.
- Residue selections can have atom selector lines which specify the atoms for all subsequent style commands.
Nucleotide sequences
- The CDS field in EMBL or Genbank files specifies the nucleotide reading orientation and exon boundaries.
The sequence coding can also be specified in the script.
- The nucleotide sequence and the CDS expression can be provided in the script.
- Several splice variants of one nucleotide sequence can be displayed simultaneously.
- A positive or negative nucleotide index offset affects underlined nucleotide selections.
- Clipping terminal amino acids does not affect underlined nucleotide selections.
Unless the amino acid sequence is explicitly provided, either with the command aa_sequence or in the field "/translation=" of a Genbank or EMBL formatted file, it is predicted using the default genetic code. In rare cases the prediction will be wrong due to a different genetic code (stop codon instead of tryptophan) or mRNA editing.
Annotation services
Sequence features are a certain type of residue selections.
In the HTML output the respective sequence positions are underlined with a color specific for the feature name.
They can be shown and hidden with check-boxes.
Sequence features are loaded from external services or created explicitly in the script file.
- www.expasy.org provides annotations for Uniprot entries in General Feature Format (GFF).
- Uniprot IDs are required to load the information from annotation services. Determining IDs automatically and underlining annotations.
- Underlining annotations from different DAS servers
- Removing underlinings like "Non_cytoplasmic_region".
- If a homologous 3D-structure exists then the selected residues can be highlighted in 3D.
- A sequence feature can be explicitly defined in the script file without web services as follows:
(I) A residue annotation is created with the command new_annotation. But instead of setting a color directly,
a color is assigned to the name of the selection with feature_colors.
Sequence groups
Sequence groups are named sets of sequences. Each sequence group has a button to select or deselect the respective sequences.
- The sequences are grouped into alpha and beta hemoglobin chains.
Generating all examples
If strap.jar is downloaded and the web proxy is written to the variable JavaProxy then all examples on this page can be generated.
The program keeps data in $HOME/.StrapAlign and will therefore run much faster next time.
for i in ; do
FILE=http://www.bioinformatics.org/strap/toHTML/scripts/$i.txt
wget -N $FILE || curl -O $FILE
java $JavaProxy -jar strap.jar -script=$i.txt -toHTML=$i.html || break
done
Contact
christophgil (at) googlemail.com