MSAReveal Help
- Collect amino acid sequences, e.g. from UniProt.Org.
Instructions are provided.
- Align sequences.
Instructions are provided
using free, straightforward, powerful Jalview. MSAReveal does not align sequences.
- Save the alignment in a file in FASTA format.
- Display the alignment, copy, and paste into MSAReveal.
- Press the button Process Sequences.
Unfamiliar with the one-letter amino acid codes? No problem! MSAReveal shows you the 3-letter abbreviation in a tooltip
whenever you touch a one-letter code in the color scheme options, or in the
sequence alignment listing. When you touch a one-letter code column header in the
statistics table, the full name of the amino acid is shown.
And here is a
handy reference chart.
We recommend downloading FASTA sequences from
UniProt.Org:
- At UniProt.Org, use the search slot at the top to describe a sequence.
Examples: "yeast gal4", "sulfurreducens pila", "human pla2g6".
- In the list of hits, click on the Entry code (in the left column of the table)
for the sequence you want. (We recommend viewing the entire entry to confirm this is
what you want.)
- Click on the blue Sequence button at the left side of the page.
For a single sequence:
- Click on the blue FASTA button.
- Open your browser's File menu, and click Save Page As.
- You may wish to rename the file to add the name of the protein or taxon. Keeping the
file type ".fasta" is a good idea.
For a group of sequences:
- Click on the blue button Add to basket.
- When you have added all the desired sequences to your basket,
scroll to the top of the page and click on the blue Basket button.
- In the box that opens, click on Download.
- Select Uncompressed and click Go.
- Select Save File and click OK.
You can now open your saved FASTA file (a plain text editor would be ideal, see below),
select all, copy, and paste into MSAReveal.
NOTE that your sequences are not yet aligned. See
How To Align Sequences.
FASTA files are plain text. You can edit them with a plain text editor, for example
to separate or gather sequences. A plain text editor is one that does not
"mark up" the text with formatting codes. On Windows, use Notepad. On a Mac, use the free
program
TextWrangler. If you use WordPad, Word, TextEdit, or
other "word processor" programs, it is often tricky to force the program to save as plain text.
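Because FASTA files are plain text, separating or gathering sequences can also be scripted instead of done by hand. The following is a minimal Python sketch and is not part of MSAReveal; the function names read_fasta and write_fasta are hypothetical, and the sketch assumes each record starts with a line beginning with ">".

```python
from typing import List, Tuple


def read_fasta(text: str) -> List[Tuple[str, str]]:
    """Split FASTA text into (header, sequence) pairs.

    The header is returned without its leading '>'. Sequence lines are
    joined into one string; blank lines are ignored.
    """
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def write_fasta(records: List[Tuple[str, str]], width: int = 60) -> str:
    """Format (header, sequence) pairs as FASTA text, wrapping sequence lines."""
    lines: List[str] = []
    for header, seq in records:
        lines.append(">" + header)
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"
```

To gather sequences from two files, read each file's text, concatenate the lists returned by read_fasta, and save the result of write_fasta as a new plain text file.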
We recommend the free program
Jalview
because it is straightforward, and preserves the full UniProt headers (including genus and
species).
Jalview requires that free
Java be installed on your computer. Alignments done in UniProt
suffer from FASTA headers that have only the UniProt Accession Number,
without the taxon (genus and species).
Instructions for Jalview:
- You will need files containing FASTA sequences that have been saved on your computer.
See
How To Download FASTA Sequences.
- Run Jalview.
- Drag a file containing one or more FASTA sequences and drop into Jalview.
A window should appear that displays the sequence(s) at the top.
- Drag additional files into the SAME window if you wish to add more sequences.
- At the top of the window containing your sequences, click on Web Service
and then click on Alignment.
- Choose an alignment algorithm (such as MAFFT, MUSCLE, or TCOFFEE) and click on
with defaults.
- A second window opens and the alignment is performed. If you have many or long sequences,
this might take a while.
- A third window titled "So and so alignment" opens when the alignment is completed.
- Open the File menu at the top left of the third window, and "Save As".
You may want to double-click on Desktop to save it there temporarily.
Use FASTA format, and name the file appropriately.
- Your saved alignment is now ready: open it (a plain text editor works well), select all,
copy, and paste into MSAReveal.
Options:
Options (preferences) are remembered automatically between sessions, unless you have disabled
"cookies" in your browser.
Sequences:
- There is no maximum sequence length or maximum number of sequences. Tests have included
human titin (34,350 amino acids) and an alignment with 99 sequences of length 345.
- Various
error conditions
are detected and reported.
- A number of sample sequence alignments (and one unaligned set) are provided. Press
the button "Show Demos & Tests" above the sequence input box.
Headers:
- UniProt headers work best but other header formats can be used.
- Header formats can be mixed in the same group of sequences.
- Genus and species will be tabulated when given in the header following "OS=" (UniProt format).
- UniProt 6- or 10-character Accession Codes are detected (regardless of the surrounding characters)
and tabulated with links to UniProt. UniParc Identifiers (beginning "UPI") are also used.
If none of these are found, UniProt Entry Names are looked for.
- The gene name is tabulated when given in the header following "GN=" (UniProt format).
- If a 4-character PDB Entry Code is added to a header following "PDB=", it will be tabulated in the Statistics
table and linked to display the 3D model in FirstGlance in Jmol.
Demo: 9: Pilins.
- When a description of an alignment is added to a header, it will be displayed
above the sequences table.
- When a description of an individual sequence is added to its header, it will be displayed
when the Taxon of that sequence is touched with the mouse.
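The header fields listed above can be illustrated with a short Python sketch. This is not MSAReveal's own code. The accession pattern follows the format published by UniProt. The UniParc and Entry Name patterns, the rule that an identifier must not be directly attached to other letters or digits, and the function name parse_header are assumptions made for this example.

```python
import re
from typing import Dict, List

# UniProt accession: 6 or 10 characters (pattern published by UniProt).
ACCESSION = re.compile(
    r"(?<![A-Za-z0-9])"
    r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
    r"(?![A-Za-z0-9])"
)
# Assumption: "UPI" followed by 10 hexadecimal characters.
UNIPARC = re.compile(r"(?<![A-Za-z0-9])UPI[0-9A-F]{10}(?![A-Za-z0-9])")
# Assumption: NAME_SPECIES, e.g. up to 10 + up to 5 characters.
ENTRY_NAME = re.compile(
    r"(?<![A-Za-z0-9])[A-Z0-9]{1,10}_[A-Z0-9]{1,5}(?![A-Za-z0-9_])"
)
# A field value ends at the next "XX=" or "XXX=" field, at ">>", or at the end.
FIELD_END = r"(?=\s+[A-Z]{2,3}=|\s*>>|$)"
ORGANISM = re.compile(r"OS=(.+?)" + FIELD_END)
GENE = re.compile(r"GN=(.+?)" + FIELD_END)


def parse_header(header: str) -> Dict[str, object]:
    """Extract identifiers, taxon (genus and species), and gene name."""
    ids: List[str] = []
    for pattern in (ACCESSION, UNIPARC):
        for match in pattern.finditer(header):
            if match.group(0) not in ids:
                ids.append(match.group(0))
    if not ids:
        # Fall back to an Entry Name only when no accession was found.
        match = ENTRY_NAME.search(header)
        if match:
            ids.append(match.group(0))
    organism = ORGANISM.search(header)
    gene = GENE.search(header)
    # Keep only the first two words of the organism: genus and species.
    taxon = " ".join(organism.group(1).split()[:2]) if organism else None
    return {"ids": ids, "taxon": taxon, "gene": gene.group(1) if gene else None}
```

More than one entry in the returned "ids" list corresponds to the "multiple accession numbers" condition that MSAReveal reports.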
Output:
- Sequences can be displayed in a single horizontally-scrolling table, or broken into
multiple tables ("wrapped") of specified length (default 100 amino acids each).
- Touching any amino acid reports its sequence number in a tooltip,
counting the first amino acid as
number one.
- The statistics table can be sorted by any column. Row numbers remain intact and can be
used to cross-reference between the sequences table and list of full headers. The table can
be "unsorted" by sorting on the row number column.
- A single color scheme for amino acids is provided in this version. Others can be added
by contacting emartz@microbio.umass.edu.
- The state of checkboxes (colors applied or not, output wrapped or not) and other
preferences are remembered between sessions and runs (using browser "cookies").
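The "wrapped" display can be pictured as cutting every aligned sequence at the same column positions. The sketch below is an illustration in Python, not MSAReveal's implementation; the function name wrap_alignment is hypothetical, and the reported start and end values are 1-based alignment columns (dashes included), which is an assumption of this example.

```python
from typing import List, Tuple


def wrap_alignment(
    sequences: List[str], width: int = 100
) -> List[Tuple[int, int, List[str]]]:
    """Cut aligned sequences into blocks of at most `width` columns.

    Returns a list of (first_column, last_column, rows) with 1-based columns.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    length = max((len(s) for s in sequences), default=0)
    blocks = []
    for start in range(0, length, width):
        end = min(start + width, length)
        blocks.append((start + 1, end, [s[start:end] for s in sequences]))
    return blocks
```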
Consensus:
A consensus is shown below the sequence alignment. Touching any position (column) in the
consensus reports the frequencies of amino acids and dashes in that column in a tooltip.
Here is the key to the characters in the consensus line:
- A (black upper case letter): 100% identical.
- A (gray upper case letter): all but one (when 4-9 sequences),
or >=90% (when 10 or more sequences).
- a (gray lower case letter): >50% (when 3 or more sequences).
- . (gray period, "dot"): "similar", >=90% in a single similarity group (therefore 100% in a single
similarity group if there are fewer than 10 sequences).
Similarity Groups:
- ILMV AC (hydrophobic, not aromatic)
- FYW (aromatic)
- NQ ST Y (polar, not charged)
- DEKR H (charged)
- GP (P is helix-breaking; turns frequently include one or both)
- Note that Y is included in both aromatic and polar, not charged.
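The consensus key and similarity groups above can be expressed as a small decision procedure. The following Python sketch is one reading of those rules and is not MSAReveal's code. It assumes that dashes count toward the number of sequences but are never shown as the consensus, that the letter rules take precedence over the "dot" rule, and that a column matching no rule is left blank. The names consensus_symbol and consensus_line are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

SIMILARITY_GROUPS = [
    set("ILMVAC"),  # hydrophobic, not aromatic
    set("FYW"),     # aromatic
    set("NQSTY"),   # polar, not charged
    set("DEKRH"),   # charged
    set("GP"),
]


def consensus_symbol(column: str) -> Tuple[str, str]:
    """Return (character, color) for one alignment column."""
    n = len(column)
    counts = Counter(c for c in column.upper() if c != "-")
    if not counts:
        return (" ", "none")
    top, k = counts.most_common(1)[0]
    if k == n:
        return (top, "black")                      # 100% identical
    if (4 <= n <= 9 and k == n - 1) or (n >= 10 and 10 * k >= 9 * n):
        return (top, "gray")                       # all but one, or >=90%
    if n >= 3 and 2 * k > n:
        return (top.lower(), "gray")               # >50%
    for group in SIMILARITY_GROUPS:
        in_group = sum(counts[c] for c in group)
        # With fewer than 10 sequences, >=90% can only be met by 100%.
        if 10 * in_group >= 9 * n:
            return (".", "gray")
    return (" ", "none")


def consensus_line(sequences: List[str]) -> str:
    """Build the consensus characters for equal-length aligned sequences."""
    return "".join(consensus_symbol("".join(col))[0] for col in zip(*sequences))
```

For example, a column "IILL" from four sequences has no majority residue, but all four residues fall in the hydrophobic group, so the sketch reports a dot.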
Statistics:
- The length of each sequence (exclusive of gaps/dashes) is given.
- The length of the sequences in the alignment, including gaps/dashes,
is given in the Consensus line below the aligned sequences.
- The number of identical residues, and percentage of identical residues, relative to the
first ("Reference") sequence, are given. For the percentage, the denominator is the length of the
sequence, regardless of whether the reference sequence is shorter.
- Counts and percentages of various residues and groups of residues. More amino acids
or groups can be added on request (emartz@microbio.umass.edu).
- Net charge near neutral pH.
- Number of gaps (groups of one or more consecutive dashes), dashes ("gapped" positions),
and dashes as percentage of the length (denominator includes dashes).
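Several of these statistics follow directly from the aligned strings. The Python sketch below illustrates the definitions given above and is not MSAReveal's code. It assumes that "length of the sequence" in the identity percentage means the length exclusive of dashes, and that a position counts as identical only when both sequences have the same residue there. The function name sequence_stats is hypothetical, and net charge is left out because the residues it counts are not specified here.

```python
import re
from typing import Dict


def sequence_stats(seq: str, reference: str) -> Dict[str, float]:
    """Compute length, identity to the reference, and gap statistics."""
    length = len(seq.replace("-", ""))           # exclusive of dashes
    dashes = seq.count("-")                      # "gapped" positions
    gaps = len(re.findall(r"-+", seq))           # runs of consecutive dashes
    identical = sum(
        1 for a, b in zip(seq, reference) if a == b and a != "-"
    )
    return {
        "length": length,
        "identical": identical,
        "percent_identical": 100.0 * identical / length if length else 0.0,
        "gaps": gaps,
        "dashes": dashes,
        # Denominator includes dashes, i.e. the aligned length.
        "percent_dashes": 100.0 * dashes / len(seq) if seq else 0.0,
    }
```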
The following conditions are detected and reported.
Each of these can be demonstrated with one of the
Demo tests provided.
- No header. Demo: Header Missing.
- Illegal characters not representing amino acids. Demo: Illegal Characters.
- Nucleic acid sequence instead of protein sequence. Demo: DNA/RNA.
- Legal but ambiguous amino acid characters BJOUXZ. Demo: 1: With Gaps, Ambiguous AA.
- A single sequence containing gaps (dashes), hence not an alignment. Demo: 1: With Gaps, Ambiguous AA.
- Alignment having sequences of different lengths. Demo: Mismatched Lengths.
- Header containing more than one distinct 6- or 10-character UniProt Accession Number. Demo: Multiple accession numbers.
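Most of the conditions above can be checked with simple set operations on the sequence characters. The following Python sketch is illustrative and not MSAReveal's code. How MSAReveal recognizes a nucleic acid sequence is not described here, so the rule used below (every character is one of ACGTUN) is an assumption, as are the message wording and the function name check_alignment. The multiple-accession check is omitted.

```python
from typing import List, Tuple

STANDARD = set("ACDEFGHIKLMNPQRSTVWY")   # the 20 standard amino acids
AMBIGUOUS = set("BJOUXZ")                # legal but ambiguous codes
NUCLEOTIDES = set("ACGTUN")              # assumption used for DNA/RNA detection


def check_alignment(records: List[Tuple[str, str]]) -> List[str]:
    """Return a list of problems found in (header, sequence) records."""
    problems: List[str] = []
    for i, (header, seq) in enumerate(records, start=1):
        letters = seq.upper().replace("-", "")
        if not header.strip():
            problems.append(f"sequence {i}: no header")
        illegal = sorted(set(letters) - STANDARD - AMBIGUOUS)
        if illegal:
            problems.append(
                f"sequence {i}: illegal characters: {''.join(illegal)}"
            )
        if letters and set(letters) <= NUCLEOTIDES:
            problems.append(f"sequence {i}: looks like a nucleic acid sequence")
        else:
            ambiguous = sorted(set(letters) & AMBIGUOUS)
            if ambiguous:
                problems.append(
                    f"sequence {i}: ambiguous amino acid codes: {''.join(ambiguous)}"
                )
    if len(records) == 1 and "-" in records[0][1]:
        problems.append("single sequence contains gaps, so it is not an alignment")
    if len({len(seq) for _, seq in records}) > 1:
        problems.append("sequences have different lengths")
    return problems
```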
When a sequence has an empirical 3D structure in the
Protein Data Bank,
you may add "PDB=xxxx" to the header, where xxxx is the
PDB accession code.
Such PDB codes will appear in a "3D" column in the Statistics table, linked to display
the corresponding structures in
FirstGlance in Jmol.
The addition must be before
>> or >>>.
Example: Demo "9: Pilins".
Group Descriptions:
If you add, for example, ">>> Aligned by MAFFT" to the end of a header,
this will be displayed
above the table of sequences, with a
light green background. Such a group description would normally be added to only
one header in a group of sequences. If several headers contain
">>>",
the descriptions will be concatenated. Example: Gal4 Demo.
Sequence Descriptions:
If you add, for example, ">> Mutant Y57W" to the end of a header, when you touch the
Taxon in this row with the mouse, this sequence description will be shown above
the table of sequences, with a
pink background. Example: Gal4 Demo.
">>>" and ">>" can be in either order, but both must be at the end
of the header.
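The rules for "PDB=", ">>", and ">>>" can be summarized in a short Python sketch. It is an illustration only, not MSAReveal's code. It assumes the header is passed without its leading ">" character, that a PDB code is any 4 letters or digits, and that group descriptions from several headers are joined with a space. The function names split_descriptions and group_description are hypothetical.

```python
import re
from typing import Dict, List, Optional

# Assumption: a PDB code is 4 letters or digits following "PDB=".
PDB_FIELD = re.compile(r"PDB=([A-Za-z0-9]{4})(?![A-Za-z0-9])")


def split_descriptions(header: str) -> Dict[str, Optional[str]]:
    """Separate a header from its '>>' and '>>>' descriptions."""
    # ">>>" is listed first so it is not mistaken for ">>" plus ">".
    parts = re.split(r"(>>>|>>)", header)
    main = parts[0].strip()
    group = None
    sequence = None
    for marker, text in zip(parts[1::2], parts[2::2]):
        if marker == ">>>":
            group = text.strip()
        else:
            sequence = text.strip()
    # "PDB=" must come before the markers, so search only the main part.
    pdb = PDB_FIELD.search(main)
    return {
        "header": main,
        "pdb": pdb.group(1) if pdb else None,
        "sequence_description": sequence,
        "group_description": group,
    }


def group_description(headers: List[str]) -> str:
    """Concatenate the '>>>' descriptions found in a group of headers."""
    found = (split_descriptions(h)["group_description"] for h in headers)
    return " ".join(text for text in found if text)
```

Because the split is driven by the markers themselves, ">> Mutant Y57W >>> Aligned by MAFFT" and ">>> Aligned by MAFFT >> Mutant Y57W" give the same result.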