7.-...
|
[<extra_attributes...>] |
One or more space-delimited optional
attributes for this read may follow. Their order is not enforced, and
each must start with a letter code followed by the ':' character
and then by attribute-specific text data (spaces not allowed).
General format is: <attr_code>:<attr_data>
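As an illustration of the general format, here is a minimal Python sketch that splits the attribute tokens of a read line into code/data pairs. It assumes that the first six whitespace-delimited fields of a read line are fixed (as the examples in this specification suggest) and that every later token is an attribute. The function name parse_attributes and the C:3 value are made up for this example.

```python
def parse_attributes(tokens):
    """Split '<attr_code>:<attr_data>' tokens into a dict keyed by attribute code."""
    attrs = {}
    for tok in tokens:
        # Split on the first ':' only, so the data part may itself contain ':'.
        code, sep, data = tok.partition(":")
        if not sep or not code[:1].isalpha():
            raise ValueError(f"malformed attribute: {tok!r}")
        attrs[code] = data
    return attrs


# Assumption: fields 1-6 are fixed and attributes start at field 7.
line = "MRNA244 + 953 270 30 20 G:500-800,1100-1500 C:3"
fields = line.split()
attrs = parse_attributes(fields[6:])
print(attrs)
```

Because the order of attributes is not enforced, a dictionary keyed by the letter code is a natural container. A parser that has to keep repeated codes would need a list of pairs instead.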
Attributes recognized by the current specification:
C:<grp#> |
Group color for the current read. This instructs the viewer to
draw this read with a specific color uniquely associated with the given <grp#>
|
L:<other_readName,..> |
[clone] link to another
read in the layout. This instructs the viewer to place the other read
on the same vertical line in the layout display (if possible), with
perhaps a dotted line connecting such reads; a comma delimited list of
read names can be given if such links extend to more than one other
read. Only one read in such a linked list needs an L:
entry in order to declare the linked list/group of reads (that is, the
other linked reads do not need to reciprocate with a
corresponding, but redundant, L: attribute).
|
S:<sequence> |
The nucleotide sequence
of the read exactly as it is included in the alignment. It must include
the clipped ends and the small gaps (indels) introduced by the MSA
(represented as '-' or '*' in the <sequence>) -- so the length of such <sequence> string must be equal to <read_length> |
G:<seg1_end[c<seg1clipright>][s|S]-
seg2_start[c<seg2clipleft>][s|S],...> |
Segmented alignment
(e.g. EST to genome). This is an indication that the read contains
large internal gaps -- which should be displayed as segments
connected by lines. The data for this attribute consists of a comma
delimited list of coordinate pairs for the inter-segment gaps.
Coordinates in a pair are separated by the '-' character. For each pair
the first coordinate is the end position in the layout of the previous segment, while the second coordinate is the start position of the next segment in the layout.
Example: say we have a "read" (e.g. a mRNA) called "MRNA244" which
aligns onto the "contig" (e.g. genomic sequence) of length 34000 as 3
distinct segments (e.g. 3 exons), aligned at genomic coordinates: 300
to 500 (first segment), 800 to 1100 (second segment) and 1500 to 1900
(third segment) respectively. Assuming there is a 30nt clipping at the
left end and 20nt clipping at the right end, and that the alignment has
a "forward" orientation, the contig and the sequence line for MRNA244
in the layout file would look like this (let's assume there are 281
"reads" total in this imagined layout):
>contig1 281 1 34000
MRNA244 + 953 270 30 20 G:500-800,1100-1500
The actual length of the "read" accounts for the length of each segment
(201, 301 and 401 respectively) plus the clipping lengths at each end
(30 and 20), so the total is 201+301+401+30+20=953.
The left coordinate of the sequence in the alignment (270) is equal to
the position of the first (leftmost) segment (300) minus the left-end
clipping (30).
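The arithmetic of the MRNA244 example can be checked with a short Python sketch. It handles only the simple form of the G: attribute, without per-segment clipping or splice flags. The helper name segments_from_gaps is made up for this illustration, and the first segment start and last segment end are passed in directly from the example.

```python
def segments_from_gaps(first_start, last_end, gaps):
    """Turn a simple G: gap list into a list of (start, end) layout segments."""
    pairs = [tuple(int(x) for x in gap.split("-")) for gap in gaps.split(",")]
    # Each pair is (end of previous segment, start of next segment).
    starts = [first_start] + [nxt for _, nxt in pairs]
    ends = [prev for prev, _ in pairs] + [last_end]
    return list(zip(starts, ends))


clip_left, clip_right = 30, 20
segs = segments_from_gaps(300, 1900, "500-800,1100-1500")
seg_lengths = [end - start + 1 for start, end in segs]
read_length = sum(seg_lengths) + clip_left + clip_right
left_coord = segs[0][0] - clip_left

print(segs)         # [(300, 500), (800, 1100), (1500, 1900)]
print(seg_lengths)  # [201, 301, 401]
print(read_length)  # 953
print(left_coord)   # 270
```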
Each segment can also have its own clipping. This is
specified for a segment's end by appending the character c
followed by the amount of clipping at that end. If in the example
segment alignment above we had the 1st exon clipped 10 nucleotides at
the right end, the 2nd exon clipped 5nt at the left end and 7nt at the
right end, with the 3rd exon having 9nt clipped at the left end, the
above read line may look like this:
MRNA244 + 953 270 30 20 G:490c10-805c5,1093c7-1509c9
Please note that in this last example the actual coordinates of the
alignment of the 3 segments (exons) to the genomic sequence are
300-490, 805-1093 and 1509-1900 respectively. The way the clipping is
specified in this G: attribute differs from the way the leftmost and
rightmost clipping of the whole read is given. The difference is that
the c clipping lengths in the G: attribute lie OUTSIDE the coordinates given
for the segment ends in the same G: attribute, while the global leftmost
clipping (30 in the example above) is included in the offset coordinate
for the whole read (270 here).
For EST to genome alignments, an optional 'S' (or 's') character may
follow the inter-segment ends, indicating that a splice consensus
(major or minor, respectively) was found on that side of the intron
corresponding to that inter-segment gap.
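The full G: syntax, including the optional c clipping and the optional splice flag, could be parsed along the following lines. This is a sketch under the grammar given above, not a reference parser: the function name and the dictionary keys are invented for the example, and the splice flag is stored as-is without interpretation.

```python
import re

# One side of a gap pair: coordinate, optional c<clip>, optional splice flag.
_SIDE = r"(\d+)(?:c(\d+))?([sS]?)"
_PAIR = re.compile(f"{_SIDE}-{_SIDE}")


def parse_g_attribute(data):
    """Parse the data part of a G: attribute into a list of gap records."""
    gaps = []
    for item in data.split(","):
        m = _PAIR.fullmatch(item)
        if m is None:
            raise ValueError(f"malformed G: pair: {item!r}")
        end, end_clip, end_flag, start, start_clip, start_flag = m.groups()
        gaps.append({
            "prev_end": int(end),
            "prev_clip_right": int(end_clip or 0),
            "prev_splice": end_flag,       # '', 's' or 'S'
            "next_start": int(start),
            "next_clip_left": int(start_clip or 0),
            "next_splice": start_flag,     # '', 's' or 'S'
        })
    return gaps


gaps = parse_g_attribute("490c10-805c5,1093c7-1509c9")
inner_clip = sum(g["prev_clip_right"] + g["next_clip_left"] for g in gaps)
# Aligned parts are 300-490, 805-1093 and 1509-1900, as in the example.
aligned = (490 - 300 + 1) + (1093 - 805 + 1) + (1900 - 1509 + 1)
print(aligned + inner_clip + 30 + 20)  # 953
```

The final sum shows that the read length stays at 953 in the clipped example: the per-segment clipping only moves bases from the aligned part of each segment to its clipped part.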
|
D:<seq_diffs..>
or for segmented (G:) reads:
D:<seg1_diffs..>/<seg2_diffs..>
|
If the contig sequence is given, this read attribute is the way to provide the display application with only a list of point-differences between this read's sequence and the contig sequence (so S: is not needed). The <seq_diffs> is a concatenation of elements of this format:
<incremental_coordinate><character>
..where <incremental_coordinate> is the numeric position of such a difference relative to
the previous difference -- or if no such previous difference exists, relative to
the first (leftmost) non-clipped nucleotide of the read. This incremental
coordinate (which is always at least 1) must be followed by a <character>
code. This character can be either an actual DNA base letter ('A', 'G', 'C', 'T', 'N', etc.), which
indicates a nucleotide mismatch at that position, or the "dash"
character ('-'), indicating a gap in the alignment of this read to the
contig sequence.
Example: suppose the contig sequence and the read sequence are aligned as follows:
contig: ..A G T T G C T - C C T A - C T A C A G A C C N G...
read:   ..A G T - C C T T C C A A N C T - - A T A C C A G...
(increments:    4 1     3     3   2     3 1   2       4 ... )
Assuming that the above alignment starts at position 200 in the contig
and that the read called RDAAA of length 620 has 20 bp clipping right
before this alignment (so the read left end coordinate in the layout is
181), the following line description would apply to this read in the
layout file (the ending ellipsis ... is not part of the actual text but
just a placeholder for possible other differences to report):
RDAAA + 620 181 20 0 D:4-1C3T3A2N3-1-2T4A...
Note that in this compact MSA representation there is no information
provided about the nucleotide content of the clipped ends of the read.
The viewer application may choose to represent such clipped regions as
empty or gray boxes (rectangles), with the actual nucleotides only
displayed in the non-clipped regions.
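The incremental encoding of the D: attribute can be illustrated with a small Python sketch that builds the difference string from the two aligned sequences of the RDAAA example and decodes it back into absolute alignment columns. The function names are made up for this illustration. Positions are counted in alignment columns, starting at 1 for the first non-clipped nucleotide, which is how the example above reads.

```python
import re


def encode_diffs(contig_aln, read_aln):
    """Build a D: string from two gapped sequences of equal length."""
    parts, last = [], 0
    for col, (c, r) in enumerate(zip(contig_aln, read_aln), start=1):
        if c != r:
            # Increment is the distance from the previous difference.
            parts.append(f"{col - last}{r}")
            last = col
    return "".join(parts)


def decode_diffs(data):
    """Expand a D: string into (alignment_column, character) pairs."""
    pos, out = 0, []
    for inc, ch in re.findall(r"(\d+)(\D)", data):
        pos += int(inc)
        out.append((pos, ch))
    return out


contig = "AGTTGCT-CCTA-CTACAGACCNG"
read = "AGT-CCTTCCAANCT--ATACCAG"
diffs = encode_diffs(contig, read)
print(diffs)                # 4-1C3T3A2N3-1-2T4A
print(decode_diffs(diffs))  # columns 4, 5, 8, 11, 13, 16, 17, 19, 23
```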
For segmented alignments (i.e. those reads having a G:
attribute), multiple such lists should be given (one for each segment),
separated by the '/' (slash) character. For each such
segment-differences list, the first incremental coordinate will be the
distance from the beginning of the first (leftmost) non-clipped
nucleotide of that segment.
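For segmented reads the same decoding would be applied to each '/'-separated list, restarting the running position at every segment. The sketch below uses a made-up D: value, not one taken from a real layout file, and it assumes that a segment without differences is written as an empty list.

```python
import re


def decode_segment_diffs(data):
    """Decode a segmented D: value into one list of (position, char) per segment."""
    segments = []
    for seg in data.split("/"):
        pos, diffs = 0, []  # the position restarts at each segment
        for inc, ch in re.findall(r"(\d+)(\D)", seg):
            pos += int(inc)
            diffs.append((pos, ch))
        segments.append(diffs)
    return segments


# Hypothetical three-segment value; the last segment has no differences.
print(decode_segment_diffs("12A30-/5T/"))
# [[(12, 'A'), (42, '-')], [(5, 'T')], []]
```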
|
I:<seq_indels..>
or for segmented (G:) reads:
I:<seg1_indels..>/<seg2_indels..>
|
Similar to the D: attribute, except that the original
sequence for the read is assumed known from other sources (e.g. an
indexed multi-FASTA file) and only gaps and deletions are reported as
the operations needed to make that sequence fit into the current MSA.
The coordinate system is now entirely based on the original, raw
read sequence, but with the same adjustment of the start coordinate
based on the left clipping (i.e. all coordinates are relative to the
first (leftmost) base in the read that is not clipped but actually used
in the MSA).
The <seq_indels> is a concatenation of elements of this format:
<incremental_coordinate><indel_char>
..where <indel_char> can
be either '-' (gap) or 'd' (deletion) at the specific base
position in the original read sequence. The actual base position can be
obtained by this iterative formula:
<base_position> = <incremental_coordinate> + <prev_base_position>
..where <prev_base_position> = <clip_left> for the first iteration (first element of <seq_indels>)
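The iterative formula above can be written as a short Python sketch. The function name and the sample I: value are invented for illustration. The only rule taken from the text is that the running position starts at <clip_left> and grows by each incremental coordinate.

```python
import re


def indel_positions(data, clip_left):
    """Return (base_position, indel_char) pairs for an I: value."""
    pos = clip_left  # <prev_base_position> for the first element
    out = []
    for inc, ch in re.findall(r"(\d+)([-d])", data):
        pos += int(inc)  # <base_position> = <incremental_coordinate> + previous
        out.append((pos, ch))
    return out


# Hypothetical I: value for a read with 20 clipped bases at the left end.
print(indel_positions("4-12d3-", 20))
# [(24, '-'), (36, 'd'), (39, '-')]
```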
|
R:<gaplist>/<contig_gaplist>
|
A special attribute for
the Assembly-on-Reference procedure, providing all gap information for
the alignment of this read to the parent contig. The gapping
information (as produced by mgblast with -D5 option) is stored directly
in this attribute data: the gaps in the read are in the first list <gaplist> and the gaps in the contig (reference) sequence are stored in <contig_gaplist>. The two lists are separated by the '/' (slash) character. Just like
mgblast's -D5 output, the gap list has the format:
<gap1pos>[+<gap1length>],<gap2pos>[+<gap2length>],...
The nrcl
(non-redundification clustering) program automatically writes this
attribute in the layout file produced when the -y option is given (if
the gap information is available in the parsed mgblast hits).
The mblaor (mgblast assembly-on-reference) program requires this
attribute to be present when parsing an input layout file in order to
generate a full MSA, transforming this info into indel operations
applied to the read.
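A reader for the R: attribute might look like the following Python sketch. It relies only on the gap list format given above and makes two assumptions that the text does not state: an omitted +<length> means a gap of length 1, and either list may be empty. The function names and the sample value are made up.

```python
def parse_gaplist(text):
    """Parse '<pos>[+<length>],...' into a list of (position, length) tuples."""
    gaps = []
    for item in filter(None, text.split(",")):
        pos, _, length = item.partition("+")
        # Assumption: a missing length means a single-base gap.
        gaps.append((int(pos), int(length) if length else 1))
    return gaps


def parse_r_attribute(data):
    """Split R: data into (read_gaps, contig_gaps)."""
    read_part, _, contig_part = data.partition("/")
    return parse_gaplist(read_part), parse_gaplist(contig_part)


# Hypothetical value: two gaps in the read, one gap in the contig.
read_gaps, contig_gaps = parse_r_attribute("15,102+3/48+2")
print(read_gaps)    # [(15, 1), (102, 3)]
print(contig_gaps)  # [(48, 2)]
```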
|
|