Layout file format specification (proposal)


Besides the usual ACE files, clview also recognizes a proprietary "layout" format (stored in a file usually having the extension .lyt) for representing multiple sequence alignment (MSA) layouts -- where typically smaller "component" sequences (henceforth called "reads") are aligned to (or make up) a larger sequence (henceforth called "the contig").

These layout files are text files in a pseudo-FASTA format -- each FASTA record represents one contig's "layout", like this:

><contigName> <number_of_reads> <contig_start_coord> <contig_end_coord> [<sequence>]
<readName> <orientation> <read_length> <read_start_coord> <clip_left> <clip_right> [ <extra_attributes...>]
.
.
All the fields on every line are whitespace-delimited (tabs or plain spaces). Therefore no whitespace is allowed within a field (so contig and read names cannot contain spaces).
Each FASTA-like record in such a multi-layout file represents one layout definition (a multiple alignment space). Every such layout definition must start with a line beginning with the '>' character and containing some general contig/layout data (contig name, number of component reads, the start/end coordinates for the layout space and, optionally, the actual contig sequence, if any). This first contig/layout general info line must be followed by exactly <number_of_reads> lines containing component/read information (one line per read). For each read line, the fields are as follows:
1. <readName> a sequence identifier, unique within the current contig/layout
2. <orientation> one character: '+' or '-', representing the forward or reverse orientation of the read in the current layout
3. <read_length> the actual length of the read (including the clipped ends). If segmented (see the G: option of <extra_attributes>), the intra-segment gaps are not considered as part of read length in the layout.
4. <read_start_coord> the leftmost (lowest) coordinate of this read in the current layout. The position could be "virtual" if the read is clipped at that left end. The orientation of the read does not matter for this assessment.
5. <clip_left> the number of nucleotides trimmed at the left end. (Orientation doesn't matter)
6. <clip_right> the number of nucleotides trimmed at the right end of this read.
7.-... [<extra_attributes...>] One or more space-delimited optional attributes for this read may follow. Their order is not enforced and they should all start with a letter code followed by the ':' character and then followed by attribute-specific text data (spaces not allowed). General format is: <attr_code>:<attr_data>
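The record structure described above can be sketched as a small parser. This is only an illustration, not clview's own code: the names Read, Layout, parse_read_line and parse_layouts are hypothetical, and the sketch assumes that blank lines may be skipped and that an attribute code is whatever precedes the first ':' in a token.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Read:
    name: str
    orientation: str  # '+' or '-'
    length: int       # includes the clipped ends
    start: int        # leftmost layout coordinate (may be "virtual")
    clip_left: int
    clip_right: int
    attrs: Dict[str, str] = field(default_factory=dict)  # attr_code -> attr_data


@dataclass
class Layout:
    name: str
    num_reads: int
    start: int
    end: int
    sequence: Optional[str] = None
    reads: List[Read] = field(default_factory=list)


def parse_read_line(line: str) -> Read:
    f = line.split()  # splits on tabs and plain spaces alike
    if len(f) < 6:
        raise ValueError(f"read line needs at least 6 fields: {line!r}")
    if f[1] not in ("+", "-"):
        raise ValueError(f"bad orientation {f[1]!r} for read {f[0]!r}")
    attrs = {}
    for token in f[6:]:
        code, sep, data = token.partition(":")
        if not sep or not code:
            raise ValueError(f"malformed attribute {token!r}")
        attrs[code] = data
    return Read(f[0], f[1], int(f[2]), int(f[3]), int(f[4]), int(f[5]), attrs)


def parse_layouts(text: str) -> List[Layout]:
    layouts: List[Layout] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        if line.startswith(">"):
            f = line[1:].split()
            if len(f) < 4:
                raise ValueError(f"contig line needs at least 4 fields: {line!r}")
            seq = f[4] if len(f) > 4 else None
            layouts.append(Layout(f[0], int(f[1]), int(f[2]), int(f[3]), seq))
        elif not layouts:
            raise ValueError("read line found before any '>' contig line")
        else:
            layouts[-1].reads.append(parse_read_line(line))
    # Each contig line must be followed by exactly <number_of_reads> read lines.
    for lay in layouts:
        if len(lay.reads) != lay.num_reads:
            raise ValueError(
                f"{lay.name}: expected {lay.num_reads} reads, found {len(lay.reads)}"
            )
    return layouts
```

Keeping the extra attributes as an unordered code-to-data mapping mirrors the rule that their order is not enforced; interpreting each attribute's data is left to separate code.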

Attributes recognized by the current specification:
C:<grp#> group color for the current read. This instructs the viewer to draw this read with a specific color uniquely associated to the given <grp#>
L:<other_readName,..> [clone] link to another read in the layout. This instructs the viewer to place the other read on the same vertical line in the layout display (if possible), perhaps with a dotted line connecting such reads; a comma-delimited list of read names can be given if such links extend to more than one other read. Only one read in such a linked list needs to have an L: entry in order to declare the linked list/group of reads (that is, the other linked reads do not need to reciprocate with the corresponding, but redundant, L: attribute).
S:<sequence> The nucleotide sequence of the read exactly as it is included in the alignment. It must include the clipped ends and the small gaps (indels) introduced by the MSA (represented as '-' or '*' in the <sequence>) -- so the length of such <sequence> string must be equal to <read_length>
G:<seg1_end[c<seg1clipright>][s|S]-seg2_start[c<seg2clipleft>][s|S],...>
Segmented alignment (e.g. EST to genome). This is an indication that the read contains large internal gaps -- which should be displayed as segments connected by lines. The data for this attribute consists of a comma delimited list of coordinate pairs for the inter-segment gaps. Coordinates in a pair are separated by the '-' character. For each pair the first coordinate is the end position in the layout of the previous segment, while the second coordinate is the start position of the next segment in the layout.

Example: say we have a "read" (e.g. an mRNA) called "MRNA244" which aligns onto the "contig" (e.g. genomic sequence) of length 34000 as 3 distinct segments (e.g. 3 exons), aligned at genomic coordinates: 300 to 500 (first segment), 800 to 1100 (second segment) and 1500 to 1900 (third segment) respectively. Assuming there is a 30nt clipping at the left end and 20nt clipping at the right end, and that the alignment has a "forward" orientation, the contig and the sequence line for MRNA244 in the layout file would look like this (let's assume there are 281 "reads" total in this imagined layout):

>contig1 281 1 34000
MRNA244 + 953 270 30 20 G:500-800,1100-1500

The actual length of the "read" accounts for the length of each segment (201, 301 and 401 respectively) plus the clipping lengths at each end (20 and 30), so the total is 201+301+401+20+30=953.
The left coordinate of the sequence in the alignment (270) is equal to the position of the first (leftmost) segment (300) minus the left-end clipping (30).
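The arithmetic of this example can be reproduced with a short sketch. The function name segmented_read_fields is hypothetical, and the sketch assumes inclusive segment coordinates given left to right, with no per-segment clipping.

```python
def segmented_read_fields(segments, clip_left, clip_right):
    """Derive <read_length>, <read_start_coord> and the G: attribute.

    segments: list of (start, end) layout coordinates, inclusive, left to right.
    """
    # Read length = sum of segment lengths plus the clipping at both ends.
    read_length = sum(end - start + 1 for start, end in segments)
    read_length += clip_left + clip_right
    # The read start is the first segment's position minus the left clipping.
    read_start = segments[0][0] - clip_left
    # Each inter-segment gap is written as <previous_end>-<next_start>.
    gaps = ",".join(
        f"{prev_end}-{next_start}"
        for (_, prev_end), (next_start, _) in zip(segments, segments[1:])
    )
    return read_length, read_start, "G:" + gaps


length, start, g_attr = segmented_read_fields(
    [(300, 500), (800, 1100), (1500, 1900)], clip_left=30, clip_right=20
)
print("MRNA244", "+", length, start, 30, 20, g_attr)
# prints: MRNA244 + 953 270 30 20 G:500-800,1100-1500
```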

Clipping can also be specified for each individual segment, by appending to a segment's end coordinate the character c followed by the amount of clipping at that end. If in the example segmented alignment above we had the 1st exon clipped 10 nucleotides at the right end, the 2nd exon clipped 5nt at the left end and 7nt at the right end, with the 3rd exon having 9nt clipped at the left end, the above read line may look like this:

MRNA244 + 953 270 30 20 G:490c10-805c5,1093c7-1509c9

Please note that in this last example the actual coordinates of the alignment of the 3 segments (exons) to the genomic sequence are 300-490, 805-1093 and 1509-1900 respectively. The way the clipping is specified in this G: attribute differs from the way the leftmost and rightmost clipping of the whole read is given. The difference is that the c clipping lengths in the G: attribute lie OUTSIDE the coordinates given for the segment ends in the same G: attribute, while the global leftmost clipping (30 in the example above) is included in the offset coordinate for the whole read (270 here).

For EST to genome alignments, an optional 'S' (or 's') character may follow the inter-segment ends, indicating that a splice consensus (major or minor, respectively) was found on that side of the intron corresponding to that inter-segment gap.
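A possible way to read the G: data, including the optional c clipping and s/S splice markers, is sketched below. The names parse_g_attr and segment_coords are hypothetical. The sketch assumes non-negative integer coordinates, and it derives the last segment's end from <read_length> on the assumption that every read base is either aligned or clipped.

```python
import re

# One side of an inter-segment gap: <coord>[c<clip>][s|S]
_GAP_END = re.compile(r"(\d+)(?:c(\d+))?([sS])?")


def parse_g_attr(data):
    """Parse the data part of a G: attribute into a list of gap dicts."""
    gaps = []
    for pair in data.split(","):
        left, sep, right = pair.partition("-")
        ml, mr = _GAP_END.fullmatch(left), _GAP_END.fullmatch(right)
        if not sep or ml is None or mr is None:
            raise ValueError(f"malformed G: pair {pair!r}")
        gaps.append({
            "prev_end": int(ml.group(1)),
            "prev_clip_right": int(ml.group(2) or 0),
            "prev_splice": ml.group(3),   # 'S' major, 's' minor, None if absent
            "next_start": int(mr.group(1)),
            "next_clip_left": int(mr.group(2) or 0),
            "next_splice": mr.group(3),
        })
    return gaps


def segment_coords(read_start, read_length, clip_left, clip_right, gaps):
    """Return the aligned (start, end) layout coordinates of every segment."""
    # The global left clipping is included in read_start, so skip over it.
    starts = [read_start + clip_left] + [g["next_start"] for g in gaps]
    ends = [g["prev_end"] for g in gaps]
    # Per-segment clipping lies outside the coordinates given in G:.
    inner_clips = sum(g["prev_clip_right"] + g["next_clip_left"] for g in gaps)
    aligned_so_far = sum(e - s + 1 for s, e in zip(starts, ends))
    last_len = read_length - clip_left - clip_right - inner_clips - aligned_so_far
    ends.append(starts[-1] + last_len - 1)
    return list(zip(starts, ends))


gaps = parse_g_attr("490c10-805c5,1093c7-1509c9")
print(segment_coords(270, 953, 30, 20, gaps))
# prints: [(300, 490), (805, 1093), (1509, 1900)]
```

Applied to the clipped MRNA244 line above, this recovers the three exon alignments quoted in the text, which also confirms that the read length of 953 is unchanged by the per-segment clipping.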
 
D:<seq_diffs..>
 or for segmented (G:)  reads:
D:<seg1_diffs..>/<seg2_diffs..>


If the contig sequence is given, this read attribute provides the display application with just a list of point differences between this read's sequence and the contig sequence (so S: is not needed). The <seq_diffs> is a concatenation of elements of this format:
<incremental_coordinate><character>

..where <incremental_coordinate> is the numeric position of such a difference relative to the previous difference -- or, if no such previous difference exists, relative to the first (leftmost) non-clipped nucleotide of the read. This incremental coordinate (which is always at least 1) must be followed by a <character> code. This character can be either an actual DNA base letter ('A', 'G', 'C', 'T', 'N', etc.) -- which indicates a nucleotide mismatch at that position, or the "dash" character ('-') indicating a gap in the alignment of this read to the contig sequence.

Example: consider the following alignment between the contig sequence and the read sequence:

contig:      ..A G T T G C T - C C T A - C T A C A G A C C N G...
read:        ..A G T - C C T T C C A A N C T - - A T A C C A G...
(increments:         4 1     3     3   2     3 1   2       4  ... )

Assuming that the above alignment starts at position 200 in the contig and that the read called RDAAA of length 620 has 20 bp clipping right before this alignment (so the read left end coordinate in the layout is 181), the following line description would apply to this read in the layout file (the ending ellipsis ... is not part of the actual text but just a placeholder for possible other differences to report):

RDAAA + 620 181 20 0 D:4-1C3T3A2N3-1-2T4A...

Note that in this compact MSA representation there is no information provided about the nucleotide content of the clipped ends of the read. The viewer application may choose to represent such clipped regions as empty or gray boxes (rectangles), with the actual nucleotides only displayed in the non-clipped regions.

For segmented alignments (i.e. those reads having a G: attribute), multiple such lists should be given (one for each segment), separated by the '/' (slash) character. For each such segment-differences list, the first incremental coordinate will be the distance from the beginning of the first (leftmost) non-clipped nucleotide of that segment.
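The D: encoding of the RDAAA example can be reproduced with the sketch below. The function names encode_diffs and decode_diffs are hypothetical. The sketch assumes that increments count alignment columns (as the "increments" row of the example does) and that both sequences are supplied as equal-length gapped strings covering only the non-clipped part of the read.

```python
import re


def encode_diffs(contig_aln, read_aln):
    """Build the <seq_diffs> string from two equal-length gapped strings."""
    if len(contig_aln) != len(read_aln):
        raise ValueError("aligned strings must have the same length")
    out = []
    prev = 0  # column of the previous difference (0 = before the first column)
    for col, (c, r) in enumerate(zip(contig_aln, read_aln), start=1):
        if c != r:
            out.append(f"{col - prev}{r}")  # increment, then the read's character
            prev = col
    return "".join(out)


def decode_diffs(data):
    """Turn a <seq_diffs> string back into (column, character) pairs."""
    tokens = re.findall(r"(\d+)(\D)", data)
    if "".join(n + ch for n, ch in tokens) != data:
        raise ValueError(f"malformed difference list {data!r}")
    column, result = 0, []
    for inc, ch in tokens:
        column += int(inc)  # each increment is relative to the previous difference
        result.append((column, ch))
    return result


contig = "AGTTGCT-CCTA-CTACAGACCNG"
read = "AGT-CCTTCCAANCT--ATACCAG"
print(encode_diffs(contig, read))
# prints: 4-1C3T3A2N3-1-2T4A
```

For a segmented read, the same decoding could be applied to each '/'-separated list in turn, for example [decode_diffs(part) for part in data.split("/")].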

I:<seq_indels..>
or for segmented (G:)  reads:
I:<seg1_indels..>/<seg2_indels..>
Similar to the D: attribute, only that the original sequence for the read is assumed known from other sources (e.g. an indexed multi-FASTA file) and only gaps and deletions are reported as the operations needed to make that sequence fit into the current MSA.
The coordinate system is now entirely based on the original, raw read sequence, but with the same adjustment of the start coordinate based on the left clipping (i.e. all coordinates are relative to the first (leftmost) base in the read that is not clipped but actually used in the MSA).
The <seq_indels> is a concatenation of elements of this format:

<incremental_coordinate><indel_char>

..where <indel_char> can be either '-' (gap) or 'd' (deletion) at the specific base position in the original read sequence. The actual base position can be obtained by this iterative formula:

<base_position>  =  <incremental_coordinate> + <prev_base_position>

..where <prev_base_position> = <clip_left> for the first iteration (first element of  <seq_indels>)
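The iterative formula can be written out as a short sketch. The function name decode_indels and the sample input "5-3d10-" are hypothetical, chosen only to illustrate the arithmetic; the sketch handles a single, non-segmented <seq_indels> list.

```python
import re


def decode_indels(data, clip_left):
    """Return (base_position, indel_char) pairs for one <seq_indels> list."""
    tokens = re.findall(r"(\d+)([-d])", data)
    if "".join(n + ch for n, ch in tokens) != data:
        raise ValueError(f"malformed indel list {data!r}")
    positions = []
    prev = clip_left  # <prev_base_position> for the first element
    for inc, ch in tokens:
        prev += int(inc)  # <base_position> = <incremental_coordinate> + <prev_base_position>
        positions.append((prev, ch))
    return positions


print(decode_indels("5-3d10-", clip_left=20))
# prints: [(25, '-'), (28, 'd'), (38, '-')]
```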

R:<gaplist>/<contig_gaplist>
A special attribute for the Assembly-on-Reference procedure, providing all gap information for the alignment of this read to the parent contig. The gapping information (as produced by mgblast with the -D5 option) is stored directly in this attribute data: the gaps in the read are in the first list <gaplist> and the gaps in the contig (reference) sequence are stored in <contig_gaplist>. The two lists are separated by the '/' (slash) character. Just like mgblast's -D5 output, the gap list has the format:

<gap1pos>[+<gap1length>],<gap2pos>[+<gap2length>],...

The nrcl (non-redundification clustering) program automatically writes this attribute in the layout file produced when the -y option is given (if the gap information is available in the parsed mgblast hits).

The mblaor (mgblast assembly-on-reference) program requires this attribute to be present when parsing an input layout file in order to generate a full MSA, transforming this info into indel operations applied to the read.
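Splitting the R: data into its two gap lists could look like the sketch below. The names parse_gaplist and parse_r_attr and the sample value are hypothetical, and the sketch assumes that a gap written without +<gaplength> has length 1 and that either list may be empty; neither point is stated by the format description above. It is not the parsing code used by mblaor.

```python
def parse_gaplist(text):
    """Parse '<gap1pos>[+<gap1length>],...' into (position, length) pairs."""
    gaps = []
    for item in text.split(","):
        if not item:
            continue  # assumption: an empty list means no gaps
        pos, _, length = item.partition("+")
        # Assumption: an omitted length means a gap of length 1.
        gaps.append((int(pos), int(length) if length else 1))
    return gaps


def parse_r_attr(data):
    """Split R: data into (read_gaps, contig_gaps)."""
    read_part, sep, contig_part = data.partition("/")
    if not sep:
        raise ValueError(f"R: data must contain a '/' separator: {data!r}")
    return parse_gaplist(read_part), parse_gaplist(contig_part)


read_gaps, contig_gaps = parse_r_attr("12,40+3/7+2")
print(read_gaps, contig_gaps)
# prints: [(12, 1), (40, 3)] [(7, 2)]
```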