[BiO BB] gff to sequence

Mon Oct 5 04:22:33 EDT 2009

2009/10/3 Leo Goodstadt <leo.goodstadt at dpag.ox.ac.uk>:
>> >> Is there a way to quickly extract out the coordinates from a gff file
>> >> and the corresponding sequence from a fasta file?
>> >>
>>
> This seems of such general use that it begs a small utility which will
> take a (possibly indexed) fasta file, a gff and output the sequences you
> want. What would people want from such a programme?

At least one user wants the following: given a GFF file, produce a
multi Fasta sequence file with one sequence from each 'feature' in the
GFF file. Each feature sequence should be derived from the
corresponding reference sequence. Features should probably be
restricted to certain types, as zero length or single base features
are probably not that interesting.
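A rough sketch of that core step in Python, for the sake of discussion. It assumes the reference sequences are already in memory as a dict keyed by sequence id, that coordinates are 1-based and inclusive as in GFF, and that features on the minus strand should be reverse complemented. The function names and the min_length cut-off are made up for illustration, and the record label is only a placeholder built from the coordinates.

```python
# Illustrative sketch only: names and defaults are assumptions, not an existing tool.

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def feature_sequence(reference, start, end, strand="+"):
    """Return the bases for a 1-based, inclusive GFF interval."""
    seq = reference[start - 1:end]
    if strand == "-":
        # Minus-strand features are reported as the reverse complement.
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq

def gff_to_records(gff_lines, references, min_length=2):
    """Yield (label, sequence) for each feature of at least min_length bases."""
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        seqid, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        if end - start + 1 < min_length:
            continue  # drop zero-length and single-base features
        if seqid not in references:
            continue  # no reference sequence available for this feature
        label = f"{seqid}:{start}-{end}({strand})"
        yield label, feature_sequence(references[seqid], start, end, strand)

def write_fasta(records, handle):
    """Write (label, sequence) pairs as a multi Fasta file."""
    for label, seq in records:
        handle.write(f">{label}\n{seq}\n")
```

Calling write_fasta(gff_to_records(open("features.gff"), refs), sys.stdout) would be the whole program, given a refs dict loaded from the Fasta file (the file name is hypothetical).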

> Is GTF (http://mblab.wustl.edu/GTF2.html) more useful or GFF?
> Would different elements from the same group (gene/transcript) be joined
> together in order?

I don't think so. I think GTF was invented to overcome some
limitations with GFF2. However, GFF3 is now the standard:

http://www.sequenceontology.org/gff3.shtml

(I can't believe how incredibly annoying the background image for that page is!)

As far as I know there are no pending 'improvements' to GFF3.

> Would one want filtering on the "features" column so one could retrieve all
> splice sites or coding exons?

That would be a nice feature, and would be easy to implement.
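To show how little is involved, here is a minimal sketch of such a filter in Python. It assumes the feature type is the third tab-separated column, as in GFF, and the function name is invented for this example.

```python
def filter_by_type(gff_lines, wanted_types):
    """Yield only the feature lines whose type (column 3) is in wanted_types."""
    wanted = set(wanted_types)
    for line in gff_lines:
        if line.startswith("#"):
            continue  # comments and directives carry no feature type
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] in wanted:
            yield line
```

Because it yields the original lines unchanged, it could sit in front of whatever does the sequence extraction.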

> What would be the output? Another fasta file? How would each "group" of
> Sequences (e.g. transcript) be labelled? By a user supplied regular expression?

I think the required output is a multi Fasta file. In GFF3 the ID
attribute must be unique within the file (it is only mandatory for
features that have children or span several lines), so where it is
present I'd suggest simply using it as the sequence ID (no point
re-inventing the wheel). You could then include the feature name (if
present) and the reference sequence id in the remainder of the Fasta
def line (http://en.wikipedia.org/wiki/FASTA_format).
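Something along these lines, as a sketch in Python. It assumes GFF3-style attributes in column 9 (semicolon-separated key=value pairs with URL-style escaping) and uses the ID and Name attributes. The "ref=" prefix and the function names are my own invention, and raising an error for a feature without an ID is just one possible policy.

```python
from urllib.parse import unquote

def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of decoded key/value pairs."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[unquote(key.strip())] = unquote(value.strip())
    return attrs

def fasta_defline(seqid, column9):
    """Build '>ID [Name] ref=seqid' from a feature's reference id and attributes."""
    attrs = parse_attributes(column9)
    feature_id = attrs.get("ID")
    if feature_id is None:
        raise ValueError("feature has no ID attribute")
    parts = [feature_id]
    if "Name" in attrs:
        parts.append(attrs["Name"])  # optional human-readable name
    parts.append(f"ref={seqid}")     # hypothetical tag for the reference sequence
    return ">" + " ".join(parts)
```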

>> I guess it depends what you mean by quick - quick to write you could use awk
>> but then it depends what additional things you want to do with results.
>> I ended up writing a C++ fasta utility program since PERL can slow down
>> sometimes but I ended up grabbing a couple of regex libraries to let me
>> grep names etc.
> I hope you used boost::regex which will be in the next C++ standard
> (http://www.boost.org/doc/libs/1_40_0/libs/regex/doc/html/index.html) and
> is as easy to use and powerful as perl/python regular expressions (though
> C rules on escaping backslashes are a pain).
> Leo
> Leo Goodstadt

One thing to consider: if the reference sequence isn't part of the GFF
file and/or isn't passed as a separate Fasta file, the DAS registry
could be queried in order to obtain the URI of a reference server that
provides the sequence:

http://www.dasregistry.org/

That takes the project one step beyond a simple parser, so it's
something to think about rather than an explicit requirement.
Otherwise I think you're right: a little tool to do what you suggested
could be very useful!

Cheers,
Dan.

> _______________________________________________
> BBB mailing list
> BBB at bioinformatics.org
> http://www.bioinformatics.org/mailman/listinfo/bbb
>