[BiO BB] gff to sequence

Leo Goodstadt leo.goodstadt at dpag.ox.ac.uk
Sat Oct 3 09:14:51 EDT 2009


> >> Is there a way to quickly extract out the coordinates from a gff file
> >> and the corresponding sequence from a fasta file?
> >>
> 
This seems of such general use that it calls for a small utility which will
take a (possibly indexed) fasta file and a gff file, and output the sequences
you want. What would people want from such a programme?
Is GTF (http://mblab.wustl.edu/GTF2.html) more useful than GFF?
Should different elements from the same group (gene/transcript) be joined
together in order?
Would one want filtering on the "feature" column so one could retrieve all
splice sites or coding exons?
What would be the output? Another fasta file? How would each "group" of
sequences (e.g. transcript) be labelled? By a user-supplied regular expression?
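
To make the questions above concrete, here is a rough sketch of what such a
utility might look like. It is only one possible answer, not an existing tool:
the function names (read_fasta, extract, to_fasta) are made up for
illustration, it assumes nine tab-separated columns with 1-based inclusive
coordinates, it holds the whole fasta file in memory rather than using an
index, and it takes the strand of a group from its first element.

```python
import re
from collections import defaultdict

# Complement table for reverse-strand features (assumes plain DNA letters).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_fasta(lines):
    """Parse an iterable of FASTA lines into {sequence_id: sequence}."""
    seqs, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            header = line[1:].split()
            # The id is the first word of the header line.
            name, chunks = (header[0] if header else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def extract(gff_lines, seqs, feature=None, group_regex=r"(.*)"):
    """Return {group_label: sequence}, joining each group's elements in order.

    feature     -- keep only rows whose third column equals this (None = all)
    group_regex -- first capture group, applied to column 9, labels the group
    """
    pattern = re.compile(group_regex)
    groups = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # assumption: rows without a group column are ignored
        seqid, ftype, start, end, strand, attrs = (
            cols[0], cols[2], cols[3], cols[4], cols[6], cols[8])
        if feature is not None and ftype != feature:
            continue
        match = pattern.search(attrs)
        label = match.group(1) if match else attrs
        groups[label].append((seqid, int(start), int(end), strand))

    out = {}
    for label, parts in groups.items():
        parts.sort(key=lambda p: p[1])  # order elements by start coordinate
        # Coordinates are 1-based and inclusive, hence start - 1.
        joined = "".join(seqs[sid][s - 1:e] for sid, s, e, _ in parts)
        if parts[0][3] == "-":
            joined = reverse_complement(joined)
        out[label] = joined
    return out


def to_fasta(records, width=60):
    """Format {label: sequence} as FASTA text wrapped at `width` columns."""
    lines = []
    for label, seq in records.items():
        lines.append(">" + label)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"
```

For GTF input one might call extract(gff_lines, seqs, feature="exon",
group_regex=r'transcript_id "([^"]+)"') and pass the result to to_fasta(),
which gives one fasta record per transcript, labelled by the captured id.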


> I guess it depends what you mean by quick- quick to write you could use awk
> but then it depends what additional things you want to do with results.
> I ended up writing a C++ fasta utility program since PERL can slow down
> sometimes but I ended up grabbing a couple of regex libraries to let me
> grep names etc.
I hope you used boost::regex, which will be in the next C++ standard
(http://www.boost.org/doc/libs/1_40_0/libs/regex/doc/html/index.html) and
is as easy to use and as powerful as perl/python regular expressions (though
the C rules on escaping backslashes are a pain).
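
As a small illustration of the backslash point (in Python, to keep to one
language here): an ordinary string literal needs every regex backslash
doubled, which is the form a C or C++ literal also requires, whereas a raw
string lets you write the pattern as it reads. The "start..end" coordinate
format and the parse_range name are hypothetical, chosen only for the example.

```python
import re

# Doubled backslashes: what a C/C++ string literal would also need.
ESCAPED = "(\\d+)\\.\\.(\\d+)"
# Raw string: the same pattern without the doubling.
RAW = r"(\d+)\.\.(\d+)"


def parse_range(text, pattern=RAW):
    """Return (start, end) from text such as 'chr1:100..250', or None."""
    m = re.search(pattern, text)
    return (int(m.group(1)), int(m.group(2))) if m else None
```

Both literals describe the same pattern, so either can be passed as the
pattern argument.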
Leo
Leo Goodstadt



