Generic Feature Format
From Bioinformatics.Org Wiki
The Generic Feature Format (GFF) is a data format for identifying the features of a sequence. Unlike GenBank and XML documents, GFF presents feature data in a tab-delimited table, one feature per line, which makes it ideal for use with the text manipulation and data analysis tools that work with tabular data: spreadsheets and various Unix commands.
GFF Version 2
In 2000, the Wellcome Trust Sanger Institute published the specification for GFF Version 2, with the following explanation:
The main change from Version 1 to Version 2 is the requirement for a tag-value type structure (essentially semicolon-separated .ace format) for any additional material on the line, following the mandatory fields. Version 2 also allows '.' as a score, for features for which there is no score.
Example from the Sanger specification:
SEQ1	EMBL	atg	103	105	.	+	0
SEQ1	EMBL	exon	103	172	.	+	0
SEQ1	EMBL	splice5	172	173	.	+	.
SEQ1	netgene	splice5	172	173	0.94	+	.
SEQ1	genie	sp5-20	163	182	2.3	+	.
SEQ1	genie	sp5-10	168	177	2.1	+	.
SEQ2	grail	ATG	17	19	2.1	-	0
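Because each feature occupies one tab-delimited line, records like those in the Sanger example can be read with a few lines of code. The following is a minimal sketch in Python, not a complete GFF2 parser: the names `Gff2Feature`, `parse_gff2_line` and `SAMPLE` are hypothetical, the field names follow the GFF2 column order (seqname, source, feature, start, end, score, strand, frame, optional attributes), and the sample data is assumed to be tab-separated.

```python
from typing import NamedTuple, Optional


class Gff2Feature(NamedTuple):
    """One GFF2 record (hypothetical container, for illustration only)."""
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]   # None when the score column is '.'
    strand: str
    frame: Optional[int]     # None when the frame column is '.'
    attributes: str          # raw text of the optional ninth column


def parse_gff2_line(line: str) -> Gff2Feature:
    """Split one tab-delimited GFF2 line into typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 tab-separated fields, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    attributes = fields[8] if len(fields) > 8 else ""
    return Gff2Feature(
        seqname, source, feature,
        int(start), int(end),
        None if score == "." else float(score),
        strand,
        None if frame == "." else int(frame),
        attributes,
    )


# The seven records of the Sanger example, written with explicit tabs.
SAMPLE = "\n".join([
    "SEQ1\tEMBL\tatg\t103\t105\t.\t+\t0",
    "SEQ1\tEMBL\texon\t103\t172\t.\t+\t0",
    "SEQ1\tEMBL\tsplice5\t172\t173\t.\t+\t.",
    "SEQ1\tnetgene\tsplice5\t172\t173\t0.94\t+\t.",
    "SEQ1\tgenie\tsp5-20\t163\t182\t2.3\t+\t.",
    "SEQ1\tgenie\tsp5-10\t168\t177\t2.1\t+\t.",
    "SEQ2\tgrail\tATG\t17\t19\t2.1\t-\t0",
])

features = [parse_gff2_line(line) for line in SAMPLE.splitlines()]

# Keep only the features that carry a numeric score.
scored = [f for f in features if f.score is not None]
for f in scored:
    print(f.seqname, f.source, f.feature, f.start, f.end, f.score)
```

Mapping '.' to `None` keeps "no score" distinct from a score of zero, which is the distinction Version 2 introduced.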
Forks
Forks of GFF, such as GTF, were created to address the specific needs of some projects.
GFF Version 3
In 2006, Lincoln Stein wrote the specification for GFF Version 3, with the following explanation:
Although there are many richer ways of representing genomic features via XML, the stubborn persistence of a variety of ad-hoc tab-delimited flat file formats declares the bioinformatics community's need for a simple format that can be modified with a text editor and processed with shell tools like grep. The GFF format, although widely used, has fragmented into multiple incompatible dialects. When asked why they have modified the published Sanger specification, bioinformaticists frequently answer that the format was insufficient for their needs, and they needed to extend it. The proposed GFF3 format addresses the most common extensions to GFF, while preserving backward compatibility with previous formats. The new format:
- adds a mechanism for representing more than one level of hierarchical grouping of features and subfeatures
- separates the ideas of group membership and feature name/id
- constrains the feature type field to be taken from a controlled vocabulary
- allows a single feature, such as an exon, to belong to more than one group at a time
- provides an explicit convention for pairwise alignments
- provides an explicit convention for features that occupy disjunct regions
Example from the Sequence Ontology Project specification:
##gff-version 3
##sequence-region ctg123 1 1497228
ctg123	.	gene	1000	9000	.	+	.	ID=gene00001;Name=EDEN
ctg123	.	TF_binding_site	1000	1012	.	+	.	Parent=gene00001
ctg123	.	mRNA	1050	9000	.	+	.	ID=mRNA00001;Parent=gene00001
ctg123	.	mRNA	1050	9000	.	+	.	ID=mRNA00002;Parent=gene00001
ctg123	.	mRNA	1300	9000	.	+	.	ID=mRNA00003;Parent=gene00001
ctg123	.	exon	1300	1500	.	+	.	Parent=mRNA00003
ctg123	.	exon	1050	1500	.	+	.	Parent=mRNA00001,mRNA00002
ctg123	.	exon	3000	3902	.	+	.	Parent=mRNA00001,mRNA00003
ctg123	.	exon	5000	5500	.	+	.	Parent=mRNA00001,mRNA00002,mRNA00003
ctg123	.	exon	7000	9000	.	+	.	Parent=mRNA00001,mRNA00002,mRNA00003
ctg123	.	CDS	1201	1500	.	+	0	ID=cds00001;Parent=mRNA00001
ctg123	.	CDS	3000	3902	.	+	0	ID=cds00001;Parent=mRNA00001
ctg123	.	CDS	5000	5500	.	+	0	ID=cds00001;Parent=mRNA00001
ctg123	.	CDS	7000	7600	.	+	0	ID=cds00001;Parent=mRNA00001
ctg123	.	CDS	1201	1500	.	+	0	ID=cds00002;Parent=mRNA00002
ctg123	.	CDS	5000	5500	.	+	0	ID=cds00002;Parent=mRNA00002
ctg123	.	CDS	7000	7600	.	+	0	ID=cds00002;Parent=mRNA00002
ctg123	.	CDS	3301	3902	.	+	0	ID=cds00003;Parent=mRNA00003
ctg123	.	CDS	5000	5500	.	+	1	ID=cds00003;Parent=mRNA00003
ctg123	.	CDS	7000	7600	.	+	2	ID=cds00003;Parent=mRNA00003
ctg123	.	CDS	3391	3902	.	+	0	ID=cds00004;Parent=mRNA00003
ctg123	.	CDS	5000	5500	.	+	1	ID=cds00004;Parent=mRNA00003
ctg123	.	CDS	7000	7600	.	+	2	ID=cds00004;Parent=mRNA00003
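The example shows two of the GFF3 conventions listed in the specification: an exon can name several parents in a comma-separated Parent attribute, and a feature that occupies disjunct regions (such as a CDS) is written as several lines sharing one ID. The sketch below, in Python, shows one way these might be read. It is an illustration under stated assumptions, not a validating GFF3 parser: `parse_attributes`, `children`, `segments` and `SAMPLE` are hypothetical names, the sample is a tab-separated subset of the lines above, and percent-encoded characters in the ninth column are decoded with `urllib.parse.unquote`.

```python
from collections import defaultdict
from typing import Dict, List
from urllib.parse import unquote


def parse_attributes(column9: str) -> Dict[str, List[str]]:
    """Turn 'ID=x;Parent=a,b' into {'ID': ['x'], 'Parent': ['a', 'b']}."""
    attrs: Dict[str, List[str]] = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        tag, _, value = pair.partition("=")
        # Split on commas first, then decode percent-escapes in each value.
        attrs[unquote(tag)] = [unquote(v) for v in value.split(",")]
    return attrs


# A subset of the example above, written with explicit tabs.
SAMPLE = """\
##gff-version 3
ctg123\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene00001;Name=EDEN
ctg123\t.\tmRNA\t1050\t9000\t.\t+\t.\tID=mRNA00001;Parent=gene00001
ctg123\t.\tmRNA\t1300\t9000\t.\t+\t.\tID=mRNA00003;Parent=gene00001
ctg123\t.\texon\t3000\t3902\t.\t+\t.\tParent=mRNA00001,mRNA00003
ctg123\t.\tCDS\t1201\t1500\t.\t+\t0\tID=cds00001;Parent=mRNA00001
ctg123\t.\tCDS\t3000\t3902\t.\t+\t0\tID=cds00001;Parent=mRNA00001
"""

children = defaultdict(list)  # parent ID -> [(type, start, end), ...]
segments = defaultdict(list)  # feature ID -> [(start, end), ...]

for line in SAMPLE.splitlines():
    if not line or line.startswith("#"):
        continue  # skip blank lines, directives and comments
    cols = line.split("\t")
    ftype, start, end = cols[2], int(cols[3]), int(cols[4])
    attrs = parse_attributes(cols[8])
    # A feature with several parents is attached to each of them.
    for parent in attrs.get("Parent", []):
        children[parent].append((ftype, start, end))
    # Lines that share an ID are segments of one disjunct feature.
    for feature_id in attrs.get("ID", []):
        segments[feature_id].append((start, end))

print(dict(children))
print(dict(segments))
```

Here the exon at 3000-3902 appears under both mRNA00001 and mRNA00003, and the two cds00001 lines are collected as segments of a single feature.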
Validation
Software
- gff2ps - Converting genomic annotations in GFF format to PostScript