[Bioclusters] Call for comments in fasta files

Michael James bioclusters@bioinformatics.org
Tue, 3 Feb 2004 16:14:47 +1100


In maintaining a library of BLAST databases,
 a major obstacle is the lack of embedded version info.

For example, I have a database as a FASTA file,
 and there is another version (perhaps?)
 on the local biomirror in tar.gz format. (Curse them.)
And for most databases both formats are available from NCBI.

Are they the same?
Which is newer?
What is the difference?

Sizes and dates provide some indication,
 but actually comparing the contents is a non-trivial bit of computing.

What is needed is some label string inside the files,
 even if it's just "NCBI nr 03/02/04".

The FASTA file format needs to allow comments
 so this info can be attached to the files indivisibly.

I propose that the FASTA format be extended so that programs using it:
1) Strip and store as a comment anything on a line after a # sign.
2) Ignore lines with nothing [but whitespace] left after stripping.
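
To make the two rules concrete, here is a rough sketch in Python of how a reader
might apply them. The function name split_comments and the sample record are
made up for illustration; this is not code from formatdb or any existing parser.

```python
def split_comments(lines):
    """Apply the two proposed rules to an iterable of FASTA lines.

    Returns (content, comments): content holds the lines left after
    stripping, comments holds the text found after each '#' sign.
    """
    content, comments = [], []
    for line in lines:
        # Rule 1: strip and store anything after a '#' sign.
        text, sep, comment = line.rstrip("\n").partition("#")
        if sep:
            comments.append(comment.strip())
        # Rule 2: ignore lines with nothing but whitespace left.
        if text.strip():
            content.append(text.rstrip())
    return content, comments


# Hypothetical input: a version label, one record, one trailing comment.
example = [
    "# NCBI nr 03/02/04\n",
    ">seq1 hypothetical protein\n",
    "MKTAYIAKQR  # trailing comment\n",
    "\n",
    "QISFVKSHFS\n",
]
content, comments = split_comments(example)
```

Here content keeps only the description line and the two sequence lines, while
comments holds the version label and the trailing note. One consequence of
rule 1 as written is that a literal # inside a description line would also be
cut off as a comment.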

As far as BLAST is concerned,
 this would involve modifying formatdb
 so it takes all such comments
 and includes them in the existing ".nal" file.
No change is needed to the main blastall binary,
 as this file already contains # comments.

We are free to develop header fields
 as soon as this mechanism exists to attach them.

Your comments?

michaelj

-- 
Michael James                         michael.james@csiro.au
System Administrator                    voice:  02 6246 5040
CSIRO Bioinformatics Facility             fax:  02 6246 5166