[Bioclusters] Versioning databases

Michael James Michael.James at csiro.au
Sun Jun 4 22:29:39 EDT 2006


Some biological databases come in numbered versions;
 for example, we are up to the TIGR4 rice genome and
 UniProtKB/Swiss-Prot Release 50.0 of 30-May-2006.

Others just change daily: NCBI:nr, NCBI:nt, etc.

All this change creates a problem for repeatability:
 the BLAST results you get next week
 won't quite be the ones you got today.

It seems to me that the situation would be improved
 by tagging results with the database version, e.g.
 "BLAST against ncbi.nih.gov nr 2006-06-05 000".

This means we need to come up with a versioning scheme,
 and for any database without one I'd suggest something as simple as
      issuing_authority  database  date        3_digit_release_number
 e.g. ncbi.nih.gov       nr        2006-06-05  000

For uniqueness, use the internet name for issuing_authority.

The database is the filename stripped of all qualifiers:
 remove things like  .gz  .00.tar.gz

The date is in ISO format (YYYY-MM-DD)!

3 more digits to ensure uniqueness,
 e.g. if there is more than one release on the same date.
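
To make the four fields concrete, here is a minimal Python sketch of how such a tag might be built. The function names (strip_qualifiers, version_tag) are hypothetical, and the suffix pattern is only my guess at "all qualifiers", based on the two examples above (.gz and .00.tar.gz); a real list would need agreeing on.

```python
import re
from datetime import date

# Assumed qualifier pattern: an optional volume number (.00), then an
# optional .tar, then an optional .gz, all at the end of the filename.
# This is an illustration, not an agreed list of qualifiers.
_QUALIFIERS = re.compile(r"(\.\d+)?(\.tar)?(\.gz)?$")


def strip_qualifiers(filename: str) -> str:
    """Return the bare database name, e.g. 'nr.00.tar.gz' -> 'nr'."""
    return _QUALIFIERS.sub("", filename, count=1)


def version_tag(authority: str, filename: str, released: date,
                release_number: int = 0) -> str:
    """Build 'issuing_authority database date NNN' as one string."""
    if not 0 <= release_number <= 999:
        raise ValueError("release number must fit in 3 digits")
    database = strip_qualifiers(filename)
    # date.isoformat() gives YYYY-MM-DD; :03d zero-pads to 3 digits.
    return f"{authority} {database} {released.isoformat()} {release_number:03d}"


print(version_tag("ncbi.nih.gov", "nr.00.tar.gz", date(2006, 6, 5)))
```

With the inputs shown, this prints the same string as the example above, ncbi.nih.gov nr 2006-06-05 000.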


Such a scheme would also be
 a big win for us database administrators.
We could start to weave it through the tangled web
 of different providers and formats
 so we actually know the original issuing authority
 for the file we are downloading.
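
As a rough illustration of the administrator's side, a tag in the proposed four-field form could be split back into its parts, so a mirror script can report who originally issued a file. This is only a sketch: VersionTag and parse_version_tag are hypothetical names, and it assumes the fields are separated by whitespace and contain none themselves.

```python
from datetime import date
from typing import NamedTuple


class VersionTag(NamedTuple):
    issuing_authority: str
    database: str
    released: date
    release_number: int


def parse_version_tag(tag: str) -> VersionTag:
    """Split 'authority database YYYY-MM-DD NNN' into typed fields."""
    parts = tag.split()
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {tag!r}")
    authority, database, iso_date, number = parts
    if len(number) != 3 or not number.isdecimal():
        raise ValueError(f"release number must be 3 digits: {number!r}")
    # date.fromisoformat() raises ValueError on a malformed date.
    return VersionTag(authority, database,
                      date.fromisoformat(iso_date), int(number))


tag = parse_version_tag("ncbi.nih.gov nr 2006-06-05 000")
print(tag.issuing_authority, tag.database)
```

Keeping the tag a plain whitespace-separated line means it survives being pasted into result files, log lines, or README files next to a download.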

What do you think?
michaelj


-- 
Michael James                         michael.james at csiro.au
System Administrator                    voice:  02 6246 5040
CSIRO Bioinformatics Facility             fax:  02 6246 5166

No matter how much you pay for software,
 you always get less than you hoped.
Unless you pay nothing, then you get more.

