[Bioclusters] Versioning databases

J.W. Bizzaro jeff at bioinformatics.org
Mon Jun 5 15:25:17 EDT 2006


Dan Bolser once suggested the use of a software packaging system like RPM for 
providing updates to DBs containing multiple flat files.  It's especially 
appealing if it's done in combination with a downloader like yum, and I think 
it's something that Bioinformatics.Org might pursue.  It may be relevant to 
your suggestion, since package managers are aware of version numbers and can 
revert an installed package to an old version.  Large DBs contained in a single 
file would be problematic, though.

Cheers,
Jeff

Michael James wrote:
> Some biological databases actually come in versions;
>  for example, we are up to the TIGR4 rice genome and
>  UniProtKB/Swiss-Prot Release 50.0 of 30-May-2006.
> 
> Others just change daily: NCBI:nr, NCBI:nt, etc.
> 
> All this effort creates a problem for repeatability:
>  the BLAST results you get next week
>  won't quite be the ones you got today.
> 
> It seems to me that the situation would be improved
>  by tagging results "BLAST against ncbi.nih.gov nr 2006-06-05 000"
> 
> This means we need to come up with a versioning scheme
>  and for anything without, I'd suggest something as simple as
>     issuing_authority  database  date        3_digit_release_number
> eg  ncbi.nih.gov       nr        2006-06-05  000
> 
> For uniqueness, use the internet name for issuing_authority.
> 
> The database is the filename stripped of all qualifiers:
> remove things like  .gz  .00.tar.gz
> 
> The date is in ISO format (YYYY-MM-DD)!
> 
> 3 more digits to ensure uniqueness.
> 
> 
> Such a scheme would also be
>  a big win for us database administrators.
> We could start to weave it through the tangled web
>  of different providers and formats
>  so we actually know the original issuing authority
>  for the file we are downloading.
> 
> What do you think?
> michaelj
> 
> 
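
The tagging scheme quoted above could be sketched roughly as follows. This is
only an illustration in Python: the function names, the list of suffixes to
strip, and the choice to sort tags by date and release number are assumptions,
not part of the original proposal.

```python
import re
from datetime import date


def database_name(filename):
    """Strip qualifiers such as .gz or .00.tar.gz from a download filename.

    The suffix list is an assumption; the proposal only gives
    ".gz" and ".00.tar.gz" as examples.
    """
    name = filename
    for suffix in (".gz", ".tar"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    # Drop a trailing numeric volume qualifier such as ".00".
    return re.sub(r"\.\d+$", "", name)


def make_tag(authority, filename, day, release=0):
    """Build 'issuing_authority database date 3_digit_release_number'."""
    if not 0 <= release <= 999:
        raise ValueError("release number must fit in three digits")
    return f"{authority} {database_name(filename)} {day.isoformat()} {release:03d}"


def parse_tag(tag):
    """Split a tag back into (authority, database, date, release)."""
    authority, database, day, release = tag.split()
    return authority, database, date.fromisoformat(day), int(release)


def newest(tags):
    """Return the most recent tag, ordered by date and then release number."""
    return max(tags, key=lambda t: parse_tag(t)[2:])


tag = make_tag("ncbi.nih.gov", "nr.00.tar.gz", date(2006, 6, 5))
print("BLAST against " + tag)
```

Because the date is in ISO format and the release number is zero-padded, tags
for the same authority and database also sort correctly as plain strings.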

-- 
J.W. Bizzaro
Bioinformatics Organization, Inc. (Bioinformatics.Org)
E-mail: jeff at bioinformatics.org
Phone:  +1 508 890 8600
--

