[Bioclusters] Nightly updated BLAST databases
Chris Dagdigian
bioclusters@bioinformatics.org
Mon, 16 Dec 2002 20:01:46 -0500
Uncompressing large files and building blast databases takes time.
Enough time that you usually don't want to be creating new databases in
the same directory that users may be launching searches against.
In a previous life several years ago this is what I did, nothing fancy
at all -- basically I kept a mirror volume + 2 blast database volumes
(one for the current DB and the 2nd for the 'new' current DBs) and
switched symbolic links between them:
_Every night_
(1) Mirror raw public datasets into storage volume A
(2) build ncbi-blast, wu-blast, fasta and gcg formatted databases into
storage volume B
(3) run simple quality control script(s) to make sure a failed download
or truncated file is not going to ruin my day. There are little things
you can check, like making sure that the new database is the same size or
larger than the old database. This can be as simple or as
complex as your computing demands require.
(4) if all looks good and there are no blast jobs running then change
the symbolic link(s) so that your newly built databases in volume B are
the ones that people use when a search is fired off. As before, the
methods for figuring out 'are there any searches running' can be as
complex or as simple as the production environment demands.
(5) build the next day's databases in volume C
(6) rinse, repeat
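
For anyone who wants something concrete, here is a minimal Python sketch of
steps (3) and (4). It is not the script I ran back then, just an
illustration. The function names and the `searches_running` flag are made
up for this example, and the size comparison is only the simplest of the
checks mentioned above. It assumes a POSIX filesystem, where renaming a
temporary symlink over the live one replaces it in a single step.

```python
import os
from pathlib import Path


def total_size(db_dir: Path) -> int:
    """Sum the sizes of all regular files under db_dir."""
    return sum(p.stat().st_size for p in db_dir.rglob("*") if p.is_file())


def passes_size_check(new_dir: Path, old_dir: Path) -> bool:
    """Step (3): the new build must be non-empty and no smaller than the old."""
    new_size = total_size(new_dir)
    return new_size > 0 and new_size >= total_size(old_dir)


def switch_link(link: Path, target: Path) -> None:
    """Step (4): repoint `link` at `target` via a temporary symlink + rename."""
    tmp = link.with_name(link.name + ".tmp")
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()  # clear a leftover from an interrupted earlier run
    os.symlink(target, tmp)
    os.replace(tmp, link)  # rename over the old link (atomic on POSIX)


def nightly_switch(link: Path, new_dir: Path, searches_running: bool = False) -> bool:
    """Switch to new_dir only if nothing is running and the QC check passes.

    `searches_running` is supplied by the caller; how to detect running
    searches is site-specific and not covered here.
    """
    if searches_running:
        return False
    if link.exists() and not passes_size_check(new_dir, link.resolve()):
        return False
    switch_link(link, new_dir)
    return True
```

Searches are launched against the path of the symlink, never against a
volume directly, so the switch does not touch any files on the volume the
link used to point at.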
All in all I've seen many systems where end users, pipelines and power
users are pretty tolerant of transient errors, so a massive and
complicated blast update system was not considered crucial. Others were
small shops or small groups who could enforce a 'nobody does searches
between 4-6am' rule. Any queries that were 'Of Interest' were searched
so many times and via so many different mechanisms that having a
bulletproof 'db integrity protection system' was worth only a moderate amount
of effort and resources.
-Chris
Jeremy Mann wrote:
> I am implementing a nightly updated and formatdb script for the BLAST
> unformatted databases from NCBI. A researcher today asked a question that
> I could not answer. His question was, if he runs a long BLAST search
> during the time my script is running, what will happen to his returned
> search? Will he get false positive from both databases (the old one and
> the newly created one)? Will the database be locked out during his search
> and my script will fail?
>
> I was amazed that I didn't think of this sooner. What does everybody here
> use as a script and how do you prevent the database from being newly
> formatted if a current BLAST search is running?
>
> Thanks for any answers.