[Bioclusters] Nightly updated BLAST databases

Mon, 16 Dec 2002 20:01:46 -0500

Uncompressing large files and building blast databases takes time. 
Enough time that you usually don't want to be creating new databases in 
the same directory that users may be launching searches against.

In a previous life several years ago this is what I did, nothing fancy 
at all -- basically I kept a mirror volume + 2 blast database volumes 
(one for the current DBs and a 2nd for the newly built DBs) and 
switched symbolic links between them:

_Every night_

(1) Mirror raw public datasets into storage volume A

(2) build ncbi-blast, wu-blast, fasta and gcg formatted databases into 
storage volume B

(3) run simple quality control script(s) to make sure a failed download 
or truncated file is not going to ruin my day. There are little things 
you can check, like making sure that the new database is the same size 
or larger than the old database. This can be as simple or as complex as 
your computing demands require.

(4) if all looks good and there are no blast jobs running then change 
the symbolic link(s) so that your newly built databases in volume B are 
the ones that people use when a search is fired off. As before, the 
methods for figuring out 'are there any searches running' can be as 
complex or as simple as the production environment demands.

(5) build the next day's databases in volume C

(6) rinse, repeat
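
For what it's worth, steps (3) and (4) can be sketched in a few lines of 
Python. This is only a minimal illustration, not the scripts I actually 
ran: the function names, the 'current' link and the volume paths are all 
made up for the example, and the only QC check is the "same size or 
larger" test mentioned above. The link is switched by creating a 
temporary symlink and renaming it over the old one, which on POSIX 
filesystems replaces the link in a single step.

```python
import os


def total_size(db_dir):
    """Sum the sizes of all regular files under db_dir (0 if it is missing)."""
    total = 0
    for root, _dirs, files in os.walk(db_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def looks_sane(new_dir, old_dir):
    """Step (3): the new build must be non-empty and no smaller than the old one."""
    new_size = total_size(new_dir)
    return new_size > 0 and new_size >= total_size(old_dir)


def switch_link(link_path, new_target):
    """Step (4): repoint link_path at new_target via an atomic rename."""
    tmp = link_path + ".tmp"          # hypothetical scratch name
    if os.path.lexists(tmp):
        os.remove(tmp)                # clear a leftover from a failed run
    os.symlink(new_target, tmp)
    os.replace(tmp, link_path)        # rename over the old link


def promote(link_path, new_dir):
    """Switch link_path to new_dir only if the QC check passes."""
    old_dir = os.path.realpath(link_path)
    if not looks_sane(new_dir, old_dir):
        return False
    switch_link(link_path, new_dir)
    return True
```

Something like promote("/data/blast/current", "/volB/blast") would then 
be the last thing the nightly job does (both paths are hypothetical). 
The "are there any searches running" test from step (4) is deliberately 
left out here, since how to do it depends entirely on your environment.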

All in all I've seen many systems where end users, pipelines and power 
users are pretty tolerant of transient errors, so a massive and 
complicated blast update system was not considered crucial. Others were 
small shops or small groups who could enforce a 'nobody does searches 
between 4-6am' rule. Any queries that were 'Of Interest' were searched 
so many times and via so many different mechanisms that having a 
bulletproof 'db integrity protection system' was worth only a moderate 
amount of effort and resources.

-Chris

Jeremy Mann wrote:
> I am implementing a nightly updated and formatdb script for the BLAST
> unformatted databases from NCBI. A researcher today asked a question that
> I could not answer. His question was, if he runs a long BLAST search
> during the time my script is running, what will happen to his returned
> search? Will he get false positive from both databases (the old one and
> the newly created one)? Will the database be locked out during his search
> and my script will fail?
> 
> I was amazed that I didn't think of this sooner. What does everybody here
> use as a script and how do you prevent the database from being newly
> formatted if a current BLAST search is running?
> 
> Thanks for any answers.