[Bioclusters] Nightly updated BLAST databases

Dick Repasky bioclusters@bioinformatics.org
Wed, 18 Dec 2002 09:59:56 -0500 (EST)

I use a pipeline for updating databases that differs from those used by
other posters in that it defers updates for databases that are in use.  
This works for us because we have a number of users who make sporadic,
intense use of blast rather than users who constantly use blast. The
pipeline consists of three perl scripts: a downloader, a formatter and an
installer.  Passed along the pipeline are names of databases.  Because the
process is a pipeline, some degree of parallelism is achieved.
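
For anyone who wants to try the idea, here is a rough sketch of the
three-stage arrangement. It is in Python rather than the perl we actually
use, and it uses threads and queues to stand in for separate scripts
connected by pipes. The names stage, run_pipeline and DONE are made up for
the illustration; each stage function is assumed to return true if the db
should continue down the pipeline.

```python
import queue
import threading

DONE = object()  # hypothetical sentinel marking the end of the stream of names


def stage(work, inbox, outbox):
    """Apply work(name) to each name from inbox; forward names that pass."""
    while True:
        name = inbox.get()
        if name is DONE:
            if outbox is not None:
                outbox.put(DONE)  # tell the next stage that we are finished
            return
        if work(name) and outbox is not None:
            outbox.put(name)


def run_pipeline(names, download, format_db, install):
    """Push db names through download -> format -> install concurrently."""
    to_download, to_format, to_install = (queue.Queue() for _ in range(3))
    installed = []

    def install_and_record(name):
        ok = install(name)
        if ok:
            installed.append(name)
        return ok

    threads = [
        threading.Thread(target=stage, args=(download, to_download, to_format)),
        threading.Thread(target=stage, args=(format_db, to_format, to_install)),
        threading.Thread(target=stage, args=(install_and_record, to_install, None)),
    ]
    for t in threads:
        t.start()
    for name in names:
        to_download.put(name)
    to_download.put(DONE)
    for t in threads:
        t.join()
    return installed
```

Because each stage only waits on its own inbox, one db can be formatted
while the next is still downloading, which is where the parallelism comes
from.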

The downloader determines what's to be downloaded based on time stamps and
file sizes. If a database is in use when its turn comes up to be
downloaded, it is dropped from the list of downloads. (On the possibly-do
list is the idea of re-queueing the db at the back of the list so that it
will be downloaded if the user finishes with the db before the evening
downloading activity finishes, but user processes typically outlive the
downloading process.)  Otherwise, the db is downloaded.  Upon completion,
the size of the db is checked, and the db is dropped from the pipeline if
the size differs from that at the source. Otherwise, the name of the file
is passed on to the formatter, and the downloader moves on to the next db.
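
The downloader's decisions could be sketched roughly as follows (Python,
not the actual perl). The exact rule for "needs downloading" is an
assumption here: a db is fetched if there is no local copy, if the sizes
differ, or if the remote copy is newer. FileInfo, in_use and fetch are
hypothetical stand-ins; fetch is assumed to return the size in bytes of
the file it downloaded.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class FileInfo:
    size: int     # bytes
    mtime: float  # modification time, seconds since the epoch


def needs_download(local: Optional[FileInfo], remote: FileInfo) -> bool:
    """Assumed rule: fetch when missing, size differs, or remote is newer."""
    if local is None:
        return True
    return remote.size != local.size or remote.mtime > local.mtime


def run_downloader(
    remote: Dict[str, FileInfo],
    local: Dict[str, FileInfo],
    in_use: Callable[[str], bool],
    fetch: Callable[[str], int],
) -> List[str]:
    """Return the names of dbs that should be passed on to the formatter."""
    passed = []
    for name, info in remote.items():
        if not needs_download(local.get(name), info):
            continue
        if in_use(name):  # checked when the db's turn comes up
            continue      # dropped for tonight, not re-queued
        if fetch(name) != info.size:
            continue      # size differs from the source: drop it
        passed.append(name)
    return passed
```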

The formatter uncompresses files, formats them and checks for errors.  
A db is dropped and an error is printed if the return code from
decompression is greater than zero.  Otherwise, the db is formatted.  
After formatting, the log is checked to be sure that all lines begin with
"^====", "^Version", "^Started", or "^Formatted".  Databases passing the 
test are passed on to the installer.  Those failing the test are dropped 
from the pipeline.
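
The log test amounts to a single pattern match per line. A minimal sketch
in Python (not the actual perl) follows; log_is_clean is a made-up name,
and the choice to treat blank lines as failures and an empty log as
passing is an assumption of this sketch.

```python
import re

# Every line of the formatting log must start with one of these prefixes.
OK_LINE = re.compile(r"^(====|Version|Started|Formatted)")


def log_is_clean(log_text: str) -> bool:
    """True if every line of the log begins with an expected prefix.

    Assumptions: a blank line counts as a failure, and an empty log
    passes because there is no offending line.
    """
    return all(OK_LINE.match(line) for line in log_text.splitlines())
```

Anything unexpected in the log, such as a warning or error line, makes the
check fail, so the db never reaches the installer.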

The installer sets ownerships and permissions, checks to see if the db is 
in use, and if not, moves the db into place.  This step is very quick and 
has been set up as a separate step in the pipeline just in case we are 
ever in a situation in which formatting takes place on a file system other
than the one on which the db's are installed.
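
A rough Python sketch of the install step (again, not the actual perl) is
below. install_db and in_use are hypothetical names, the 0o644 mode is
only an example, and setting ownership is left out because it normally
needs elevated privileges.

```python
import os
import shutil


def install_db(staged_files, install_dir, in_use, mode=0o644):
    """Move freshly formatted files into install_dir unless the db is busy.

    in_use is a caller-supplied function returning True while the
    installed db is being read. Returns True if the files were moved.
    """
    for path in staged_files:
        os.chmod(path, mode)  # set permissions before the files go live
    if in_use():
        return False          # defer: leave the new files in staging
    for path in staged_files:
        dest = os.path.join(install_dir, os.path.basename(path))
        shutil.move(path, dest)
    return True
```

shutil.move renames the file when staging and install directories share a
file system and falls back to copy-and-delete when they do not, so the
same code covers the separate-file-system case.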

The pipeline has been performing well for a few months.


Dick Repasky
Bioinformatics Support
UITS Cubicle 101.08
Indiana University