[Bioclusters] Nightly updated BLAST databases

Jeremy Mann bioclusters@bioinformatics.org
Mon, 16 Dec 2002 19:51:12 -0600 (CST)

> Uncompressing large files and building blast databases takes time.
> Enough time that you usually don't want to be creating new databases in
> the same directory that users may be launching query searches against.
> In a previous life several years ago this is what I did, nothing fancy
> at all -- basically I kept a mirror volume + 2 blast database volumes
> (one for the current DB and the 2nd for the 'new' current DBs) and
> switched symbolic links between them:

I was thinking the exact same thing today: two directories, one for the
formatted, working database and the other for uncompressing and formatting
the new downloads. The problem I am encountering is that I rsync from
biomirrors.net (THANK YOU BIOMIRRORS!!!), so I have to keep that download
directory around to avoid redownloading the entire blast/ directory each
time. But I like your three-directory method. I'll see what I can come up
with tomorrow.

> (3) run simple quality control script(s) to make sure a failed download
> or truncated file is not going to ruin my day. There are little things
> you can check like making sure that new database is the same size or
> larger than the old database etc. etc. This can be as simple or as
> complex as your computing demands require.
> (4) if all looks good and there are no blast jobs running then change
> the symbolic link(s) so that your newly built databases in volume B are
> the ones that people use when a search is fired off. As before the
> methods for figuring out 'are there any searches running' can be as
> complex or as simple as the production environment demands.
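
A minimal sketch of quoted steps (3) and (4), in Python. It assumes each
database volume is a flat directory of files and that users reach the
databases through a single symbolic link. The function names are
hypothetical, and the size comparison is only the crude check described
above, not a full integrity test.

```python
import os


def total_size(directory):
    """Sum the sizes of the regular files directly inside `directory`."""
    return sum(
        entry.stat().st_size
        for entry in os.scandir(directory)
        if entry.is_file()
    )


def looks_sane(new_dir, old_dir):
    """Crude check: the new build must be at least as large as the old one."""
    return total_size(new_dir) >= total_size(old_dir)


def switch_link(link_path, new_target):
    """Repoint the symlink `link_path` at `new_target` in a single rename."""
    tmp_link = link_path + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clear a leftover from an interrupted run
    os.symlink(new_target, tmp_link)
    # On POSIX, rename() swaps the link itself, so readers see either the
    # old target or the new one, never a missing link.
    os.replace(tmp_link, link_path)
```

With hypothetical paths, the switch would be something like
`if looks_sane("/blast/volB", "/blast/volA"): switch_link("/blast/db",
"/blast/volB")`.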

If you don't mind me asking, how do you do this? How do I control when and
if the BLAST jobs are running? I would think there would have to be some
sort of manual control. Here is what I think I need to do:

1. Run rsync from crontab (already done).
2. Custom script to see if rsync is still running. If so, stop; if not,
   run a 2nd script, which after an hour checks whether rsync is still
   running. I am confused as to how to pull this off. If I run it from
   crontab, I would need to add some sort of check to see if the 1st
   script is running and, if so, not run again until the next day.
3. 3rd script runs uncompress | formatdb into another directory. I have
   this one in place.
4. 4th script re-symlinks db/ from the blast/ directory. I need to add a
   few if statements to see if the 3rd script is still running and to
   check for existing blast jobs.
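
One common way to handle the "is the other script still running" checks
in the list above is a lock file rather than inspecting the process
table. The sketch below is in Python and assumes a Unix system where
`fcntl.flock` is available. The function name and lock path are
hypothetical.

```python
import fcntl


def try_lock(lock_path):
    """Return an open file holding an exclusive lock, or None if it is taken.

    The lock is released when the returned file is closed or when the
    process exits, so a crashed script cannot leave a stale lock behind.
    """
    handle = open(lock_path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle
```

Each cron-driven script would call `try_lock` on a shared path at startup
and simply exit if it gets `None` back, leaving the work for the next
scheduled run.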

It seems to me each script needs to have its own internal checking process
to see if other scripts are running at that time. If running from crontab,
this just complicates things. So each script is run manually?
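
An alternative to giving every script its own checks is a single driver,
started once from crontab, that runs the stages in order and stops at the
first failure. Because each stage starts only after the previous one has
exited, the stages never need to look for each other. A minimal Python
sketch follows; `run_pipeline` is a hypothetical name, and the commands
passed to it would be whatever rsync, formatdb, and re-link scripts are
already in place.

```python
import subprocess


def run_pipeline(stages):
    """Run each command in order and stop at the first one that fails.

    `stages` is a list of argument lists, one per command. Returns the
    index of the failing stage, or None if every stage succeeded.
    """
    for index, command in enumerate(stages):
        result = subprocess.run(command)  # blocks until the command exits
        if result.returncode != 0:
            return index
    return None
```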

> All in all I've seen many systems where end users, pipelines and power
> users are pretty tolerant of transient errors so a massive and
> complicated blast updates system was not considered crucial. Others were
>  small shops or small groups who could enforce a 'nobody does searches
> between 4-6am' rule. Any queries that were 'Of Interest' were searched
> so many times and via so many different mechanisms that having a bullet
> proof 'db integrity protection system' was worth only a moderate amount
> of effort and resources.

In my experience there will always be that one person who disregards the
'don't run at this time' rule. Or their search runs longer than expected,
and if my script(s) run in the meantime, their search will be contaminated
and produce results they weren't expecting. I want to make this as
seamless as possible, so that end users don't need to abide by any rules
and can simply run their jobs and expect the appropriate results.
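
For the "are there any searches running" check, a rough Python sketch,
assuming a Linux host where each process exposes its command name in
/proc/<pid>/comm. The process name `blastall` and both function names are
assumptions for illustration. It only sees searches on the local machine,
so jobs on other cluster nodes would need a different check.

```python
import os


def running_commands(proc_root="/proc"):
    """Return the command names of all processes listed under `proc_root`."""
    names = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "comm")) as handle:
                names.append(handle.read().strip())
        except OSError:
            continue  # the process exited while we were looking
    return names


def blast_is_running(names=("blastall",), proc_root="/proc"):
    """True if any running process has one of the given command names."""
    return any(name in names for name in running_commands(proc_root))
```

The re-link script could then wait, or give up until the next night,
while `blast_is_running()` returns True.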

Thanks for your detailed explanation!

Jeremy Mann

University of Texas Health Science Center
Bioinformatics Core Facility
Phone: (210) 567-2672