[Bioclusters] Request for discussions-How to build a biocluster Part 5 (BLAST/DB management)

Imre Vastrik bioclusters@bioinformatics.org
Fri, 03 May 2002 09:53:33 +0100


Sylvain Foisy wrote:

> BLAST
> 
> OK, which version of BLAST should we use: NCBI or WU? I have used both,
> and quite frankly, for most uses they are pretty much equal, although
> WU seems to be faster. Are there any particular features of either that
> could be helpful to specific users?

NCBI blast gives XML and tab-delimited output, which can make your
"parsing life" slightly easier. Also, for whatever reason I've never
managed to make WU blast run faster than NCBI's, but perhaps that has
something to do with me ;). The main thing is to use "fresh" versions:
both blasts have undergone significant speed improvements over the past
few years.
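
For example, the tab-delimited output (-m 8 in blastall) is trivial to
pick apart; a throwaway sketch along these lines (the file name and the
cutoff are just placeholders):

#!/usr/bin/env python
# Sketch: pull out hits below an E-value cutoff from blastall -m 8
# (tab-delimited) output.  "results.m8" is a placeholder file name.

cutoff = 1e-5

for line in open("results.m8"):
    fields = line.rstrip("\n").split("\t")
    query, subject = fields[0], fields[1]
    evalue = float(fields[10])       # column 11 is the E-value
    if evalue <= cutoff:
        print(query, subject, evalue)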

> THE GENBANK DATABASE
> 
> BLAST without the data, what for? OK, what should be downloaded: the
> GenBank database in its own format, or the FASTA-transformed one that
> is found in the BLAST folder at NCBI? In both cases it is a lot of data.
> The idea would be for a user to get the whole GenBank record for a
> particular sequence. However, I think that could be done either way
> with scripts.

Can't comment much on GenBank since I'm using EMBL (how else? ;)). Since
I'm splitting the db by species and by sequence type (mRNA/cDNA,
finished genomic, HTGS, etc., i.e. a finer split than is readily
available for download), I'm parsing the EMBL flatfiles myself.
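
The splitting itself does not need to be anything fancy; a rough sketch
that buffers one entry at a time and routes it by its OS (species) line
(the input file name and the species-to-filename mangling are made up,
and I'm ignoring the sequence-type part of the split here):

#!/usr/bin/env python
# Sketch: split an EMBL flatfile into per-species files.  Entries are
# buffered until the "//" terminator; the first OS line names the
# species.  "emblnew.dat" and the output naming are placeholders.

entry = []
species = None
out_handles = {}

def handle_for(name):
    # one output file per species; crude name-to-filename mangling
    if name not in out_handles:
        out_handles[name] = open(name.replace(" ", "_") + ".dat", "w")
    return out_handles[name]

for line in open("emblnew.dat"):
    entry.append(line)
    if line.startswith("OS   ") and species is None:
        species = line[5:].strip()
    elif line.startswith("//"):
        handle_for(species or "unknown").writelines(entry)
        entry = []
        species = None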

> How should the local database be administered? Reading the archive, I
> think the consensus is that the DB has to be split into n pieces
> (n = number of nodes), each piece sent to a particular node and
> processed with formatdb. Or have I got it all wrong? I would be worried
> that the nodes which get the human sequences or the EST sequences would
> be working very hard while the ones with the vector sequences sit idle.
> Is it feasible to divide the DB so as to split the load over the nodes?

I don't have first-hand experience with blastdbs on clusters (my stuff
is running on a multiprocessor machine), but I would distribute all dbs
to all nodes so that you avoid the issue of some nodes being more
heavily loaded than others due to the different "popularity" of the dbs.
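
E.g. pushing the formatted dbs out to every node with rsync from a cron
job would do the trick; a sketch (node names and paths are obviously
made up):

#!/usr/bin/env python
# Sketch: mirror the blast database directory to every node with rsync.
# Node names and the directory are placeholders for the example.

import subprocess

nodes = ["node01", "node02", "node03"]
db_dir = "/data/blastdb/"

for node in nodes:
    subprocess.call(["rsync", "-a", "--delete", db_dir,
                     node + ":" + db_dir])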

> How should the daily updates be performed?

Again, as I'm familiar only with the EMBL stuff, I can only talk about
that (although GenBank probably operates in a very similar manner). The
EMBL DNA DBs come in the following forms: the release, aka "embl", and
everything since the last release, aka "emblnew". For the latter, EBI's
ftp site offers:
- "cumulative" data, i.e. everything since the last release except the
  records that were changed/deleted
- weekly updates
- daily (well, near-daily) updates
For the latter two there are also transaction lists which you can use to
create the cumulative version locally.
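
If you maintain the cumulative set yourself, the job is basically: drop
the entries named in the transaction list from the old cumulative file,
then append the new update. A sketch of the idea (I'm assuming here that
the transaction list is simply one accession per line; the file names
are placeholders):

#!/usr/bin/env python
# Sketch: rebuild a local "cumulative" flatfile from the previous one
# plus an update, using a transaction list of changed/deleted
# accessions.  Transaction-list format is assumed (one accession per
# line); file names are placeholders.

dropped = set(line.strip() for line in open("transactions.txt"))

out = open("cumulative.new.dat", "w")

entry, accession = [], None
for line in open("cumulative.dat"):
    entry.append(line)
    if line.startswith("AC   ") and accession is None:
        accession = line[5:].split(";")[0].strip()
    elif line.startswith("//"):
        if accession not in dropped:      # keep only entries not superseded
            out.writelines(entry)
        entry, accession = [], None

# then append the new update itself
for line in open("daily.dat"):
    out.write(line)
out.close()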

For a given species/sequence type combination I create 3 blast databases:
-release
-new (everything since release, including the latest)
-latest (the last daily/weekly update)
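
The formatdb runs for those three are easy enough to script; something
like this (paths and the naming scheme are placeholders, and I'm
assuming one FASTA file per database):

#!/usr/bin/env python
# Sketch: build the release/new/latest blast databases for one
# species/sequence type combination with formatdb.  Paths and the
# naming scheme are placeholders; assumes one FASTA file per db.

import subprocess

combo = "human_mrna"

for part in ("release", "new", "latest"):
    fasta = "/data/fasta/%s_%s.fa" % (combo, part)
    dbname = "/data/blastdb/%s_%s" % (combo, part)
    subprocess.call(["formatdb", "-p", "F",     # F = nucleotide
                     "-i", fasta, "-n", dbname, "-o", "T"])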

Users who do their searches regularly (i.e. with each daily/weekly
update) run them on "latest". (Obviously it would be dead handy to have
a way of launching these searches automatically whenever the db is
updated...)
"Occasional" users would search the union of "release" and "new". NCBI
blast allows you to create alias files listing the "real" blastdbs to
use, which means the user does not have to know anything about release,
new, etc. and can just search "everything".
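
An alias file for nucleotide dbs is just a small text file (.nal) with a
TITLE and a DBLIST line, so it is trivial to generate alongside the
formatdb runs; a sketch (names are placeholders again):

#!/usr/bin/env python
# Sketch: write a .nal alias file so users can search "release" + "new"
# as a single database.  Database names/paths are placeholders.

combo = "human_mrna"
parts = ["%s_release" % combo, "%s_new" % combo]

alias = open("/data/blastdb/%s_all.nal" % combo, "w")
alias.write("TITLE %s_all\n" % combo)
alias.write("DBLIST %s\n" % " ".join(parts))
alias.close()

Point the users at "human_mrna_all" and they never need to know what is
behind it.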

Rgds.,

imre

P.S. You can see the web front end to the blast server I've been talking
about at:
http://biomedicum.csc.fi:8010