[Bioclusters] download blast db with rsync in uncompressed form at

Michael James bioclusters@bioinformatics.org
Wed, 3 Dec 2003 13:00:48 +1100

On Wednesday 03 December 2003 00:22, Fabien Steinmetz wrote:
> On Monday 1 December 2003 18:14, elijah wright wrote:
> > > > in fact rsync can't be used at its "best performances" because the
> > > > databases are already compressed.
> Of course the transmitted data is less than the size of the file, however
> it's very near the size of the file.

Unfortunately that's not necessarily true.

I've got an arrangement with my upstream database provider
 to get rsync access to the databases,
 and rsync often reports a speedup of LESS than 1,
 i.e. more bytes crossed the wire than the file contains.

Rsync is a fantastic tool and stands to benefit us all,
but for it to work well we require some changes upstream.

Considering FASTA databases first,
 updates seem to come out with the entries in a new order.
Even before any new info is added, permuting entries
 will break any chance of rsync helping.

New entries seem to be added randomly through the file.
Rsync considers the file in blocks; if most blocks have changed,
 all rsync can do is ADD the overhead
 of exchanging block-by-block checksums.
If updates left the beginning as unchanged as possible,
 appending new entries, then rsync would work well.
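
To put a rough number on that, here is a minimal sketch, not rsync's
actual algorithm: real rsync uses a rolling weak checksum plus a strong
one and picks its block size from the file size, whereas this toy version
simply tries an MD5 at every offset. The 64-byte block size, the made-up
records and the names block_digests, reusable_fraction and make_records
are all assumptions for illustration.

```python
import hashlib
import random

BLOCK = 64  # hypothetical block size; real rsync picks one based on file size


def block_digests(old, block=BLOCK):
    """Digests of the old file's blocks, taken at fixed block boundaries."""
    return {hashlib.md5(old[i:i + block]).digest()
            for i in range(0, len(old) - block + 1, block)}


def reusable_fraction(old, new, block=BLOCK):
    """Fraction of `new` that could be copied from whole blocks of `old`."""
    if not new:
        return 1.0
    known = block_digests(old, block)
    matched = 0
    i = 0
    while i + block <= len(new):
        if hashlib.md5(new[i:i + block]).digest() in known:
            matched += block   # whole block already present in the old file
            i += block
        else:
            i += 1             # slide one byte and try again
    return matched / len(new)


def make_records(n, seed=0):
    """Made-up FASTA-like records, each shorter than one block."""
    rng = random.Random(seed)
    return [(">seq%d\n%s\n" % (i, "".join(rng.choice("ACGT")
                                          for _ in range(40)))).encode()
            for i in range(n)]


records = make_records(220)
old = b"".join(records[:200])

# Case 1: the 20 new entries are appended to the unchanged old file.
appended = old + b"".join(records[200:])

# Case 2: the same 220 entries come out in a new order.
shuffled_records = records[:]
random.Random(1).shuffle(shuffled_records)
shuffled = b"".join(shuffled_records)

print("appended: %.2f reusable" % reusable_fraction(old, appended))
print("shuffled: %.2f reusable" % reusable_fraction(old, shuffled))
```

With records shorter than a block, the appended case should come out close
to the old data's share of the new file, and the shuffled case close to
zero, since almost every block then straddles a pair of entries that are
no longer neighbours.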

Once the files are compressed, the compression needs to be considered.
Rusty has written some patches to gzip to add a --rsyncable option.
These periodically flush the compression codebook, meaning that changes
 to an early part of the file will not change the entire compressed file.
Again, if the original file pushed changes to the end, it wouldn't matter.
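
The effect of resetting the compressor can be sketched as follows. This
is not the --rsyncable patch itself: as far as I understand it, the patch
resets state inside a single gzip stream at points chosen from the input
data, while this sketch swaps in a simpler technique and starts a fresh
gzip member every CHUNK bytes. The 4096-byte interval, the test data and
the names compress_plain, compress_with_resets and changed_members are
assumptions for illustration.

```python
import gzip
import random

CHUNK = 4096  # hypothetical reset interval in bytes


def compress_plain(data):
    """One gzip stream over the whole input."""
    return gzip.compress(data, mtime=0)


def compress_with_resets(data, chunk=CHUNK):
    """One independent gzip member per chunk; no state crosses a boundary."""
    return [gzip.compress(data[i:i + chunk], mtime=0)
            for i in range(0, len(data), chunk)]


def changed_members(a, b):
    """Count positions where two member lists differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


rng = random.Random(0)
original = "".join(
    ">seq%d\n%s\n" % (i, "".join(rng.choice("ACGT") for _ in range(60)))
    for i in range(600)
).encode()

# Change a single base near the start of the file.
edited = bytearray(original)
edited[10] = ord("N")
edited = bytes(edited)

members_a = compress_with_resets(original)
members_b = compress_with_resets(edited)

# Concatenated members still form one valid gzip file.
assert gzip.decompress(b"".join(members_b)) == edited

print("members changed:", changed_members(members_a, members_b),
      "of", len(members_a))
print("plain gzip output identical:",
      compress_plain(original) == compress_plain(edited))
```

Fixed-size chunks only cope with in-place changes; an inserted entry
shifts every later boundary, which is why reset points derived from the
data itself are the better choice in practice.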

By the time the file has been indexed (formatdb) and tarred,
 it might as well be feathered: all is lost.
For most of my databases,
 I prefer to go back to NCBI and get the FASTA version.
That said, the picture is different (rsync is usable) in some cases (est_*).

My .02,

PS: Don't get me started on the need to
	put comments into FASTA files with version info,
i.e.:	# est_others.fasta, generated by <institute> on <date>

Michael James                         michael.james@csiro.au
System Administrator                    voice:  02 6246 5040
CSIRO Bioinformatics Facility             fax:  02 6246 5166