[Bioclusters] NCBI updates and how you do them
jason.calvert at novartis.com
Sat Apr 30 10:23:08 EDT 2005
We have many databases to distribute, so I have written a few scripts to
benchmark different distribution methods. udpcast currently seems to work
best for us. I have a script that checks the size of each file in the
database directory on each node, and then makes a list of the files that
need to be udpcast from the master copy on the distribution node, along
with the nodes each file should be cast to. It then starts up a listener
on the appropriate nodes for each file and sends it out. Since udpcast is
slow for a small number of clients, the script also checks how many
clients need the file, and if that number is smaller than your set break
point, it uses NFS (this could be made a command line switch to rsync) to
distribute the file at the same time as the udpcast is going on. There is
also a set point for the file size to decide whether to use udpcast or NFS.
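
To make the selection logic above concrete, here is a minimal Python
sketch. It is not the actual script: the function name, the dictionary
layout and the default break points are all made up for illustration, and
it assumes that files below the size set point go over NFS (the direction
of that test is my assumption).

```python
def plan_distribution(master, nodes, min_clients=4, min_bytes=100 * 1024**2):
    """Decide per file which nodes need it and how to send it.

    master: {filename: size in bytes} for the master copy.
    nodes:  {node name: {filename: size in bytes}} as found on each node.
    min_clients, min_bytes: hypothetical break points, not real defaults.
    Returns {filename: (method, [nodes that need the file])}.
    """
    plan = {}
    for name, size in master.items():
        # A node needs the file if it is missing or its size differs.
        stale = sorted(n for n, files in nodes.items() if files.get(name) != size)
        if not stale:
            continue
        # Assumption: udpcast only pays off for many clients and large files.
        use_udpcast = len(stale) >= min_clients and size >= min_bytes
        plan[name] = ("udpcast" if use_udpcast else "nfs", stale)
    return plan
```

Files that end up with the "nfs" method can be copied while the udpcast
transfers are running, which is the overlap described above.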
We use file size as the indicator because it takes 4 hours just to
checksum all our files each night, and that time will grow with our
databases. The choice of method could easily be made a command line
switch. udpcast has various data checks in its protocol to cover for UDP.
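
A rough Python sketch of what such a switch between the two checks could
look like. The function names and the "size"/"checksum" values are
hypothetical, and MD5 is only an example digest, since the checksum tool
is not named above.

```python
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large databases never sit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_update(master_path, node_path, method="size"):
    """Return True if the node copy looks out of date.

    method="size" is cheap but misses changes that keep the size the same;
    method="checksum" reads both files completely.
    """
    if not os.path.exists(node_path):
        return True
    if method == "size":
        return os.path.getsize(master_path) != os.path.getsize(node_path)
    if method == "checksum":
        return md5_of(master_path) != md5_of(node_path)
    raise ValueError("unknown method: " + method)
```

The size check only needs a stat call per file, which is why it stays
fast as the databases grow. A periodic checksum pass would catch the
same-size changes it cannot see.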
I have also written a script that uses a treed rsync to distribute the
data, but rsync had far too much overhead for databases of our size,
and they will keep growing.
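
For readers unfamiliar with the idea, one common way to arrange a tree of
copies is to let every node that already holds the data push to one node
that does not, so the number of copies doubles each round. The Python
sketch below only computes that schedule. It is an assumption about what
"treed" means here, not the script itself, and the names are made up.

```python
def tree_rounds(source, targets):
    """Schedule copies so that holders of the data double every round.

    Returns a list of rounds; each round is a list of (src, dst) pairs
    that could run in parallel (for example one rsync per pair).
    """
    have = [source]          # nodes that already hold the data
    pending = list(targets)  # nodes still waiting for it
    rounds = []
    while pending:
        pairs = []
        for src in have:
            if not pending:
                break
            pairs.append((src, pending.pop(0)))
        # Receivers of this round become senders in the next one.
        have.extend(dst for _, dst in pairs)
        rounds.append(pairs)
    return rounds
```

With this schedule, 7 targets are reached in 3 rounds instead of 7
sequential copies from the master.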
I was planning on updating the script to do checksums weekly, but I found
a problem with our kernel I had to solve first. I will be starting to
develop the scripts again this coming week.
Is anybody interested in such a project?
Well I am out of wind,
Jason
Jan van Haarst <jvhaarst at gmail.com>
Sent by:
bioclusters-bounces+jason.calvert=pharma.novartis.com at bioinformatics.org
04/30/2005 03:55 AM
Please respond to jan; Please respond to "Clustering, compute farming &
distributed computing in life science informatics"
To: jeremy at biochem.uthscsa.edu, "Clustering, compute farming & distributed
computing in life science informatics" <bioclusters at bioinformatics.org>
cc: (bcc: Jason Calvert/PH/Novartis)
Subject: Re: [Bioclusters] NCBI updates and how you do them
Hi,
On our cluster we use UDPcast ( http://udpcast.linux.lu/ ) to push the data to the nodes, and rsync afterwards to double check the
transfer.
As I understood it, rsync and the (non-FASTA) BLAST databases don't
work well together: you end up sending the complete database through
rsync, which isn't the best solution if you want to push data to a lot of
nodes at the same time.
But maybe that isn't the case anymore. What do you see when you update
the database through rsync?
UDPcast works by broadcasting the data to the nodes, on which listeners
pick up the data.
There are other ways to distribute data from one to many, but UDPcast
works fine for us.
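
As an illustration of the sender/listener pairing, the Python sketch
below builds the two command lines for one file. It assumes the
udp-sender and udp-receiver tools from the UDPcast package with their
--file, --min-receivers and --nokbd options. The helper name and the
example path are made up, and the commands are only constructed here,
not run.

```python
def udpcast_commands(path, n_receivers):
    """Build argument lists for one UDPcast transfer of a single file.

    The receiver command is meant to be started on every node first;
    the sender then waits until n_receivers listeners have connected.
    """
    sender = [
        "udp-sender",
        "--file", path,
        "--min-receivers", str(n_receivers),  # start once all nodes listen
        "--nokbd",                            # do not wait for a keypress
    ]
    receiver = ["udp-receiver", "--file", path, "--nokbd"]
    return sender, receiver
```

Each list could be handed to subprocess.run on the distribution node, or
to whatever remote shell the cluster uses to start the listeners.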
Kind regards,
Jan
2005/4/26, Jeremy Mann jeremy at biochem.uthscsa.edu:
Is rsync the way to push to all nodes? If not, what other alternatives
exist?
--
Jeremy Mann
jeremy at biochem.uthscsa.edu
_______________________________________________
Bioclusters maillist - Bioclusters at bioinformatics.org
https://bioinformatics.org/mailman/listinfo/bioclusters