We have many databases to distribute, so I have written a few scripts to benchmark different distribution methods; udpcast seems to work best for us currently. I have a script that checks the size of each file in the database directory on each node, and then makes a list of the files that need to be udpcast from the master copy on the distribution node, along with the nodes each file has to go to. It then starts up a listener on the appropriate nodes for each file and sends it out.

Since udpcast is slow for a small number of clients, the script also checks how many clients need the file, and if that number is smaller than your set break point, it uses NFS (this could be a command line switch to use rsync instead) to distribute the file at the same time as the udpcast is going on. There is also a setpoint on the file size to decide whether to use udpcast or NFS.

We use file size as the indicator that a file needs updating because it takes 4 hours just to do the checksum on all our files each night, and this time will grow with our databases. Which method to use could easily be made a command line switch. udpcast has its own data checking in its protocol to make up for UDP being unreliable.

I have also written a script that uses a treed rsync to distribute the data, but rsync had way too much overhead with the size of our databases, and these will keep growing. I was planning on updating the script to do checksums weekly, but I found a problem with our kernel that I had to solve first. I will start developing the scripts again this coming week. Is anybody interested in such a project?
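To make the selection logic above a bit more concrete, here is a minimal Python sketch of the planning step only. It is not the actual script: the names `plan_transfers`, `Transfer`, `min_clients` and `min_bytes`, the default thresholds, and the assumption that files below the size setpoint go over NFS are all made up for illustration. It takes the file sizes already collected from the master and from each node and decides, per file, which nodes need it and which method to use.

```python
from typing import Dict, List, NamedTuple


class Transfer(NamedTuple):
    filename: str
    method: str        # "udpcast" or "nfs"
    nodes: List[str]   # nodes whose copy differs in size from the master


def plan_transfers(
    master_sizes: Dict[str, int],
    node_sizes: Dict[str, Dict[str, int]],
    min_clients: int = 4,                  # hypothetical client break point
    min_bytes: int = 100 * 1024 * 1024,    # hypothetical file size setpoint
) -> List[Transfer]:
    """Decide per file which nodes need it and how to send it."""
    plan = []
    for filename, size in sorted(master_sizes.items()):
        # A node needs the file if it is missing or its size differs.
        stale = sorted(
            node for node, sizes in node_sizes.items()
            if sizes.get(filename) != size
        )
        if not stale:
            continue
        # Assumption: udpcast only pays off for big files and many clients.
        use_udpcast = len(stale) >= min_clients and size >= min_bytes
        plan.append(Transfer(filename, "udpcast" if use_udpcast else "nfs", stale))
    return plan


if __name__ == "__main__":
    master = {"nt.00.nsq": 500_000_000, "small.pal": 1000}
    nodes = {
        "n1": {"nt.00.nsq": 1, "small.pal": 1000},
        "n2": {},
        "n3": {"nt.00.nsq": 500_000_000},
        "n4": {"nt.00.nsq": 2},
        "n5": {"nt.00.nsq": 3},
    }
    for transfer in plan_transfers(master, nodes):
        print(transfer)
```

Starting the listeners and the sender for each "udpcast" entry, and running the NFS copies alongside, would sit on top of this plan and is left out here.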
Well I am out of wind,
Jason

Jan van Haarst <jvhaarst at gmail.com>
Sent by: bioclusters-bounces+jason.calvert=pharma.novartis.com at bioinformatics.org
04/30/2005 03:55 AM
Please respond to jan; Please respond to "Clustering, compute farming & distributed computing in life science informatics"

To: jeremy at biochem.uthscsa.edu, "Clustering, compute farming & distributed computing in life science informatics" <bioclusters at bioinformatics.org>
cc: (bcc: Jason Calvert/PH/Novartis)
Subject: Re: [Bioclusters] NCBI updates and how you do them

Hi,

On our cluster we use UDPcast ( http://udpcast.linux.lu/ ) to push the data to the nodes, and rsync afterwards to double-check the transfer. The way I understood it, rsync and the (non-FASTA) blast databases don't work well together: you end up sending the complete database through rsync, which isn't the best solution if you want to push data to a lot of nodes at the same time. But maybe that isn't the case anymore; what do you see when you update the database through rsync?

UDPcast works by broadcasting the data to the nodes, on which listeners pick up the data. There are other ways to distribute data from one to many, but UDPcast works fine for us.

Kind regards,
Jan

2005/4/26, Jeremy Mann jeremy at biochem.uthscsa.edu:
Is rsync the way to push to all nodes? If not, what other alternatives exist?

--
Jeremy Mann
jeremy at biochem.uthscsa.edu