> From: "Chris Dwan (CCGB)" <cdwan@mail.ahc.umn.edu>
>
> Do you have a feeling for where this anti-cycle-stealing attitude comes
> from? Like Chris Dagdigian said, it sounds like you've got benchmarking
> well in hand, and there are a wide variety of examples of folks using
> production workstations to augment their dedicated clusters. If the
> concerns are reliability, security, performance impact, or other technical
> things, those can be worked with numbers and tests.

Chris,

Our system actually uses dual-OS desktops, so it is really an ad-hoc
cluster/farm. The primary OS is Windows XP, which is used during normal
working hours but plays no part in our bioinformatics pipeline, so we are
not actually cycle-stealing, much as I would like to do that. When staff
leave at night they simply reboot their desktops (Linux - RH 8.0 is the
default OS on these machines); the staff member happily goes home and the
machine boots into Linux, synchronises itself and joins our pipeline. So it
is not really cycle stealing, just utilisation outside normal hours.
Apologies if I did not make this clear initially.

The key issues are therefore OS management issues - TCO (the need for a
second hard disk, time to install the second OS, how to roll out updates
for security patches, etc.).

> If, on the other hand, it's concern over trying something new, the Engine
> system recently implemented at Novartis is a decent example of a
> corporation gaining a great deal of horsepower this way.

Can you provide more details on this, or point me in the right direction
to get more info, please?

> I will almost never get in the way of someone who wants to go to a
> dedicated cluster system if they need it to get their work done, and
> they have the money to spend. A well thought out, centralized resource
> will almost always be easier (and cheaper) to administer than a cycle
> stealing solution.

Agreed.
> Pull systems (including SETI@home and friends, as well as Platform's grid
> offering) are best suited for cycle stealing and ad-hoc clustering. They
> can be really troublesome to debug, since what gets lost in the case of
> errors is usually job state, instead of cycles on nodes. Instead
> of the cluster operating at less than peak efficiency, it loses jobs.
> This can be frustrating for both users and admins.

Lost jobs are not really an issue for us. We have processes running to
reap jobs that have not completed and reset their availability status.
Most of our runs are of durations such that jobs lost on nodes that
"leave" the pipeline can be reaped and reassigned without holding up the
overall pipeline unduly. I have reapers that check the online status of
nodes hourly (this could be done more frequently) and reset the status of
jobs on nodes that have left the farm. Additionally, there are reapers
that run based on the expected completion times for the different job
types and on the number of jobs left in that batch (the time between
checks decreases as the number of jobs decreases), so we tend to catch
lost jobs.

Efficiency in a pull system does, however, decrease at the end of a batch
run, where chunks of jobs are grabbed by the nodes and you have no control
over which nodes grab them. If the last few chunks are grabbed by the
slowest nodes, the pipeline has to wait for these to complete before it
can move on. To get around this we run multiple pipelines, so the nodes
are pretty much working all the time.

We only really have a "single user" in our pipeline, which is the
automated pipeline control process itself. I appreciate your comments
about frustrations for users and admins with "pull" approaches where you
have multiple users - they definitely could experience delays.
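For anyone curious, the reaper logic is roughly as follows - this is just a minimal sketch, not our actual code, and all the names and data structures here are made up for illustration (the real pipeline keeps job state in a shared store and the liveness check is a real ping/heartbeat):

```python
# Hypothetical in-memory job table; in practice this would live in a
# database shared by the pipeline controller and the reapers.
jobs = {
    "job-1": {"node": "node-a", "status": "running", "started": 0, "expected": 600},
    "job-2": {"node": "node-b", "status": "running", "started": 0, "expected": 600},
}

def node_is_online(node):
    # Stand-in for a real liveness check (ping, heartbeat table, etc.).
    # Here we pretend node-b rebooted back to Windows and left the farm.
    return node != "node-b"

def reap_lost_jobs(jobs, now):
    """Reset jobs back to 'available' when their node has left the farm,
    or when they have run well past their expected completion time."""
    reaped = []
    for name, job in jobs.items():
        if job["status"] != "running":
            continue
        overdue = now - job["started"] > 2 * job["expected"]
        if not node_is_online(job["node"]) or overdue:
            job["status"] = "available"   # next free node can grab it
            job["node"] = None
            reaped.append(name)
    return reaped

def check_interval(jobs_left, base=3600, floor=60):
    """Time between reaper passes shrinks as the batch drains,
    so stragglers at the end of a run are caught quickly."""
    return max(floor, min(base, jobs_left * 60))

print(reap_lost_jobs(jobs, now=100))  # → ['job-2']
```

The same pass covers both failure modes we see: a node leaving the farm (liveness check fails) and a job silently stalling (overdue against its expected runtime for that job type).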