[Bioclusters] Cluster Concerns to consider (was Re: [Bioclusters] Clusters for bioinformatics... Some numbers or statistics?)

Jennifer Steinbachs stein@fieldmuseum.org
Thu, 30 Aug 2001 14:44:44 -0500 (CDT)

On Thu, 30 Aug 2001, Ivo Grosse wrote:


 >My question is: what would be the advantage of having the beowulf

I just helped compile a list of questions to begin a discussion at a site
that installed a cluster.  An NSF grant covered the original installation
and two years of staff support, but there has been little to no discussion
(that I know of) about ongoing support.  Further, there is no full-time,
highly skilled Unix system administrator on the premises; most of the
computing is still VERY desktop-centric.  Below is the list of questions my husband and
I suggested that the full-time staff (faculty and IT tech staff) need to
address - just as a place to start a discussion about the (perceived)
success of the project and its continued support.

Actually, these questions should have been addressed before we were ever
brought into the project.  They don't include the obvious questions that
also need to be asked when one is building from scratch:
 - is there space for a cluster? power? adequate networking? proper fire
suppression? adequate cooling? backup power?
 - what kind of equipment? (build your own? buy a cluster solution?)
 - who will install it?

Political Questions
  - Is the cluster meeting its original goals?  If it is, where do we
    go from here?  If it isn't, should we improve our existing
    facility, or seek alternative computing strategies? (e.g.,
    high-performance desktop workstations, given the types of
    applications currently in use)
  - Who (which individual or political body, if any) has ultimate
    technical authority over cluster administration and related decisions?
  - Where do we draw the line between technical staff input and
    researcher input?  Have we made this clear to both sides?
  - Who is authorized to use the cluster?  Should former cluster
    users no longer associated with the institution be allowed to access
    the cluster indefinitely?
  - Given that we consider long-term support an issue, how will we
    fund it?
  - Is it important to better advertise the availability of the cluster?
  - Should we be approaching various entities to become donors,
    specifically for cluster-related research?  Who would be responsible
    for this?
  - Should we write another grant for additional support now?  Do
    we expect to be writing another one for future support?  Do we
    anticipate writing another grant for an entirely new cluster?
  - How does our cluster compare to similar installations at peer
    institutions?  Is this really a concern?

Support Questions
  - How is the cluster being used by our users, and is it meeting their
    current needs (support included)?
  - Do we have statistics?  What kind of statistics should we be
    collecting? (e.g., program usage, disk usage, etc.)
  - Does our existing staff have the expertise required to maintain
    our cluster in its present form?  Would we be prepared if we had
    to add an additional 10 nodes within a few months?
  - How do we provide training to new users?  How do we train experienced
    users that want to explore new analytical methods?  How do we train
    our computing staff to meet these types of demands?
  - Who maintains local and external security access to the cluster
    (e.g., the machine that is reachable through the firewall) to ensure
    our accessibility policy is being met?
  - Who maintains the cluster website and related resources?
  - Who is responsible for routine system administration tasks
    (e.g., adding/removing users, backups, maintaining physical hardware,
    staying up to date on OS patches, etc.)?
  - Who is responsible for maintaining custom user-requested software
    on the cluster?  What if certain software doesn't compile (or is
    otherwise unusable with our hardware) but is required?

Technical Questions
  - What are our immediate technical support needs?  Have we been
    addressing day-to-day issues in an effective manner?
  - When home directory storage on the NFS server reaches its physical
    limit (as the cluster becomes more popular), do we implement and
    enforce per-user disk quotas?  For example, should each user be
    restricted to 1 GB of disk space?
  - For security reasons, do we disable accounts after a certain
    period of inactivity?  (How do we define "inactive"?  This ties into
    the accessibility policy raised under Political Questions.)
  - Should we be providing long-term backups of user data on the cluster?
    And if so, would it be better to centralize backups to reduce the
    support burden (e.g., instead of a separate tape drive and backup
    script on each machine)?
  - What new technologies are available to improve cluster
    performance and usability?
  - Should our technical staff have a small test cluster for training
    and R&D of new technologies?
  - Are we comfortable with our existing support contracts and are
    we taking advantage of all that they offer?
  - If we had to buy new hardware today, what vendor(s) would we
    contact?  How do we deal with integrating hardware from
    multiple vendors?
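As one possible starting point for the disk-usage statistics and the 1 GB quota question, here is a minimal Python sketch of a per-user usage report. It assumes each user's home directory sits directly under a single root on the NFS server (e.g. /home/<user>); the function names and the 1 GiB threshold are illustrative, not an existing tool. In practice `du` or the filesystem's own quota tools would do this job; the sketch only shows what such a report computes.

```python
import os

# Hypothetical per-user quota, taking the "1 GB" example as 1 GiB.
QUOTA_BYTES = 1 * 1024**3


def usage_by_user(home_root):
    """Return {username: total bytes} for each directory directly under home_root.

    Assumes one home directory per user.  Sizes are apparent file sizes
    (st_size), not allocated disk blocks.  Symlinks are not followed, so
    linked data is not double-counted.
    """
    usage = {}
    for entry in os.scandir(home_root):
        if not entry.is_dir(follow_symlinks=False):
            continue  # skip stray files and symlinks at the top level
        total = 0
        # os.walk does not descend into symlinked directories by default.
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for name in filenames:
                try:
                    total += os.lstat(os.path.join(dirpath, name)).st_size
                except OSError:
                    continue  # file vanished or is unreadable; skip it
        usage[entry.name] = total
    return usage


def over_quota(usage, quota=QUOTA_BYTES):
    """Return the users whose usage exceeds quota, largest first."""
    offenders = [user for user, used in usage.items() if used > quota]
    return sorted(offenders, key=lambda user: usage[user], reverse=True)
```

Run periodically (e.g. from cron) and logged over time, a report like `over_quota(usage_by_user("/home"))` would show whether a hard quota is actually needed before one is enforced.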

J. Steinbachs, Ph.D.
Computational Biologist