[Bioclusters] SGE vs. LSF
Chris Dagdigian
dag at sonsorol.org
Tue Aug 16 10:17:15 EDT 2005
{ bias alert: I get paid to work on both LSF and SGE and I was once
paid by SunEd to work on their N1GE training materials. I am a
wannabe Grid Engine developer. My company does lots of custom
integration and training work with both products. Neither the LSF nor
the SGE people tend to be totally happy when I write these sorts of
emails, so take everything I say with the proper amount of caution and
always do your own testing, research and due diligence ... }
I use both LSF and SGE in my work and have done so for many years
now; my specific choice depends on the project, the workflow and the
end-user requirements.
Both LSF and SGE are excellent for typical life science use cases and
workflows. So good that they are the only 2 products that I'll use
professionally. It's not worth dealing with the 2nd tier products any
more when SGE and LSF do what I need exceptionally well.
If forced to give an explanation of the differences between the two
I'd have to say this ...
Platform LSF is still the "best" product among any and all
competition. It is hands down the best overall product if you survey
all the competition with an eye towards advanced features and
functions. *However* things are at a point now in 2005 where the
"stuff" that gives LSF the edge has nothing to do with the core
product (batch scheduling & policy based resource allocation within a
compute farm or cluster...). When it comes to the core work of
distributing jobs and doing resource allocation among a distributed
set of heterogeneous hardware resources, ALL the current products
do a good/excellent job (SGE in particular, but also Torque/PBSPro
etc.).
In fact I tend to assume that I'll be using SGE on any new project
with LSF held in the background as an option should the project
demands dictate it.
So if you are looking for a good cluster resource allocation
mechanism then SGE and LSF compare very favorably. SGE is
improving at an incredibly rapid rate and it just keeps getting
better and better. When you compare on price, SGE is the
hands-down winner since the open source product is free to download
and use. I love SGE and am trying to be a better and more frequent
contributor to the user community. I'm not sure what Sun charges for
the official N1GE 6 product and have no idea how that compares to
Platform's LSF pricing.
The 'stuff' that tends to tip the evaluation equation over towards
the LSF camp is generally layered features and advanced capabilities
that people are willing to pay for once they realize they actually
need them. These features are not *core* things that everyone needs
to care about or use.
Things like:
- Platform LSF ships "for free" a web portal interface that
provides both end-user and LSF-admin functions. Last time I used it,
they were running it as a java/tomcat application server but this may
have changed. The web-portal you get with LSF is far better than any
open source or commercial web front end for Grid Engine.
- Platform LSF has exposed APIs for Java, C and web services
programmers who want to write cluster-aware code and workflows. It is
a bit harder to dig one's claws into the SGE internals (despite having
the source) and the DRMAA stuff is still under heavy development.
- Platform LSF ships with layered features that supply things that
one would typically configure (and support) personally within Grid
Engine. Doing this within grid engine is certainly possible but
requires a certain level of SGE expertise and comfort. These include
things like (a) tight integration with parallel environments (MPICH
etc.) and high-speed low latency interconnects like Myrinet and
Infiniband, (b) tight integration with FlexLM license servers. The
layered products cost extra money but Platform LSF will formally
support them and "make them work" which can be important in some
enterprise environments. There are many Platform layered products
that add extra features/functions to the core or base LSF product.
This is probably the main differentiator between LSF and SGE.
- Platform LSF currently has a better reliability/resiliency/
fail-over framework than Grid Engine, which is still in the midst of
sorting out its transition to Berkeley DB based spooling mechanisms.
In an LSF cluster the nodes will automatically "elect" a new master
should the current master go down. In SGE you have to configure
qmaster failover hosts and live with some fairly significant
filesystem and RPC server constraints should you want to have
fail-over while using Berkeley DB spooling. If you skip Berkeley DB
spooling you can use "classic" spooling and achieve simple failover
between master hosts that share an NFS filesystem. To be fair though,
the "reliability" risk with SGE has more to do with the reliability of
the hardware you use on the qmaster host, as the actual SGE software
is pretty darn solid and robust. Neither LSF nor SGE crashes on me, so
my "failover" efforts concentrate more on making sure the
Linux/Solaris/Apple OS X server is reliable/available.
- If you want a WAN-scale "real grid" deployment, Platform will
happily sell you (and support) the LSF Multicluster product. To do
this with SGE you'd have to hire consultants or otherwise follow in the
footsteps of the groups that are seriously doing hardcore SGE/Globus
integration. It is not trivial and not something I'd recommend for
new SGE admins or users. SGE is fantastic within a LAN or subnet but
things get really complicated as soon as you bring on other grids,
firewalls and remote network links.
- Platform used to have much better documentation but that has
changed. The SGE 6.x documentation collection is actually very good now.
- Platform provides official support on the widest variety of OS
platforms. If they sell it for a platform, they support it on that
platform. Sun may only "officially" support the use of their N1GE
version on Solaris/Linux/Windows. If you need support for SGE on your
OS X system or your SGI Altix box then you need to either do it
yourself or hire one of the third party people/companies that
specialize in this. The SGE mailing list is a fantastic first-pass
resource and there are several companies that can contract SGE
support for you on any platform you can think of. * Warning: I may be
wrong about the scope of Sun's official N1GE support...
- Configuring and managing Platform LSF feels to me as if it
requires "less work" than a similar Grid Engine setup. Advanced SGE
administration and configuration is still relatively undocumented and
even though I've been using it seriously for years now I still learn
new things every week from the masters who converse on the sge-users
mailing list. Many of the techniques and tips they talk about on the
mailing list have never been formally documented or written about
except perhaps as a basic HOWTO or a simple mailing list thread.
Someone still needs to write the "Advanced Grid Engine Administration
& Tuning" book. On the plus side, as someone who does SGE support,
training and integration I tend to get some interesting work out of
this discrepancy!
etc etc.
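
To make the "configure it yourself" point about parallel environments
and licenses a bit more concrete, here is a minimal Python sketch of what
a job request looks like on the SGE side once an admin has done that
setup. It only assembles a qsub command line and does not run anything.
The -N, -cwd, -pe and -l options are standard qsub options; everything
else is an assumption for illustration: the parallel environment name
"mpich", the consumable resource "sybyl_lic", the script name and the
helper function itself are hypothetical and would have to match whatever
has actually been defined on your cluster.

```python
import shlex


def build_qsub_command(script, job_name, pe_name=None, slots=1, resources=None):
    """Return the argv list for an SGE qsub call. Nothing is executed."""
    # -N sets the job name, -cwd runs the job from the submit directory.
    cmd = ["qsub", "-N", job_name, "-cwd"]
    if pe_name is not None:
        # -pe <name> <slots> requests slots in a parallel environment.
        # The PE itself must already exist in the cluster configuration.
        cmd += ["-pe", pe_name, str(slots)]
    if resources:
        # -l name=value[,name=value...] requests resources, for example a
        # site-defined consumable that stands in for a license count.
        request = ",".join(
            f"{name}={value}" for name, value in sorted(resources.items())
        )
        cmd += ["-l", request]
    cmd.append(script)
    return cmd


# Hypothetical example: an 8-slot parallel job that also needs one license.
docking = build_qsub_command(
    "run_docking.sh",
    "dock01",
    pe_name="mpich",
    slots=8,
    resources={"sybyl_lic": 1},
)
print(shlex.join(docking))
# prints: qsub -N dock01 -cwd -pe mpich 8 -l sybyl_lic=1 run_docking.sh
```

The submit line is the easy part. A consumable like this is just a
counter that the scheduler decrements, so defining it, sizing it and
keeping it in step with the real license server is the work that lands
on you (or your consultant) rather than on a vendor.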
The main point I'm trying to make is that now in 2005, SGE is
hands-down a serious and equal competitor to LSF. The main reason one would
choose LSF tends to be for the extra layered products and features
that your organization may need that SGE either can't provide in
commercial/supported form or that you yourself no longer want to be
personally responsible for managing and maintaining.
If SGE works fine for you then there is no real cause to switch over
unless during your eval you learn about some layered feature that you
decide you can't live without any more. Also LSF and SGE can coexist
on the same cluster if you want to run them both side-by-side for a
while.
-Chris
On Aug 16, 2005, at 8:19 AM, Richard Wonka wrote:
>
> Hi list,
>
> I am currently running sybyl, glide, flexX/flexS, FeatureTrees,
> LigPrep and moe using SGE on a couple of dedicated machines
> and some workstations during the off-hours (all of which are
> running debian sarge).
>
> Now Platform wants me to testdrive LSF and I'm wondering if I
> should put the extra work into a test setup.
>
> I feel that SGE works well for my needs, but then, maybe LSF has
> some major advantage that I'm not aware of?
>
> So:
>
> * What are the major differences between the two and why might I
> want to use LSF instead of SGE?
> * Do any of you have experience with either or both systems?
> * If so, what are your experiences and what are you using now?
>
> with Greetings,
>
> Richard