[Bioclusters] SGE vs. LSF

Chris Dagdigian dag at sonsorol.org
Tue Aug 16 10:17:15 EDT 2005


{ bias alert: I get paid to work on both LSF and SGE, and I was once
paid by SunEd to work on their N1GE training materials. I am a
wannabe Grid Engine developer. My company does lots of custom
integration and training work with both products. Neither the LSF nor
the SGE people tend to be totally happy when I write these sorts of
emails, so take everything I say with the proper amount of caution and
always do your own testing, research and due diligence ... }



I use both LSF and SGE in my work and have done so for many years
now; my specific choice depends on the project, the workflow and the
end-user requirements.

Both LSF and SGE are excellent for typical life science use cases and
workflows. So good that they are the only two products that I'll use
professionally. It's not worth dealing with the second-tier products
any more when SGE and LSF do what I need exceptionally well.

If forced to give an explanation of the differences between the two  
I'd have to say this ...

Platform LSF is still the "best" product among all the competition. It
is hands down the best overall if you survey the field with an eye
towards advanced features and functions. *However*, things are at a
point now in 2005 where the "stuff" that gives LSF the edge has nothing
to do with the core product (batch scheduling & policy-based resource
allocation within a compute farm or cluster...). When it comes to the
core work of distributing jobs and allocating resources across a
distributed set of heterogeneous hardware, ALL the current products do
a good to excellent job (SGE in particular, but also Torque, PBSPro,
etc.).

In fact I tend to assume that I'll be using SGE on any new project  
with LSF held in the background as an option should the project  
demands dictate it.

So if you are looking for a good cluster resource allocation
mechanism then SGE compares very favorably with LSF. SGE is improving
at an incredibly rapid rate and just keeps getting better. When you
compare on price, SGE is the hands-down winner since the open source
product is free to download and use. I love SGE and am trying to be a
better and more frequent contributor to the user community. I'm not
sure what Sun charges for the official N1GE 6 product and have no idea
how that compares to Platform's LSF pricing.

The 'stuff' that tends to tip an evaluation towards the LSF camp is
generally layered features and advanced capabilities that people are
willing to pay for once they realize they actually need them. These
features are not *core* things that everyone needs to care about or
use.

Things like:

  - Platform LSF ships "for free" a web portal interface that
provides both end-user and LSF-admin functions. Last time I used it,
they were running it as a Java/Tomcat application server but this may
have changed. The web portal you get with LSF is far better than any
open source or commercial web front end for Grid Engine.

  - Platform LSF has exposed APIs for Java, C and web services
programmers who want to write cluster-aware code and workflows. It is
a bit harder to dig one's claws into the SGE internals (despite having
the source) and the DRMAA stuff is still under heavy development.

  - Platform LSF ships with layered features that supply things that
one would typically configure (and support) personally within Grid
Engine. Doing this within Grid Engine is certainly possible but
requires a certain level of SGE expertise and comfort. These include
things like (a) tight integration with parallel environments (MPICH
etc.) and high-speed, low-latency interconnects like Myrinet and
Infiniband, (b) tight integration with FlexLM license servers. The
layered products cost extra money but Platform will formally support
them and "make them work", which can be important in some enterprise
environments. There are many Platform layered products that add extra
features/functions to the core or base LSF product. This is probably
the main differentiator between LSF and SGE.

  - Platform LSF currently has a better reliability/resiliency/failover
framework than Grid Engine, which is still in the midst of sorting out
its transition to Berkeley DB based spooling. In an LSF cluster the
nodes will automatically "elect" a new master should the current master
go down. In SGE you have to configure qmaster failover hosts and live
with some fairly significant filesystem and RPC server constraints
should you want failover while using Berkeley DB spooling. If you skip
Berkeley DB spooling you can use "classic" spooling and achieve simple
failover between master hosts that share an NFS filesystem. To be fair
though, the "reliability" risk with SGE has more to do with the
reliability of the hardware you use on the qmaster host, as the actual
SGE software is pretty darn solid and robust. Neither LSF nor SGE
crashes on me, so my "failover" efforts concentrate more on making sure
the Linux/Solaris/Apple OS X server is reliable/available.

  - If you want a WAN-scale "real grid" deployment, Platform will
happily sell you (and support) the LSF Multicluster product. To do
this with SGE you'd have to hire consultants or otherwise follow in the
footsteps of the groups that are seriously doing hardcore SGE/Globus
integration. It is not trivial and not something I'd recommend for
new SGE admins or users. SGE is fantastic within a LAN or subnet but
things get really complicated as soon as you bring in other grids,
firewalls and remote network links.

  - Platform used to have much better documentation but that has
changed. The SGE 6.x documentation collection is actually very good now.

  - Platform provides official support on the widest variety of OS
platforms. If they sell it for a platform, they support it on that
platform. Sun may only "officially" support the use of their N1GE
version on Solaris/Linux/Windows. If you need support for SGE on your
OS X system or your SGI Altix box then you need to either do it
yourself or hire one of the third-party people/companies that
specialize in this. The SGE mailing list is a fantastic first-pass
resource and there are several companies that can contract SGE
support for you on any platform you can think of. * Warning: I may be
wrong about the scope of Sun's official N1GE support...

  - Configuring and managing Platform LSF feels to me as if it  
requires "less work" than a similar Grid Engine setup. Advanced SGE  
administration and configuration is still relatively undocumented and  
even though I've been using it seriously for years now I still learn  
new things every week from the masters who converse on the sge-users  
mailing list. Many of the techniques and tips they talk about on the  
mailing list have never been formally documented or written about  
except perhaps as a basic HOWTO or a simple mailing list thread.  
Someone still needs to write the "Advanced Grid Engine Administration  
& Tuning" book. On the plus side, as someone who does SGE support,  
training and integration I tend to get some interesting work out of  
this discrepancy!


  etc etc.
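
To make the failover point above a bit more concrete, here is a toy
sketch in Python of the general idea of falling back to the next live
host in an ordered list of master candidates. This is only an
illustration and an assumption on my part: it is not how LSF or SGE
actually implement failover, and the host names and the pick_master
function are hypothetical.

```python
from typing import Dict, List, Optional


def pick_master(candidates: List[str], alive: Dict[str, bool]) -> Optional[str]:
    """Return the first candidate host that is currently up, or None.

    Toy model only: real schedulers rely on heartbeats, lock files and
    shared spool directories, none of which are modelled here.
    """
    for host in candidates:
        if alive.get(host, False):
            return host
    return None


# Hypothetical host names, purely for illustration.
candidates = ["master1", "shadow1", "shadow2"]
status = {"master1": True, "shadow1": True, "shadow2": True}

print(pick_master(candidates, status))  # the primary, while it is up

status["master1"] = False               # simulate the master going down
print(pick_master(candidates, status))  # the next live candidate takes over
```

The ordering of the candidate list is the only "policy" in this sketch.
In a real cluster the hard part is deciding reliably that a host is
down, which is where the filesystem and RPC constraints mentioned above
come in.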


The main point I'm trying to make is that now in 2005, SGE is hands-down
a serious and equal competitor to LSF. The main reason to choose LSF
tends to be the extra layered products and features that your
organization may need: ones that SGE can't provide in
commercial/supported form, or ones that you yourself no longer want to
be personally responsible for managing and maintaining.

If SGE works fine for you then there is no real cause to switch over  
unless during your eval you learn about some layered feature that you  
decide you can't live without any more.  Also LSF and SGE can coexist  
on the same cluster if you want to run them both side-by-side for a  
while.


-Chris




On Aug 16, 2005, at 8:19 AM, Richard Wonka wrote:

>
> Hi list,
>
> I am currently running sybyl, glide, flexX/flexS, FeatureTrees,
> LigPrep and moe using SGE on a couple of dedicated machines
> and some workstations during off-hours (all of which are
> running Debian sarge).
>
> Now Platform wants me to testdrive LSF and I'm wondering if I  
> should put the extra work into a test setup.
>
> I feel that SGE works well for my needs, but then, maybe LSF has  
> some major advantage that I'm not aware of?
>
> So:
>
> * What are the major differences between the two, and why might I
> want to use LSF instead of SGE?
> * Do any of you have experience with either or both systems?
> * If so, what was your experience and what are you using now?
>
> with Greetings,
>
> Richard
>
>
>
> _______________________________________________
> Bioclusters maillist  -  Bioclusters at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters
>


