[Bioclusters] How can I make blast job running short time on Gridengine

Thu, 21 Nov 2002 11:57:45 -0600

Hi Grace,
    A popular method for running a job on two different machines at once is
to divide the input into parts, send each part to a different machine, run
the program on each machine against its segment of the input, and then
combine the results. This is usually called an embarrassingly parallel
method, where each job has 5 parts:
1) pre-processing (preparing data) on the submission host
2) data transfer to the nodes
3) processing on the execution nodes
4) data transfer back to the submission node, queuing of the results
5) post-processing (combining the results) on the submission host

So, for the case where you are using BLAST as the application, the database
(or query) can be split on sequence boundaries and sent to each of the nodes
for BLASTing, and the result files sent back to the submission host and
combined to get the final result. This cuts the run time to roughly 1/N of
the single-machine time, depending on the efficiency of your configuration,
where N is the number of nodes in your cluster. With your two hosts, this
would be one way of getting roughly 2x the performance out of your SGE
cluster compared with a single machine.
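
To make the "split on sequence boundaries" step concrete, here is a minimal
Python sketch of step 1 for a FASTA query file. It is only an illustration,
not part of BLAST or SGE: the function names (read_fasta_records,
split_fasta) and the round-robin assignment of records to parts are my own
assumptions, and a real setup would write each part to a file and submit one
qsub job per part.

```python
from typing import Iterable, Iterator, List


def read_fasta_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield one complete FASTA record (header plus sequence lines) at a time."""
    record: List[str] = []
    for line in lines:
        # A new '>' header closes the record collected so far.
        if line.startswith(">") and record:
            yield "".join(record)
            record = []
        if line.strip():  # skip blank lines
            record.append(line if line.endswith("\n") else line + "\n")
    if record:
        yield "".join(record)


def split_fasta(lines: Iterable[str], n_parts: int) -> List[List[str]]:
    """Deal records round-robin into n_parts chunks; no record is ever cut."""
    chunks: List[List[str]] = [[] for _ in range(n_parts)]
    for i, rec in enumerate(read_fasta_records(lines)):
        chunks[i % n_parts].append(rec)
    return chunks


if __name__ == "__main__":
    # Hypothetical three-sequence query, split for a two-node cluster.
    demo = [">seq1\n", "ACGT\n", ">seq2\n", "GGCC\n", "TTAA\n", ">seq3\n", "AAAA\n"]
    for idx, chunk in enumerate(split_fasta(demo, 2)):
        print(f"part {idx}: {len(chunk)} record(s)")
```

Round-robin assignment keeps the parts close in record count without having
to know the total in advance. If the sequences differ a lot in length, it may
be better to balance the parts by total residues instead.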

    Another nice feature of some Distributed Resource Management (DRM) tools
(LSF, SGE, ...) on clusters is that they do some level of load balancing.
So, if you need to run a job and want it to have a fair chance of getting
run alongside whatever everyone else is doing, the DRM will figure out which
machine will give you the best service for your job. One nice feature of
some schedulers in DRM packages (and they are not all equal!!) is that each
user, group, job... can have a priority placed on it that will actually
preempt other jobs, reshuffle the queue... to get the right resources into
the hands of the people who need them most.

Combining (embarrassingly) parallel job execution with the
scheduling/load-balancing features of DRM tools is really the key to
achieving the efficiency in a cluster that makes it a valuable resource for
doing things like BLAST.
________________________________________ 

Mike McCardle 
Systems Software Engineer 
RLX Technologies, Inc. 
mike.mccardle@rlxtechnologies.com
http://www.rlxtechnologies.com

From: bioinfo Gu [mailto:bioinfowistar@yahoo.com]
Sent: Thursday, November 21, 2002 10:08 AM
To: bioclusters@bioinformatics.org
Subject: [Bioclusters] How can I make blast job running short time on
Gridengine

Hi all,

I have two machines with SGE installed. athena is the master and an
execution host; apollo is only an execution host. When I submit a job from
the master host with qsub, the job is dispatched to one queue (one host) and
runs on that machine for the whole process. I cannot see how Gridengine
saves execution time when I launch a blast job on it. How can I reduce blast
running time on Gridengine? Do I have to use a Parallel Environment?

Also, how can I set an environment variable for a specific execution host in
a batch job script? For example:

on one execution host I need BLASTDB to point to path1, and on the second
execution host I want to set BLASTDB to path2. How can I do that?

Thank you very much in advance.

Grace
