[Bioclusters] distributed blasting of genomes and WASHU blast

Tim Harsch bioclusters@bioinformatics.org
Wed, 12 Feb 2003 15:43:03 -0800

----- Original Message -----
From: "Joseph Landman" <landman@scalableinformatics.com>
To: <bioclusters@bioinformatics.org>
Sent: Tuesday, February 11, 2003 8:26 PM
Subject: Re: [Bioclusters] distributed blasting of genomes and WASHU blast

> Tim:
>   1)  see http://blast.wustl.edu/blast/README.html#Tofly and look for
> hspmax= (among others)
>   2)  Is your database an assembled genome?  E.g.  1 sequence/chromosome
Yep.  About 5 Mb.
> or similar sized entity?  If so, you might look at splitting the
> database into smaller sequences by low complexity, or various length
> overlapping segments.  It depends upon what information you are trying
> to get at.
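
A minimal sketch of the "various length overlapping segments" idea, in Python. The function name, segment length, and overlap below are illustrative assumptions, not anything WU-BLAST provides; each piece keeps its offset so hit coordinates can be mapped back to the original sequence later.

```python
def overlapping_segments(seq, seg_len, overlap):
    """Yield (offset, subsequence) pairs that cover seq.

    Consecutive segments share `overlap` bases, so a hit shorter than
    the overlap cannot be lost at a cut point. All names here are
    hypothetical and chosen only for illustration.
    """
    if not 0 <= overlap < seg_len:
        raise ValueError("overlap must be >= 0 and smaller than seg_len")
    if not seq:
        return
    step = seg_len - overlap
    start = 0
    while True:
        yield start, seq[start:start + seg_len]
        # Stop once the current segment reaches the end of the sequence.
        if start + seg_len >= len(seq):
            break
        start += step


if __name__ == "__main__":
    for offset, piece in overlapping_segments("ACGTACGTAC", 4, 2):
        print(offset, piece)
```

The overlap would need to be at least as long as the longest hit one expects to span a cut, which is itself an assumption about the data.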

I never thought of low complexity.  I've repeat masked the sequence and
could split it on some repeat regions.  I guess it's obvious I am a
bioinformaticist coming from a CS background, and lack the bio
understanding.  My question is:  even a valid hit can have some repeat in
it, so wouldn't I still have the same problem?  In this case, there are
certain very long regions that are masked out, so I feel comfortable that no
hit would span such a region, which makes it a safe place to split.
However, I'm after a generalized solution that doesn't require special
knowledge of the sequences.  Do you think the method I outlined below would

> Joe
> Tim Harsch wrote:
> >Two questions here (the quick one first):
> >    1)    How do you tell WASHU blast to return more than 1000 hits when
> >using tblastx?
> >
> >    2)    If I have two large genomes that need a lengthy blast, how can I
> >split that up?
> >
> >Just considering an SMP machine for now, perhaps SGE later.  As we know,
> >threading is not as effective as individual blasts.  In my case, with one
> >genome as the database and one as the query, WASHU blast is never using more
> >than one thread, so no parallelism is achieved.  I'm thinking that I could
> >take my query sequence, split it into X parts, and blast one part per CPU,
> >but then what about the boundaries between sequences as possible hits?  If I
> >want to assume no before-hand knowledge of the genome here, I'm thinking I
> >could process the results from the X parts, find the stop base of the last
> >hit on the X-1 part, call it A, and the start base of the first hit of the X
> >part, call it B, and create a subsequence from A to B from the original
> >database sequence, repeat for all boundaries of the X parts, then blast
> >these new subsequences against the database then union the hits from this
> >with hits from the X parts.
> >
> >If I'm correct, using this method my e-values would even be the same as if
> >I had done a simple one-on-one comparison, because my database never
> >changes.
> >
> >Does this sound reasonable?  Even so, if there is an easier method then I
> >sure would like to hear it.
> >
> >Ciao,
> >
> >Tim Harsch
> >Computer Scientist
> >Lawrence Livermore National Laboratory
> >
> >_______________________________________________
> >Bioclusters maillist  -  Bioclusters@bioinformatics.org
> >https://bioinformatics.org/mailman/listinfo/bioclusters
> >
> >
> --
> Joseph Landman, Ph.D
> Scalable Informatics LLC,
> email: landman@scalableinformatics.com
> web  : http://scalableinformatics.com
> phone: +1 734 612 4615
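
The boundary step described in the quoted message above (take A as the stop base of the last hit on part X-1 and B as the start base of the first hit on part X) could be sketched roughly as follows. This is only an illustration: the function name and the data layout are assumptions, hits are taken to be already converted to coordinates on the original sequence, and falling back to the whole neighbouring part when a part has no hits is my own choice, not something from the thread.

```python
def boundary_regions(parts, hits):
    """Return one (A, B) region per boundary between adjacent parts.

    parts: ordered list of (start, end) coordinates of each part on
           the original sequence.
    hits:  list parallel to parts; hits[k] is a list of (start, stop)
           hit coordinates found in part k, on the original sequence.

    Hypothetical helper: A is the largest hit stop in the left part,
    B is the smallest hit start in the right part. If a part has no
    hits, the region is extended to cover that whole part (assumption).
    """
    regions = []
    for k in range(1, len(parts)):
        left_hits, right_hits = hits[k - 1], hits[k]
        a = max((stop for _, stop in left_hits), default=parts[k - 1][0])
        b = min((start for start, _ in right_hits), default=parts[k][1])
        regions.append((a, b))
    return regions


if __name__ == "__main__":
    parts = [(0, 100), (100, 200), (200, 300)]
    hits = [[(10, 40), (60, 90)], [(120, 150)], []]
    print(boundary_regions(parts, hits))
```

Each returned region would then be cut from the original sequence, blasted again, and its hits merged with the per-part hits, dropping any duplicates.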