----- Original Message -----
From: "Joseph Landman" <landman@scalableinformatics.com>
To: <bioclusters@bioinformatics.org>
Sent: Tuesday, February 11, 2003 8:26 PM
Subject: Re: [Bioclusters] distributed blasting of genomes and WASHU blast

> Tim:
>
> 1) see http://blast.wustl.edu/blast/README.html#Tofly and look for
> hspmax= (among others)

Thanks!

> 2) Is your database an assembled genome? E.g. 1 sequence/chromosome

Yep. About 5 Mb.

> or similar sized entity? If so, you might look at splitting the
> database into smaller sequences by low complexity, or various length
> overlapping segments. It depends upon what information you are trying
> to get at.

I never thought of low complexity. I've repeat masked the sequence and
could split it on some repeat regions. I guess it's obvious I am a
bioinformaticist coming from a CS background, and lack the bio
understanding. My question is: even a valid hit can have some repeat in
it, so wouldn't I still have the same problem? Like in this case, there
are certain very long regions that are masked out, so I feel comfortable
that no hit would span such a region and thereby make it a bad place to
split on. However, I'm after a generalized solution that doesn't require
special knowledge of the sequences. Do you think the method I outlined
below would work?

> Joe
>
> Tim Harsch wrote:
>
> >Two questions here (the quick one first):
> > 1) How do you tell WASHU blast to return more than 1000 hits when
> >using tblastx?
> >
> > 2) If I have two large genomes that need a lengthy blast, how can I
> >split that up?
> >
> >Just considering an SMP machine for now, perhaps SGE later. As we know,
> >threading is not as effective as individual blasts. In my case, with one
> >genome as the database and one as the query, WASHU blast is never using more
> >than one thread, so no parallelism is achieved. I'm thinking that I could
> >take my query sequence, split it into X parts and blast one part per CPU, but
> >then what about the boundaries between sequences as possible hits? If I
> >want to assume no beforehand knowledge of the genome here, I'm thinking I
> >could process the results from the X parts, find the stop base of the last
> >hit on the X-1 part, call it A, and the start base of the first hit of the X
> >part, call it B, and create a subsequence from A to B from the original
> >database sequence, repeat for all boundaries of the X parts, then blast
> >these new subsequences against the database, then union the hits from this
> >with hits from the X parts.
> >
> >If I'm correct, using this method my e-values would even be the same as if
> >I had done a simple one-on-one comparison, because my database never
> >changes.
> >
> >Does this sound reasonable? Even so, if there is an easier method then I
> >sure would like to hear it.
> >
> >Ciao,
> >
> >Tim Harsch
> >Computer Scientist
> >Lawrence Livermore National Laboratory
> >
> >_______________________________________________
> >Bioclusters maillist - Bioclusters@bioinformatics.org
> >https://bioinformatics.org/mailman/listinfo/bioclusters
>
> --
> Joseph Landman, Ph.D
> Scalable Informatics LLC,
> email: landman@scalableinformatics.com
> web : http://scalableinformatics.com
> phone: +1 734 612 4615
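
For illustration only, here is a minimal Python sketch of the coordinate bookkeeping behind the boundary scheme quoted above. It assumes hits are already parsed into (start, stop) pairs in the coordinates of the original unsplit sequence, 1-based and inclusive. The names `split_ranges` and `boundary_windows` are hypothetical and not part of any BLAST package. When a part has no hits at all, the sketch falls back to the outer edge of that part; that fallback is my assumption and is not stated in the thread.

```python
from typing import List, Tuple

# (start, stop), 1-based inclusive, in coordinates of the unsplit sequence
Range = Tuple[int, int]


def split_ranges(length: int, parts: int) -> List[Range]:
    """Split positions 1..length into `parts` contiguous, non-overlapping ranges."""
    base, extra = divmod(length, parts)
    ranges = []
    start = 1
    for i in range(parts):
        # The first `extra` parts get one additional base each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def boundary_windows(ranges: List[Range],
                     hits_per_part: List[List[Range]]) -> List[Range]:
    """For each pair of adjacent parts, return the window (A, B) to re-blast.

    A is the stop base of the last hit in the left part and B is the start
    base of the first hit in the right part, as described in the thread.
    Assumption: if a part has no hits, use that part's outer edge instead.
    """
    windows = []
    for i in range(len(ranges) - 1):
        left_hits = hits_per_part[i]
        right_hits = hits_per_part[i + 1]
        a = max((stop for _, stop in left_hits), default=ranges[i][0])
        b = min((start for start, _ in right_hits), default=ranges[i + 1][1])
        windows.append((a, b))
    return windows


if __name__ == "__main__":
    parts = split_ranges(10, 3)          # [(1, 4), (5, 7), (8, 10)]
    hits = [[(1, 3)], [(6, 7)], []]      # made-up hits, one list per part
    print(boundary_windows(parts, hits)) # [(3, 6), (7, 10)]
```

Each returned window would then be cut from the unsplit sequence, blasted against the same unchanged database, and its hits unioned with the hits from the X parts. Running BLAST itself and parsing its output are deliberately left out of the sketch.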