Hi,

I think the best way to approach this is to cluster the hits linearly on the genome, create "genes", extract them, and run BLAST again with the -v option.

BTW, this does not seem to be a bioclusters question.

Eitan

--------------------
Eitan Rubin, PhD
Head of Bioinformatics
The Bauer Center for Genomics Research
Harvard University
Tel: 617-496-5649
Fax: 617-495-2196

-----Original Message-----
From: bioclusters-request at bioinformatics.org [mailto:bioclusters-request at bioinformatics.org]
Sent: Thursday, March 24, 2005 12:06 PM
To: bioclusters at bioinformatics.org
Subject: Bioclusters Digest, Vol 5, Issue 24

Send Bioclusters mailing list submissions to
    bioclusters at bioinformatics.org

To subscribe or unsubscribe via the World Wide Web, visit
    https://bioinformatics.org/mailman/listinfo/bioclusters
or, via email, send a message with subject or body 'help' to
    bioclusters-request at bioinformatics.org

You can reach the person managing the list at
    bioclusters-owner at bioinformatics.org

When replying, please edit your Subject line so it is more specific than "Re: Contents of Bioclusters digest..."

Today's Topics:

   1. Re: Problems with a large query sequence in BLAST (Jan van Haarst)
   2. Re: Problems with a large query sequence in BLAST (Aaron Darling)

----------------------------------------------------------------------

Message: 1
Date: Thu, 24 Mar 2005 10:38:48 +0100
From: Jan van Haarst <jvhaarst at gmail.com>
Subject: Re: [Bioclusters] Problems with a large query sequence in BLAST
To: bioclusters at bioinformatics.org
Cc: Ian Korf <iankorf at mac.com>
Message-ID: <b3209c4805032401387668a830 at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

On Thu, 17 Mar 2005 07:37:40 -0800, Ian Korf <iankorf at mac.com> wrote:
> What genome does the BAC come from? What are you trying to do exactly?

The data are from tomato and potato, and as there is no way to predict genes well, we use BLAST to get a first rough look at the data.

> You didn't answer that. By the way, there's a really good book on BLAST
> from O'Reilly & Associates that discusses these issues in great detail.

I know of your book, just haven't had a chance to buy & read it yet.

Maybe I should explain myself better so you all can help me better.

What we are trying to do is get a rough idea of which genes are present on a newly sequenced and assembled BAC. The normal way would be to use gene prediction software to predict the genes, and BLAST those genes. But because there aren't good models (yet) for these genomes, we need another way to get a quick look.

When one BLASTs a large query, in our case 65K, the probability of hitting a well-preserved gene is large. And as those genes will give a lot of hits, the rest of the genes will not show up unless you set the number of hits to show very high. But setting the number of results high makes the end user unhappy, as they will have to wade through a lot of the same data to see the more interesting bits.

What I would like is a method to limit the number of hits per region, so that for every region you only see the first 10 or so. NCBI BLAST has such an option (-K), but as I already said, it doesn't work and apparently never will.

I haven't been able to find a solution yet; maybe somebody can point me in the right direction?

--
With kind regards,
Jan

------------------------------

Message: 2
Date: Wed, 23 Mar 2005 09:30:58 -0600
From: Aaron Darling <darling at cs.wisc.edu>
Subject: Re: [Bioclusters] Problems with a large query sequence in BLAST
To: jan at vanhaarst.net, "Clustering, compute farming & distributed computing in life science informatics" <bioclusters at bioinformatics.org>
Message-ID: <42418BB2.4010200 at cs.wisc.edu>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

This sounds more like a job for a global genome aligner than for BLAST. LAGAN and MAVID are great if you are certain there are no rearrangements in your data. Shuffle-LAGAN will handle pairs of possibly rearranged sequences.
Mauve (my own project) will handle multiple rearranged sequences and comes with a visualization component.

LAGAN et al.: http://lagan.stanford.edu/lagan_web/index.shtml
MAVID: http://baboon.math.berkeley.edu/mavid/
Mauve: http://gel.ahabs.wisc.edu/mauve

-Aaron
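For what it's worth, the suggestion at the top of this thread (cluster the hits linearly along the BAC, then look at each cluster separately) might be sketched roughly as below. This is only an illustration, not anything posted to the list: it assumes the hits are available in BLAST's 12-column tab-separated output (query start and end in columns 7 and 8, bit score in column 12), and the names parse_tabular, cluster_hits, top_hits_per_region, extract_region and the max_gap and limit parameters are all made up for the example.

```python
from collections import namedtuple

# One BLAST hit, reduced to the fields needed for clustering on the query.
Hit = namedtuple("Hit", "subject q_start q_end bitscore")


def parse_tabular(lines):
    """Parse tab-separated BLAST hits (assumed 12 standard columns)."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        # Minus-strand hits may be reported with start > end; normalise.
        q_start, q_end = sorted((int(fields[6]), int(fields[7])))
        hits.append(Hit(fields[1], q_start, q_end, float(fields[11])))
    return hits


def cluster_hits(hits, max_gap=0):
    """Group hits whose query intervals overlap or lie within max_gap bases.

    max_gap is a hypothetical tuning knob: a larger value merges
    neighbouring hits (e.g. exons of one gene) into a single region.
    """
    regions = []
    for hit in sorted(hits, key=lambda h: h.q_start):
        if regions and hit.q_start <= regions[-1]["end"] + max_gap:
            regions[-1]["end"] = max(regions[-1]["end"], hit.q_end)
            regions[-1]["hits"].append(hit)
        else:
            regions.append({"start": hit.q_start, "end": hit.q_end, "hits": [hit]})
    return regions


def top_hits_per_region(regions, limit=10):
    """Keep only the `limit` best-scoring hits in each region."""
    return [
        (r["start"], r["end"],
         sorted(r["hits"], key=lambda h: h.bitscore, reverse=True)[:limit])
        for r in regions
    ]


def extract_region(sequence, start, end):
    """Cut a region out of the query sequence (1-based, inclusive coordinates)."""
    return sequence[start - 1:end]
```

The per-region lists from top_hits_per_region give the "first 10 or so per region" view asked for above, and the subsequences from extract_region could be written to a FASTA file and BLASTed again one region at a time. Choosing max_gap is a judgment call: too small splits one gene into several regions, too large merges neighbouring genes.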