[Bioclusters] Using BLAST to identify coding regions.

Eitan Rubin ERubin at CGR.Harvard.edu
Thu Mar 24 12:35:54 EST 2005


Hi,

  I think the best way to approach this is to cluster the hits linearly
along the genome, treat each cluster as a "gene", extract those regions, and
run BLAST on them again with the -v option.
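
The clustering step above could be sketched roughly as follows. This is a
minimal Python illustration, not anything taken from the thread: the names
cluster_hits and extract, the 1-based inclusive coordinates, and the max_gap
threshold of 1000 bases are all assumptions chosen for the example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) on the query, 1-based inclusive


def cluster_hits(hits: List[Interval], max_gap: int = 1000) -> List[Interval]:
    """Merge hit intervals that overlap or lie within max_gap bases of each other."""
    # Minus-strand hits may be reported with start > end, so normalise first.
    spans = sorted((min(s, e), max(s, e)) for s, e in hits)
    clusters: List[List[int]] = []
    for start, end in spans:
        if clusters and start - clusters[-1][1] <= max_gap:
            # Close enough to the current cluster: extend it.
            clusters[-1][1] = max(clusters[-1][1], end)
        else:
            # Too far away: start a new candidate "gene".
            clusters.append([start, end])
    return [(s, e) for s, e in clusters]


def extract(sequence: str, region: Interval) -> str:
    """Cut a clustered region out of the query (1-based inclusive coordinates)."""
    start, end = region
    return sequence[start - 1:end]
```

Each extracted region can then be saved as its own FASTA record and used as
a separate, smaller query, so that one heavily hit gene no longer crowds out
its neighbours.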

BTW, that does not seem to be a bioclusters Q.

  Eitan

--------------------
Eitan Rubin, PhD
Head of Bioinformatics
The Bauer Center for Genomics Research
Harvard University
Tel: 617-496-5649 Fax: 617-495-2196
 

-----Original Message-----
From: bioclusters-request at bioinformatics.org
[mailto:bioclusters-request at bioinformatics.org] 
Sent: Thursday, March 24, 2005 12:06 PM
To: bioclusters at bioinformatics.org
Subject: Bioclusters Digest, Vol 5, Issue 24

Today's Topics:

   1. Re: Problems with a large query sequence in BLAST (Jan van Haarst)
   2. Re: Problems with a large query sequence in BLAST (Aaron Darling)


----------------------------------------------------------------------

Message: 1
Date: Thu, 24 Mar 2005 10:38:48 +0100
From: Jan van Haarst <jvhaarst at gmail.com>
Subject: Re: [Bioclusters] Problems with a large query sequence in
	BLAST
To: bioclusters at bioinformatics.org
Cc: Ian Korf <iankorf at mac.com>
Message-ID: <b3209c4805032401387668a830 at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

On Thu, 17 Mar 2005 07:37:40 -0800, Ian Korf <iankorf at mac.com> wrote:

> What genome does the BAC come from? What are you trying to do exactly?
The data are from tomato and potato, and as there is no way to
predict genes well, we use BLAST to get a first rough look at the
data.

> You didn't answer that. By the way, there's a really good book on BLAST
> from O'Reilly & Associates that discusses these issues in great detail.

I know of your book, just haven't had a chance to buy & read it yet.

Maybe I should explain myself better so you all can help me better.

What we are trying to do is get a rough idea of what genes are present on a
newly sequenced and assembled BAC. The normal way would be to use gene
prediction software to predict the genes, and BLAST those genes.
But because there aren't good models (yet) for these genomes, we need
another way to get a quick look.

When one BLASTs a large query, in our case 65K, the probability of
hitting a well-conserved gene is large. And as those genes will give
a lot of hits, the rest of the genes will not show up unless you set
the number of hits to show very high.
But setting the number of results high makes the end user unhappy, as
they will have to wade through a lot of the same data to see the more
interesting bits.

What I would like is a method to limit the number of hits per region,
so for every region you only see the first 10 or so hits. NCBI BLAST has
such an option (-K), but as I already said, it doesn't work and apparently
never will.
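
One way to approximate such a per-region limit is to filter the results
after the search. The sketch below is only an illustration and is not how
the -K option is implemented. It assumes 12-column tab-delimited BLAST output
(the layout legacy blastall writes with -m 8), with query start and end in
columns 7 and 8 and the bit score in column 12; the names Hit, parse_tabular
and limit_per_region are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    subject: str
    qstart: int
    qend: int
    bitscore: float


def parse_tabular(lines: Iterable[str]) -> List[Hit]:
    """Read 12-column tab-delimited BLAST hits, skipping blank and comment lines."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        qstart, qend = int(fields[6]), int(fields[7])
        # Store query coordinates with start <= end regardless of strand.
        hits.append(Hit(fields[1], min(qstart, qend), max(qstart, qend),
                        float(fields[11])))
    return hits


def limit_per_region(hits: Iterable[Hit], max_hits: int = 10) -> List[Hit]:
    """Keep a hit only if fewer than max_hits better-scoring kept hits overlap it."""
    kept: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h.bitscore, reverse=True):
        overlapping = sum(
            1 for k in kept if k.qstart <= hit.qend and hit.qstart <= k.qend
        )
        if overlapping < max_hits:
            kept.append(hit)
    return kept
```

Because hits are visited from the highest bit score down, a region covered by
a strongly conserved gene fills its quota quickly, while weaker hits elsewhere
on the BAC still get through. The overlap count is quadratic in the number of
hits, which should be acceptable for a single 65K query but would need an
interval index for much larger inputs.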

I haven't been able to find a solution yet; maybe somebody can point
me in the right direction?

-- 
With kind regards,
Jan


------------------------------

Message: 2
Date: Wed, 23 Mar 2005 09:30:58 -0600
From: Aaron Darling <darling at cs.wisc.edu>
Subject: Re: [Bioclusters] Problems with a large query sequence in
	BLAST
To: jan at vanhaarst.net, "Clustering,	compute farming & distributed
	computing in life science informatics"
	<bioclusters at bioinformatics.org>
Message-ID: <42418BB2.4010200 at cs.wisc.edu>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

This sounds more like a job for a global genome aligner than for BLAST.

LAGAN and MAVID are great if you are certain there are no rearrangements 
in your data.
Shuffle-LAGAN will handle pairs of possibly rearranged sequences.
Mauve (my own project) will do multiple rearranged sequences and comes 
with a visualization component.

LAGAN et al.:  http://lagan.stanford.edu/lagan_web/index.shtml
Mavid:   http://baboon.math.berkeley.edu/mavid/
Mauve:   http://gel.ahabs.wisc.edu/mauve

-Aaron


------------------------------


End of Bioclusters Digest, Vol 5, Issue 24
******************************************

