[Bioclusters] Problems with a large query sequence in BLAST

Thu Mar 24 04:38:48 EST 2005

On Thu, 17 Mar 2005 07:37:40 -0800, Ian Korf <iankorf at mac.com> wrote:

> What genome does the BAC come from? What are you trying to do exactly?
The data are from tomato and potato, and as there is no way to
predict genes well, we use BLAST to get a first rough look at the
data.

> You didn't answer that. By the way, there's a really good book on BLAST
> from O'Reilly & Associates that discusses these issues in great detail.

I know of your book, just haven't had a chance to buy & read it yet.

Maybe I should explain myself better so you all can help me better.

What we are trying to do is get a rough idea of which genes are present
on a newly sequenced and assembled BAC. The normal way would be to use
gene prediction software to predict the genes, and then BLAST those genes.
But because there aren't good models (yet) for these genomes, we need
another way to get a quick look.

When one BLASTs a large query, in our case 65K, the probability of
hitting a well-conserved gene is large. And as those genes will give
a lot of hits, the rest of the genes will not show up, unless you set
the number of hits to show very high.
But setting the number of results high makes the end-user unhappy, as
they will have to wade through a lot of the same data to see the more
interesting bits.

What I would like is a method to limit the number of hits per region,
so for every region you only see the first 10 hits or so. NCBI BLAST has
such an option (-K), but as I already said, it doesn't work and
apparently never will.

I haven't been able to find a solution yet. Maybe somebody can point
me in the right direction?
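
To make clearer what I mean by "hits per region", here is a rough
sketch of the kind of post-filtering I have in mind. It is only an
illustration, not something we run: it assumes the BLAST output is in
the 12-column tabular format (-m 8), it cuts the query into fixed-size
bins, and the function name, bin size and hit limit are all made up
for the example.

```python
from collections import defaultdict


def limit_hits_per_region(lines, bin_size=1000, max_hits=10):
    """Keep at most max_hits best-scoring hits per query region.

    lines: iterable of tabular BLAST lines (12 tab-separated columns:
    query, subject, %identity, length, mismatches, gaps, q.start,
    q.end, s.start, s.end, e-value, bit score).
    A "region" is simply a fixed-size bin along the query (assumption).
    """
    bins = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a tabular hit line
        q_start, q_end = int(fields[6]), int(fields[7])
        bit_score = float(fields[11])
        # Assign the hit to the bin that contains its midpoint on the query.
        midpoint = (q_start + q_end) // 2
        bins[(fields[0], midpoint // bin_size)].append((bit_score, line))

    kept = []
    for key in sorted(bins):
        # Highest bit score first, then cut off at max_hits.
        best = sorted(bins[key], key=lambda hit: hit[0], reverse=True)
        kept.extend(line for _, line in best[:max_hits])
    return kept
```

The hits would still have to be generated with a high hit limit first,
so this only hides the redundancy from the end-user, it does not make
the search itself any cheaper. A fixed bin size is also crude: a gene
that straddles a bin boundary is counted in two regions.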

-- 
With kind regards,
Jan