[Bioclusters] Problems with a large query sequence in BLAST

Wed Mar 16 09:28:05 EST 2005

Dear All,

The interesting problem is this:

When we BLAST a large query (in our case a BAC) against nr, and that
query happens to contain a lot of high-scoring sequences such as
transposable elements, BLAST gives us only those hits when we ask for
the top 100 scoring HSPs. No results for the rest of the query are
shown.

This is of course expected behaviour, because that is how thresholds
work: if you want to see it all, you have to ask for it all. But if
you ask for everything, your output becomes huge, burying the
interesting hits in a big pile of nonsense.

It is of course solvable by asking for a lot, say 3 million HSPs, and
then parsing that output to get the result you want. Another option
would be to mask and re-iterate. The first method has the problem that
"asking for a lot" is rather arbitrary; the second, that you might mask
too much on your first iteration.
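
To make the first method concrete, here is a rough sketch of the kind of
post-filter I have in mind. It is not something we run in production.
It assumes the output was written in the 12-column tabular format
(-m 8 in blastall), with query start/end in columns 7 and 8 and the bit
score in column 12. The function name, the window size and the number
of hits kept per window are all made up for illustration.

```python
from collections import defaultdict


def top_hsps_per_window(lines, window=10_000, keep=5):
    """Keep only the `keep` best-scoring HSPs in each query window.

    `lines` is an iterable of tab-separated BLAST hits (assumed to be
    the standard 12-column tabular layout). `window` and `keep` are
    arbitrary illustrative defaults, not recommended values.
    """
    bins = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        qstart, qend = int(fields[6]), int(fields[7])
        bitscore = float(fields[11])
        # Assign each HSP to the window that contains its query midpoint.
        mid = (min(qstart, qend) + max(qstart, qend)) // 2
        bins[(fields[0], mid // window)].append((bitscore, fields))

    kept = []
    for key in sorted(bins):
        # Best bit score first, then truncate to `keep` per window.
        best = sorted(bins[key], key=lambda hsp: hsp[0], reverse=True)[:keep]
        kept.extend(fields for _, fields in best)
    return kept
```

With this, a stretch of the BAC full of transposon hits contributes at
most `keep` HSPs, so weaker hits elsewhere on the query still show up.
The window size is of course just as arbitrary as "asking for a lot",
but at least it is arbitrary per region rather than for the whole
query.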

I was just wondering if other people have had the same problem, and of
course I'm even more interested in how they solved it.

With kind regards,
Jan

P.S.

The -K option in BLAST (-K: number of best hits from a region to keep)
doesn't work, and we got word from the NCBI that it will be dropped
in future versions.