[Bioclusters] Average Pairwise Scores for reciprocal Blast hits
Chris Dwan
cdwan at bioteam.net
Wed Jun 1 11:20:21 EDT 2005
I would assume that it's a memory problem. Scaling the system memory
to fit the script is an expensive (though not uncommon) way to
approach it.
One simple thing to do would be to impose a threshold at which you
would discard pairwise hits. That would let you get some limited set
of answers, and at the same time verify that the number of members in
the hash is really the problem. So:
while(<FILE>){
my ($q, $h, $iden, $alilen, $mism, $gapo, $qst, $qend, $sst,
$send, $evalue, $bit)= split /\s+|\t+/, $_;
if ($evalue < 0.000001) { $query_results->{$q}{$h}{evalue}=
$evalue; }
}
The problem with this approach is that there are some edge cases
where one member of the pair of alignments scored below the threshold
and the other did not. In the first approximation, I would count the
number of times that happens. It's probably vanishingly small.
To process the entire dataset, you will probably end up doing
multiple passes on the data.
You could reduce the number of hash references in the problem by
getting rid of the {$evalue}. My starting point would be to do it in
one pass, like this instead. Warning: not syntax checked at
all...consider it pseudocode:
my $threshold = 0.000001
while(<FILE>){
my ($q, $h, $iden, $alilen, $mism, $gapo, $qst, $qend, $sst,
$send, $evalue, $bit)= split /\s+|\t+/, $_;
if ($evalue < $threshold) {
if (defined($query_results->{$h}{$q}) {
print "$q\t$h\t" . ($query_results->{$h}{$q} + $evalue) / 2 .
"\n";
$query_results->{$h}{$q} = undef;
} else {
$query_results->{$q}{$h} = $evalue;
}
}
}
# Then add code to print out the remaining ones that didn't have
pairwise hits.
A popular way for this to fail would be if you have any non-unique
sequence identifiers in your dataset.
Good luck!
-Chris Dwan
On Jun 1, 2005, at 8:25 AM, Intikhab Alam wrote:
> Hi,
>
> I performed All-vs-All blast for 214777 proteins and wanted to
> cluster these proteins. For clustering, I read that its better
> if we use average pairwise evalues for bi-directional hits.
>
> I wrote a perl script to read blast results, to store hit evalues
> for each query in a hash reference, to find bi-directionabl blast
> hits and calculate average pairwise evalues for them.
> I used hash references since these can be bit fast.
>
>
> I tested the following script on a machine with 2G of ram and on
> another with 94G of ram. It ran out of memory in both cases.
>
> Is it the case that it is not possible to hold this much data into
> a hash?
>
> Any suggestions to resolve the problem?
>
>
> Many Thanks,
>
> Intikhab Alam,
>
>
>
> I use the script as follows:
>
> To Fill the hash:
>
> while(<FILE>){
> my ($q, $h, $iden, $alilen, $mism, $gapo, $qst, $qend, $sst,
> $send, $evalue, $bit)= split /\s+|\t+/, $_;
> $query_results->{$q}{$h}{evalue}=$evalue;
> }
> close FILE;
>
> To get average pairwise evalues:
>
> my ($c, $hits)=(0, 0);
> open F, ">pairwiseValues";
> foreach my $query (%{$query_results}){
> if (exists ($query_results->{$query})){
> $c++;
> print "Query:$c:$query\n";
> foreach my $hit (%{$query_results->{$query}}){
> if (exists ($query_results->{$query}{$hit}{evalue})){
>
> my $k=$query_results->{$query}{$hit}{evalue};
> if ((exists ($query_results->{$query}{$hit}
> {evalue}) && ($query_results->{$hit}{$query}{evalue}))){
> my $i= $query_results->{$query}{$hit}{evalue};
> my $j= $query_results->{$hit}{$query}{evalue};
>
> $k=(($i+$j)/2);
>
> $query_results->{$query}{$hit}{evalue}=$k;
> $query_results->{$hit}{$query}{evalue}=$k;
> }
> $hits++;
> print F "$query\t$hit\t$k\n";
> }
> }
> }
> }
> close F;
> print "$hits hits read for $c queries \n";
>
> ----------------------------------------------
>
>
> _______________________________________________
> Bioclusters maillist - Bioclusters at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters
>
More information about the Bioclusters
mailing list