[Bioclusters] Average Pairwise Scores for reciprocal Blast hits

Wed Jun 1 08:25:03 EDT 2005

Hi,

I performed All-vs-All blast for 214777 proteins and wanted to 
cluster these proteins. For clustering, I read that its better
if we use average pairwise evalues for bi-directional hits. 

I wrote a perl script to read blast results, to store hit evalues 
for each query in a hash reference, to find bi-directionabl blast
hits and calculate average pairwise evalues for them.
I used hash references since these can be bit fast. 

I tested the following script on a machine with 2G of ram and on 
another with 94G of ram. It ran out of memory in both cases.

Is it the case that it is not possible to hold this much data into 
a hash? 

Any suggestions to resolve the problem?

Many Thanks,

Intikhab Alam,

I use the script as follows:

To Fill the hash:

while(<FILE>){
    my ($q, $h, $iden, $alilen, $mism, $gapo, $qst, $qend, $sst, $send, $evalue, $bit)= split /\s+|\t+/, $_;
    $query_results->{$q}{$h}{evalue}=$evalue;
}
close FILE;

To get average pairwise evalues:

my ($c, $hits)=(0, 0);
open F, ">pairwiseValues";
foreach my $query (%{$query_results}){
        if (exists ($query_results->{$query})){
        $c++;
        print "Query:$c:$query\n";
        foreach my $hit (%{$query_results->{$query}}){
                if (exists ($query_results->{$query}{$hit}{evalue})){

                my $k=$query_results->{$query}{$hit}{evalue};
                        if ((exists ($query_results->{$query}{$hit}{evalue}) && ($query_results->{$hit}{$query}{evalue}))){
                        my $i= $query_results->{$query}{$hit}{evalue};
                        my $j= $query_results->{$hit}{$query}{evalue};

                        $k=(($i+$j)/2);

                        $query_results->{$query}{$hit}{evalue}=$k;
                        $query_results->{$hit}{$query}{evalue}=$k;
                        }
                $hits++;
                print F "$query\t$hit\t$k\n";
                }
        }
        }
}
close F; 
print "$hits hits read for $c queries \n";

----------------------------------------------