[Bioclusters] Average Pairwise Scores for reciprocal Blast hits

Wed Jun 1 11:20:21 EDT 2005

I would assume that it's a memory problem.  Scaling the system memory  
to fit the script is an expensive (though not uncommon) way to  
approach it.

One simple thing to do would be to impose a threshold at which you  
would discard pairwise hits.  That would let you get some limited set  
of answers, and at the same time verify that the number of members in  
the hash is really the problem.  So:

while(<FILE>){
     my ($q, $h, $iden, $alilen, $mism, $gapo, $qst, $qend, $sst,  
$send, $evalue, $bit)= split /\s+|\t+/, $_;
    if ($evalue < 0.000001) { $query_results->{$q}{$h}{evalue}= 
$evalue; }
}

The problem with this approach is that there are some edge cases  
where one member of the pair of alignments scored below the threshold  
and the other did not.  In the first approximation, I would count the  
number of times that happens.  It's probably vanishingly small.

To process the entire dataset, you will probably end up doing  
multiple passes on the data.

You could reduce the number of hash references in the problem by  
getting rid of the {$evalue}.  My starting point would be to do it in  
one pass, like this instead.   Warning:  not syntax checked at  
all...consider it pseudocode:

my $threshold = 0.000001
while(<FILE>){
     my ($q, $h, $iden, $alilen, $mism, $gapo, $qst, $qend, $sst,  
$send, $evalue, $bit)= split /\s+|\t+/, $_;
     if ($evalue < $threshold)  {
     if (defined($query_results->{$h}{$q}) {
       print "$q\t$h\t" . ($query_results->{$h}{$q} + $evalue) / 2 .  
"\n";
       $query_results->{$h}{$q} = undef;
      } else {
        $query_results->{$q}{$h} = $evalue;
      }
   }
}
# Then add code to print out the remaining ones that didn't have  
pairwise hits.

A popular way for this to fail would be if you have any non-unique  
sequence identifiers in your dataset.

Good luck!

-Chris Dwan

On Jun 1, 2005, at 8:25 AM, Intikhab Alam wrote:

> Hi,
>
> I performed All-vs-All blast for 214777 proteins and wanted to
> cluster these proteins. For clustering, I read that its better
> if we use average pairwise evalues for bi-directional hits.
>
> I wrote a perl script to read blast results, to store hit evalues
> for each query in a hash reference, to find bi-directionabl blast
> hits and calculate average pairwise evalues for them.
> I used hash references since these can be bit fast.
>
>
> I tested the following script on a machine with 2G of ram and on
> another with 94G of ram. It ran out of memory in both cases.
>
> Is it the case that it is not possible to hold this much data into
> a hash?
>
> Any suggestions to resolve the problem?
>
>
> Many Thanks,
>
> Intikhab Alam,
>
>
>
> I use the script as follows:
>
> To Fill the hash:
>
> while(<FILE>){
>     my ($q, $h, $iden, $alilen, $mism, $gapo, $qst, $qend, $sst,  
> $send, $evalue, $bit)= split /\s+|\t+/, $_;
>     $query_results->{$q}{$h}{evalue}=$evalue;
> }
> close FILE;
>
> To get average pairwise evalues:
>
> my ($c, $hits)=(0, 0);
> open F, ">pairwiseValues";
> foreach my $query (%{$query_results}){
>         if (exists ($query_results->{$query})){
>         $c++;
>         print "Query:$c:$query\n";
>         foreach my $hit (%{$query_results->{$query}}){
>                 if (exists ($query_results->{$query}{$hit}{evalue})){
>
>                 my $k=$query_results->{$query}{$hit}{evalue};
>                         if ((exists ($query_results->{$query}{$hit} 
> {evalue}) && ($query_results->{$hit}{$query}{evalue}))){
>                         my $i= $query_results->{$query}{$hit}{evalue};
>                         my $j= $query_results->{$hit}{$query}{evalue};
>
>                         $k=(($i+$j)/2);
>
>                         $query_results->{$query}{$hit}{evalue}=$k;
>                         $query_results->{$hit}{$query}{evalue}=$k;
>                         }
>                 $hits++;
>                 print F "$query\t$hit\t$k\n";
>                 }
>         }
>         }
> }
> close F;
> print "$hits hits read for $c queries \n";
>
> ----------------------------------------------
>
>
> _______________________________________________
> Bioclusters maillist  -  Bioclusters at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters
>