I would assume that it's a memory problem. Scaling the system memory to fit the script is an expensive (though not uncommon) way to approach it. One simple thing to do would be to impose a threshold at which you would discard pairwise hits. That would let you get some limited set of answers, and at the same time verify that the number of members in the hash is really the problem. So: while(<FILE>){ my ($q, $h, $iden, $alilen, $mism, $gapo, $qst, $qend, $sst, $send, $evalue, $bit)= split /\s+|\t+/, $_; if ($evalue < 0.000001) { $query_results->{$q}{$h}{evalue}= $evalue; } } The problem with this approach is that there are some edge cases where one member of the pair of alignments scored below the threshold and the other did not. In the first approximation, I would count the number of times that happens. It's probably vanishingly small. To process the entire dataset, you will probably end up doing multiple passes on the data. You could reduce the number of hash references in the problem by getting rid of the {$evalue}. My starting point would be to do it in one pass, like this instead. Warning: not syntax checked at all...consider it pseudocode: my $threshold = 0.000001 while(<FILE>){ my ($q, $h, $iden, $alilen, $mism, $gapo, $qst, $qend, $sst, $send, $evalue, $bit)= split /\s+|\t+/, $_; if ($evalue < $threshold) { if (defined($query_results->{$h}{$q}) { print "$q\t$h\t" . ($query_results->{$h}{$q} + $evalue) / 2 . "\n"; $query_results->{$h}{$q} = undef; } else { $query_results->{$q}{$h} = $evalue; } } } # Then add code to print out the remaining ones that didn't have pairwise hits. A popular way for this to fail would be if you have any non-unique sequence identifiers in your dataset. Good luck! -Chris Dwan On Jun 1, 2005, at 8:25 AM, Intikhab Alam wrote: > Hi, > > I performed All-vs-All blast for 214777 proteins and wanted to > cluster these proteins. For clustering, I read that its better > if we use average pairwise evalues for bi-directional hits. > > I wrote a perl script to read blast results, to store hit evalues > for each query in a hash reference, to find bi-directionabl blast > hits and calculate average pairwise evalues for them. > I used hash references since these can be bit fast. > > > I tested the following script on a machine with 2G of ram and on > another with 94G of ram. It ran out of memory in both cases. > > Is it the case that it is not possible to hold this much data into > a hash? > > Any suggestions to resolve the problem? > > > Many Thanks, > > Intikhab Alam, > > > > I use the script as follows: > > To Fill the hash: > > while(<FILE>){ > my ($q, $h, $iden, $alilen, $mism, $gapo, $qst, $qend, $sst, > $send, $evalue, $bit)= split /\s+|\t+/, $_; > $query_results->{$q}{$h}{evalue}=$evalue; > } > close FILE; > > To get average pairwise evalues: > > my ($c, $hits)=(0, 0); > open F, ">pairwiseValues"; > foreach my $query (%{$query_results}){ > if (exists ($query_results->{$query})){ > $c++; > print "Query:$c:$query\n"; > foreach my $hit (%{$query_results->{$query}}){ > if (exists ($query_results->{$query}{$hit}{evalue})){ > > my $k=$query_results->{$query}{$hit}{evalue}; > if ((exists ($query_results->{$query}{$hit} > {evalue}) && ($query_results->{$hit}{$query}{evalue}))){ > my $i= $query_results->{$query}{$hit}{evalue}; > my $j= $query_results->{$hit}{$query}{evalue}; > > $k=(($i+$j)/2); > > $query_results->{$query}{$hit}{evalue}=$k; > $query_results->{$hit}{$query}{evalue}=$k; > } > $hits++; > print F "$query\t$hit\t$k\n"; > } > } > } > } > close F; > print "$hits hits read for $c queries \n"; > > ---------------------------------------------- > > > _______________________________________________ > Bioclusters maillist - Bioclusters at bioinformatics.org > https://bioinformatics.org/mailman/listinfo/bioclusters >