[Bioclusters] BioPerl and memory handling

Ian Korf ifkorf at ucdavis.edu
Mon Nov 29 18:32:50 EST 2004

After a recent conversation about memory in Perl, I decided to do some 
actual experiments. Here's the email I composed on the subject.

I looked into the Perl memory issue. It's true that if you allocate a 
huge amount of memory, Perl doesn't like to give it back. But the 
situation is not as bad as you might think. Let's say you do something 
like this:

	$FOO = 'N' x 100000000;

That will allocate a chunk of about 192 MB on my system. It doesn't 
matter whether this is a package variable or a lexical:

	our $FOO = 'N' x 100000000; # 192 MB
	my  $FOO = 'N' x 100000000; # 192 MB

If you put this in a subroutine

	sub foo {my $FOO = 'N' x 100000000}

and you call this a bunch of times

	foo(); foo(); foo(); foo(); foo(); foo(); foo();

the memory footprint stays at 192 MB. So Perl's garbage collection 
works just fine. Perl doesn't let go of the memory it has taken from 
the OS, but it is happy to reassign the memory it has already reserved.
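If you want to watch the footprint yourself, here's a minimal sketch, 
assuming a Unix-like system where ps reports resident set size in 
kilobytes (this is roughly how I'd check it, not part of the test 
above):

	# Resident set size of this process, in KB, via ps.
	sub rss_kb {
		my $kb = `ps -o rss= -p $$`;
		chomp $kb;
		return $kb;
	}

	sub foo { my $FOO = 'N' x 100000000 }

	printf "before: %d KB\n", rss_kb();
	foo() for 1 .. 7;
	printf "after:  %d KB\n", rss_kb(); # stays near one allocation's worth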

Here's something odd. The following labeled block looks like it should 
use no memory once it exits.

	BLOCK: {
		my $FOO = 'N' x 100000000;
	}

The weird thing is that after executing the block, the memory footprint 
is still 192 MB, as if the string hadn't been garbage collected.

Now look at this:

	my $foo = 'X' x 100000000;
	undef $foo;

This has a memory footprint of 96 MB. After some more experimentation, 
I have come up with the following interpretation of memory allocation 
and garbage collection in Perl. Perl will reuse memory for a variable 
of a given name (package or lexical alike), so there is no fear of 
memory leaks in loops, for example. But each differently named variable 
retains its own minimum memory: the size of the largest chunk ever 
allocated to that variable, or half that amount if other variables have 
already taken some of the space. You can get any variable to give up 
half its memory with undef, but this costs a little more CPU time. 
Here's some test code that shows this:

	sub foo {my $FOO = 'N' x 100000000}
	for (my $i = 0; $i < 50; $i++) {foo()} # 29.420u 1.040s

	sub bar {my $BAR = 'N' x 100000000; undef $BAR}
	for (my $i = 0; $i < 50; $i++) {bar()} # 26.880u 21.220s

The increase from 1 sec to 21 sec system CPU time is all the extra 
memory allocation and freeing associated with the undef statement. Why 
the user time is less in the undef example is a mystery to me.
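Figures in that u/s format are what the C shell's time command prints; 
you can also measure from inside Perl with the built-in times() 
function. A minimal sketch:

	# times() returns user and system CPU seconds for this process
	# (plus child process times, which we ignore here).
	sub bar { my $BAR = 'N' x 100000000; undef $BAR }

	my ($u0, $s0) = times();
	for (my $i = 0; $i < 50; $i++) { bar() }
	my ($u1, $s1) = times();
	printf "user %.2fs, system %.2fs\n", $u1 - $u0, $s1 - $s0;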

OK, to make a hideously long story short, use undef to save memory and 
use the same variable name over and over if you can.
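In code, that advice looks something like this sketch (the big string 
just stands in for whatever data you're actually holding):

	my $buffer;                      # one name, reused every pass
	for my $i (1 .. 10) {
		$buffer = 'N' x 100000000;   # reusing the name recycles memory
		# ... work with $buffer here ...
		undef $buffer;               # hand back half the space right away
	}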


But this email thread has turned to BPlite, of which I am the original 
author. BPlite is designed to parse a stream, reading only a minimal 
amount of information at a time. The disadvantage of this is that 
anything reported at the end of a BLAST report, such as the statistics, 
isn't available until the parse reaches the end (the original BPlite 
ignored statistics entirely). I like the new SearchIO interface better 
than BPlite, but for my own purposes I generally work from a table 
format and don't use a BLAST parser very often.
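For reference, a table dump with SearchIO only takes a few lines 
anyway; here's a minimal sketch (the file name is just a placeholder):

	use Bio::SearchIO;

	# Walk the report one result/hit/HSP at a time and print a table.
	my $in = Bio::SearchIO->new(-format => 'blast', -file => 'report.blast');
	while (my $result = $in->next_result) {
		while (my $hit = $result->next_hit) {
			while (my $hsp = $hit->next_hsp) {
				print join("\t", $result->query_name, $hit->name,
					$hsp->evalue), "\n";
			}
		}
	}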


On Nov 29, 2004, at 3:03 PM, Mike Cariaso wrote:

> This message is being cross-posted from bioclusters to bioperl. I'd
> appreciate a clarification from anyone in bioperl who can speak more
> authoritatively than my semi-speculation.
>
> Perl does have a garbage collector. It is not wildly sophisticated.
> As you've suggested, it uses simple reference counting. This means
> that circular references will cause memory to be held until program
> termination.
>
> However, I think you are overstating the inefficiency in the system.
> While the perl GC *may* not release memory to the system, it does at
> least allow memory to be reused within the process. If the system
> instead behaved as you describe, I think perl would hemorrhage
> memory and would be unsuitable for any long running processes.
>
> However, I can say with considerable certainty that BPLite is able
> to handle blast reports which cause SearchIO to thrash. I've
> attributed this to BPLite being a true stream processor, while
> SearchIO seems to slurp the whole file and object hierarchy into
> memory.
>
> I know that SearchIO is the preferred blast parser, but it seems
> that BPLite is not quite dead, for the reasons above. If this is in
> fact the unique benefit of BPLite, perhaps the documentation should
> be clearer about this, as I suspect I'm not the only person to have
> had to reengineer a substantial piece of code to adjust between
> their different models. Had I known of this difference early on, I
> would have chosen BPLite.
>
> So, bioperlers (especially Jason Stajich), can you shed any light on
> this vestigial bioperl organ?
> --- Malay <mbasu at mail.nih.gov> wrote:
>> Michael Cariaso wrote:
>>> Michael Maibaum wrote:
>>>> On 10 Nov 2004, at 18:25, Al Tucker wrote:
>>>>> Hi everybody.
>>>>> We're new to the Inquiry Xserve scientific cluster and trying to
>>>>> iron out a few things.
>>>>> One thing we seem to be coming up against is an out of memory
>>>>> error when getting large sequence analysis results (5,000 seq at
>>>>> least, and above) back from BTblastall. The problem seems to be
>>>>> with BioPerl.
>>>>> Might anyone here know if BioPerl knows enough not to try and
>>>>> access more than 4 GB of RAM in a single process (an OS X
>>>>> limit)? I'm told Blastall and BTblastall are, and will chunk
>>>>> problems accordingly, but we're not certain if BioPerl is when
>>>>> called to merge large Blast results back together. It's the
>>>>> default version 1.2.3 that's supplied, btw, and OS X 10.3.5 with
>>>>> all current updates just short of the latest 10.3.6 update.
>>>> BioPerl tries to slurp up the entire results set from a BLAST
>>>> query, and build objects for each little bit of the result set,
>>>> and uses lots of memory. It doesn't have anything smart at all
>>>> about breaking up the job within the result set, afaik.
>> This is not really true. The SearchIO module, as far as I know,
>> works on a stream.
>>>> I ended up stripping out results that hit a certain threshold
>>>> size to run on a different, large-memory opteron/linux box, and
>>>> I'm experimenting with replacing BioPerl with BioPython etc.
>>>> Michael
>>> You may find that the BPLite parser works better when dealing with
>>> large blast result files. It's not as clean or as well maintained,
>>> but it does the job nicely for my current needs, which overloaded
>>> the usual parser.
>> There is basically no difference between BPLite and other BLAST
>> parser interfaces in Bioperl.
>> The problem lies in the core of Perl itself. Perl does not release
>> memory to the system even after the reference count of an object
>> created in memory goes to 0, unless the program is actually over.
>> Perl's object system is highly inefficient at handling large
>> numbers of objects created in memory.
>> -Malay
> =====
> Mike Cariaso
> _______________________________________________
> Bioclusters maillist  -  Bioclusters at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters
