[Biodevelopers] (/o*/) (\*o\) XML for huge DB!

Dan Bolser dmb at mrc-dunn.cam.ac.uk
Sat Aug 2 08:48:25 EDT 2003


YAY!

On Fri, 1 Aug 2003, Patrick McConnell wrote:
> 
> Perhaps your parser is returning character data in separate chunks so that
> you are getting events like this:
>       (1) Element start: hsp_num
>       (2) Characters: 1
>       (3) Characters: 2
>       (4) Element end: hsp_num
> In this case, your hsp num is supposed to be 12, but gets split because the
> character buffer was filled at this random time.  I think such an occurrence
> is legal for a SAX parser, but I am no expert.

I think this was the problem! 

I fixed my code to set '$currentTag = undef' in the 'endOfTag' handler,
which allows me to use '$data{$currentTag} .= $text' in my 'char'
handler.

Previously I only set '$data{$currentTag} = $text' if it was
unset (I reset %data at appropriate points), to avoid picking up
stray character text coming AFTER the $currentTag element, belonging
to some parent tag which I was ignoring.
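
The fix described above can be sketched roughly as follows. This is a
minimal illustration in Python's xml.sax rather than the Perl of the
actual script, which is not shown here; the class name RecordHandler,
the 'wanted' list and the 'hsp' parent element are assumptions made up
for the example.

```python
import xml.sax


class RecordHandler(xml.sax.ContentHandler):
    """Collect the text of selected elements, tolerating split character events."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)   # element names whose text we keep (hypothetical)
        self.current_tag = None     # None while outside a wanted element
        self.data = {}

    def startElement(self, name, attrs):
        if name in self.wanted:
            self.current_tag = name
            self.data[name] = ""    # start a fresh buffer for this element

    def characters(self, content):
        # The parser may call this several times for one element,
        # so append to the buffer instead of assigning.
        if self.current_tag is not None:
            self.data[self.current_tag] += content

    def endElement(self, name):
        # Clearing the tag here keeps trailing text of an ignored parent
        # element from being appended to the value just collected.
        if name == self.current_tag:
            self.current_tag = None


if __name__ == "__main__":
    handler = RecordHandler(["hsp_num"])
    xml.sax.parseString(b"<hsp><hsp_num>12</hsp_num>\n</hsp>", handler)
    print(handler.data)
```

Because the tag is cleared at the end of the element, the newline that
follows </hsp_num> inside the parent is ignored, and a value split into
'1' and '2' is still collected as '12'.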

> You will have to post your SAX handler code for us to be any help, I think.

I will happily post the whole script, which is ~ 150 lines without frills.

It is fast too (compared to previously), uploading ~ 2000 rows per second
to mysql from one cpu.

By printing the data to a named pipe (mkfifo), I can easily invoke more
than one process, all writing to the same pipe - mysql does its optimized
"load data infile 'named.pipe' into table..." and everything works great.

I benchmarked 10,000,000 'flock' commands (used to keep pipe writes
exclusive to one process for the duration of one record), and it only
takes about 20 seconds, so the whole thing scales well.
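
The locking scheme described above might look something like this. It
is a sketch in Python rather than the Perl of the actual script, and
it assumes a separate lock file shared by all writers and
tab-separated records (the default field format for mysql's 'load data
infile'); the post does not say which file the real script locks. The
function name write_record is made up for the example.

```python
import fcntl


def write_record(out, lock_file, fields):
    """Write one tab-separated record while holding an exclusive lock."""
    line = "\t".join(fields) + "\n"
    # Blocks until no other process holds the lock (assumed shared lock file).
    fcntl.flock(lock_file, fcntl.LOCK_EX)
    try:
        out.write(line)
        out.flush()   # push the whole record out before releasing the lock
    finally:
        fcntl.flock(lock_file, fcntl.LOCK_UN)
```

Each writer process would open the same pipe for writing and the same
lock file, then call write_record once per row. Note that opening a
FIFO for writing blocks until a reader (here, mysql) has opened the
other end.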


Thanks very much for all the help from everyone, I would have been
lost without it!

Cheers, 
Dan.



