[Biophp-dev] XML parser
yvan
biophp-dev@bioinformatics.org
Wed, 27 Aug 2003 09:39:28 +1000
Thanks Dan. Simply concatenating the character data and emptying the array at
the right moment fixed the problem.
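
A minimal Perl sketch of that pattern, for illustration only. The handlers are called by hand to simulate a parser delivering the text of one element in two pieces, so no XML module is needed. The names %data, $current, start_tag, char_data and end_tag are hypothetical and come from neither script in this thread; Hsp_qseq is used only as an example tag.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %data;       # collected text, keyed by tag name
my $current;    # name of the tag currently open, undef between tags

sub start_tag {
    my ($tag) = @_;
    $current = $tag;
    $data{$tag} = '';                 # empty the slot when the tag opens
}

sub char_data {
    my ($text) = @_;
    return unless defined $current;   # ignore text that falls between tags
    $data{$current} .= $text;         # append: the text may arrive in pieces
}

sub end_tag {
    $current = undef;                 # unset, so later whitespace is not appended
}

# Simulated event stream: one element's text arrives as two events.
start_tag('Hsp_qseq');
char_data('MKVLAA');
char_data('GIVGLL');
end_tag();
char_data("\n  ");                    # whitespace between tags, ignored

print "$data{Hsp_qseq}\n";
```

Assigning with = in char_data instead of appending with .= would keep only one of the two pieces, which matches the truncation described in the original question.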
Dan Bolser wrote:
>I had exactly that problem!
>
>My bug was not 'unsetting' the $currentTag pointer
>at the 'endOfTag' event. Because of that I couldn't
>
>$data{$currentTag} .= $characterData
>
>on the 'characterData' event. Instead I was doing something
>like
>
>$data{$currentTag} = $characterData if !$data{$currentTag};
>
>Sometimes not all the characterData arrives in one event
>(something to do with the characterData buffer), so the above
>approach sometimes gave truncated data.
>
>Once I properly unset $currentTag, I could append all the
>characterData for each tag properly!
>
>Here is my script. It uses a few optimizations and some
>specific tricks for my needs ($group), but most of this
>is behind the scenes...
>
>
>__SKIP__
>
>Skipping preamble,
>in brief...
>
>PIPE = named pipe (fifo) for 'load data infile'.
>$group = custom HSP grouping object.
>@file = list of results files to parse.
>$DIR = results files directory.
>
>use PDB_ISL; = custom data / the group object.
>
>back to the action...
>
>__RESUME__
>
>use XML::Parser;
>
>#------------------------------------------------
>#
># Initialise parser.
>#
>
>my $p = XML::Parser->new(
> Handlers => {
> Start => \&startEvent,
> Char => \&charEvent,
> End => \&endEvent,
> }
>);
>
>#------------------------------------------------
>#
># Set Globals for event handler communication.
>#
>
>my ( $pos, %que, %itr, %hit, %hsp ); # NB: $pos == $currentTag
>
># Here I decide which fields I want data from...
>
>my %QUE = %PDB_ISL::QUE; # Query sequence data fields
>my %ITR = %PDB_ISL::ITR; # Iteration data fields
>my %HIT = %PDB_ISL::HIT; # Hit sequence data fields
>my %HSP = %PDB_ISL::HSP; # High Scoring Segment Pair data fields.
>
>
>my @SCHEMA = @PDB_ISL::SCHEMA; # TABLE SCHEMA
>
>#------------------------------------------------
>#
># Begin.
>#
>
>foreach ( @file ){
> warn "Processing $DIR/$_\n";
> unless (-s "$DIR/$_"){
> warn "File is missing or empty\n"; # -s is false for both cases
> next;
> }
> $group = PDB_ISL->group(); # Get new HSP group object.
>
> $p->parsefile( "$DIR/$_" ); # For details, see Event handlers.
>}
>
>print "OK\n";
>
>#------------------------------------------------
>#
># Event handlers.
>#
>
>sub startEvent{ # <open_tag>
> my ( $self, $elem, %attr ) = @_;
>
> $pos = $elem; # Set currentTag!
>
> # NB: CASE order = frequency of tag occurrence!
>
> #print "OPEN $elem\n";
>
> if ($pos eq 'Hsp'){ # CASE <HSP>
> #print "\nNEW HSP\n";
> %hsp = %HSP; # Reset HSP data
> }
> elsif ($pos eq 'Hit'){ # CASE <HIT>
> #print "NEW HIT\n";
> %hit = %HIT; # Reset HIT data
> }
> elsif ($pos eq 'Iteration'){ # CASE <ITERATION>
> #print "NEW ITR\n";
> %itr = %ITR; # Reset ITR data
> }
> elsif ($pos eq 'BlastOutput'){# CASE <OUTPUT> (one query per file)
> #print "NEW OUT\n";
> %que = %QUE; # Reset QUE data
> }
>}
>
>sub charEvent{ # <>between tags</>
> my ( $expat, $text ) = @_;
>
> return unless $pos; # Very important!
>
> # NB: Only parse given fields. Ignore other data!
>
> # NB: CASE order as above!
>
> if ( exists $hsp{$pos} ){
> $hsp{$pos} .= $text; # Save HSP field data
> #print "HSP:$pos:$text\n";
> }
> elsif ( exists $hit{$pos} ){
> $hit{$pos} .= $text; # Save HIT field data
> #print "HIT:$pos:$text\n";
> }
> elsif ( exists $itr{$pos} ){
> $itr{$pos} .= $text; # Save ITR field data
> #print "ITR:$pos:$text\n";
> }
> elsif ( exists $que{$pos} ){
> $que{$pos} .= $text; # Save QUE field data
> #print "QUE:$pos:$text\n";
> }
>}
>
>sub endEvent{ # </close_tag>
> my ( $self, $elem ) = @_;
>
> $pos = undef; # Unset currentTag. Very important!
>
> #print "CLOSE $elem\n";
>
> if ($elem eq 'Hsp'){ # CASE </HSP>
>
> # TAKE A COPY!
> my %data = ( %que, %itr, %hit, %hsp );
>
> #print join("\t", map { $data{$_} } @SCHEMA),"\n";
>
> $group->add( \%data ); # ADD TO GROUP!
> }
>
> elsif($elem eq 'Hit'){ # CASE </HIT>
> # Hello Mum!
>
> }
>
> elsif($elem eq 'Iteration'){ # CASE </ITR>
>
> print "ITR:",
> $itr{'Iteration_iter-num'},"\n";
> print "MSG:",
> $itr{'Iteration_message'}, "\n" if $itr{'Iteration_message'};
>
> }
>
> elsif ($elem eq 'BlastOutput'){ # CASE </OUTPUT>
>
> my $data = $group->getBest;
>
> flock(PIPE, 2) or die "$!:Can't lock pipe $PIPE\n";
>
> for(my $i=0; $i<@$data; $i++){
> print PIPE join("\t", map { $data->[$i]->{$_} } @SCHEMA),"\n";
> }
>
> flock(PIPE, 8) or die "$!:Can't free pipe $PIPE\n";
>
> #exit;
> }
>}
>
>__END__
>
>yvan said:
>
>
>>Hi all,
>>
>>I am finishing up a parser for the XML output format of BLAST using the expat
>>library. When I collect the data returned by the dataHandler function, some of
>>it is truncated, or an end of line is added, causing a duplication. Have you
>>already observed something similar? As it doesn't always happen, I don't
>>suspect a script error. I am using version 1.95.1 of expat; would an upgrade
>>solve this problem?
>>
>>cheers
>>
>>yvan
>>
>>
>>_______________________________________________
>>Biophp-dev mailing list
>>Biophp-dev@bioinformatics.org
>>https://bioinformatics.org/mailman/listinfo/biophp-dev
>>
>>
>
>
>