Problem parsing an XML file - PullParser


#1

Hi all,

I need to parse an XML file “line by line” because of an application
limitation, so I am trying to build a stream/pull XML parser with the
REXML library, but I can’t get it to work…

  • Does anyone know what could be causing this error? -> Missing end tag for
    ‘’
  • The error happens even with simple XML like this:
    psudo_xml = <<EOF
<?xml version="1.0" encoding="UTF-8"?> EOF

Error:

DBG: event_type: text
TXT Normal
DBG: event_type: end_element
END Mode
/opt/local/lib/ruby/1.8/rexml/parsers/baseparser.rb:330:in `pull': Missing end tag for '' (got "SChange") (REXML::ParseException)
Line:
Position:
Last 80 unconsumed characters:
        from /opt/local/lib/ruby/1.8/rexml/parsers/pullparser.rb:68:in `pull'
        from text2.rb:13:in `parse'
        from text2.rb:32:in `line_process'
        from text2.rb:47

Ruby code

require "stringio"
require 'rexml/parsers/pullparser'

class BaseParser
  def initialize
    @parser = nil
  end

  def parse(raw_xml)
    # A new parser is created on every call, so no state survives between calls
    @parser = REXML::Parsers::PullParser.new(raw_xml)

    while @parser.has_next?
      pull_event = @parser.pull
      puts "DBG: event_type: #{pull_event.event_type}"

      if pull_event.error?
        puts "\tERR #{pull_event[0]} - #{pull_event[0]}"
      elsif pull_event.start_element?
        puts "\tSTART #{pull_event[0]}"
      elsif pull_event.end_element?
        puts "\tEND #{pull_event[0]}"
      elsif pull_event.text?
        puts "\tTXT #{pull_event[0]}"
      end
    end
  end
end

def line_process(ios, myparser)
  while (line = ios.gets)
    line.chomp!
    myparser.parse(line)
  end
end

psudo_xml = <<EOF

<?xml version="1.0" encoding="UTF-8"?> Testing:service Critical Normal EOF

psudo_xml_io = StringIO.new(psudo_xml)
line_process(psudo_xml_io,BaseParser.new)

Thanks for any help in advance.


#2

On 9 déc. 08, at 21:17, Sebastian (syepes) wrote:

@parser = REXML::Parsers::PullParser.new(raw_xml)

You instantiate a new pull parser for each line, so the parser state is
lost after every line. When you feed the last parser a closing tag, it
naturally complains, because it doesn’t know what you’re
talking about :)
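
To illustrate the point above, here is a minimal sketch comparing one parser over the whole document with a fresh parser that only sees a closing tag. The sample XML is a guess (the original markup did not survive in the post; only the SChange and Mode names appear in the error output), and event_names is a hypothetical helper.

```ruby
require "rexml/parsers/pullparser"

# Hypothetical sample; the real XML from the original post is not known.
SAMPLE_XML = "<SChange>\n<Mode>Normal</Mode>\n</SChange>"

# Hypothetical helper: run ONE pull parser over +xml+ and list its event types.
def event_names(xml)
  parser = REXML::Parsers::PullParser.new(xml)
  names = []
  while parser.has_next?
    names << parser.pull.event_type
  end
  names
end

# One parser for the whole document: open tags are remembered until closed.
p event_names(SAMPLE_XML)

# One parser per line: the last line is a closing tag with nothing open,
# so the fresh parser has no matching start tag on record.
begin
  event_names("</SChange>")
rescue REXML::ParseException => e
  puts "ParseException: #{e.message.lines.first}"
end
```

The exact wording of the exception message may differ between REXML versions.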


#3

Luc H. wrote:

On 9 déc. 08, at 21:17, Sebastian (syepes) wrote:

@parser = REXML::Parsers::PullParser.new(raw_xml)

You instantiate a new pull parser for each line, so the parser state is
lost after every line. When you feed the last parser a closing tag, it
naturally complains, because it doesn’t know what you’re
talking about :)

Mmm, so is there a way to "parse" each line of the XML independently,
and is it possible with the PullParser library?


while (line = ios.gets)
parse_line_of_xml(line)
end

Regards,


#4

On 10 déc. 08, at 14:35, Sebastian (syepes) wrote:

Mmm, so is there a way to "parse" each line of the XML independently

Why do you want to do that exactly? If you don’t have the whole XML
file at once and only have an IO-like object, you can pass this object
directly to the pull parser, which should simply block until enough
data is available to produce each event.
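
A minimal sketch of that suggestion, assuming the XML shape shown (it is a guess, not the original data). StringIO stands in for the pipe here; a pipe returned by IO.popen could be passed the same way. pull_all is a hypothetical helper name.

```ruby
require "stringio"
require "rexml/parsers/pullparser"

# Hypothetical helper: pull every event from an IO-like object.
# The parser does the reading itself; the caller never calls gets.
def pull_all(io)
  parser = REXML::Parsers::PullParser.new(io)
  events = []
  while parser.has_next?
    event = parser.pull
    # event[0] is the element name or the text, depending on the event type
    events << [event.event_type, event[0]]
  end
  events
end

# StringIO stands in for the real stream; the XML shape is an assumption.
io = StringIO.new("<SChange>\n<Mode>Normal</Mode>\n</SChange>")
pull_all(io).each { |type, value| puts "#{type}: #{value.inspect}" }
```

Note that this loop still runs until the input is exhausted, so on a pipe that never closes it would keep waiting after the last element.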


#5

Hi,

On 10-Dec-08, at 8:35 AM, Sebastian (syepes) wrote:

Mmm, so is there a way to "parse" each line of the XML independently,
and is it possible with the PullParser library?

Having written a pull parser, I’d have to say: No.

The parser is going to be looking for ‘events’, and it is going to
want to deal with well-formedness issues if it is an actual xml parser.

What are you trying to do, maybe that’s a better place to start.

Cheers,
Bob


Bob H.
Recursive Design Inc.
http://www.recursive.ca/
weblog: http://www.recursive.ca/hutch


#6

On 11 déc. 08, at 11:26, Sebastian (syepes) wrote:

The real problem is that the function ex_listener processes the XML
“line by line”, because I can’t detect an EOF from IO.popen and it
will always be waiting (open) for the next change “xml stream”.

Right, but since you control the parser state you know exactly when
and where the document starts and when and where it ends, so you
should be able to close the connection by yourself.


#7

Luc H. wrote:

On 11 déc. 08, at 11:26, Sebastian (syepes) wrote:

The real problem is that the function ex_listener processes the XML
“line by line”, because I can’t detect an EOF from IO.popen and it
will always be waiting (open) for the next change “xml stream”.

Right, but since you control the parser state you know exactly when
and where the document starts and when and where it ends, so you
should be able to close the connection by yourself.

Ok, I get the point, but I don’t see how to detect the EOF (without using
some ugly code) and pass the whole *xml to the parser.
Any examples, please?

Thanks for the help.


#8

On 11 déc. 08, at 18:27, Sebastian (syepes) wrote:

Ok, I get the point, but I don’t see how to detect the EOF (without using
some ugly code) and pass the whole *xml to the parser.

I’m still not exactly sure of your exact context, but you don’t have
to detect the EOF, just parse and when you reach the end of the
document close the pipe yourself on your end.


#9

Hi,

On 11-Dec-08, at 12:31 PM, Luc H. wrote:

On 11 déc. 08, at 18:27, Sebastian (syepes) wrote:

Ok, I get the point, but I don’t see how to detect the EOF (without using
some ugly code) and pass the whole *xml to the parser.

I’m still not exactly sure of your exact context, but you don’t have
to detect the EOF, just parse and when you reach the end of the
document close the pipe yourself on your end.

Just for fun, I tried hacking something together using the pull parser
that I wrote. This pointed out one possible issue that is confusing,
I’ll get to that in a second.

How to avoid waiting for an EOF? Count events. Crudely, if you
increment the count on a start element event, and decrement on an end
element, when the count goes to zero, you’ve got what you are looking
for. This means you are letting the pull parser read the input, you
don’t do it for the parser.
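
The counting idea can be sketched with REXML's pull parser as follows. This is only an illustration under assumptions: the XML shape is a guess, read_one_document is a hypothetical helper, and StringIO stands in for the real pipe. It does not verify how REXML's own buffering behaves on a live pipe.

```ruby
require "stringio"
require "rexml/parsers/pullparser"

# Hypothetical helper: pull events until the element depth returns to zero,
# then stop without asking the stream for more data.
def read_one_document(io)
  parser = REXML::Parsers::PullParser.new(io)
  depth = 0
  events = []
  while parser.has_next?
    event = parser.pull
    events << event.event_type
    if event.start_element?
      depth += 1
    elsif event.end_element?
      depth -= 1
      break if depth.zero? # root element just closed: document complete
    end
  end
  events
end

# StringIO stands in for the pipe; the XML shape is an assumption.
io = StringIO.new("<SChange>\n<Mode>Normal</Mode>\n</SChange>")
events = read_one_document(io)
puts "document complete after #{events.size} events"
```

Because the loop breaks as soon as the root element closes, the caller decides when the document is finished instead of waiting for an EOF.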

The issue I mentioned… In my pull parser I’m assuming a file or
string input, not an IO stream. I take advantage of that by looking
ahead a bit. This isn’t a problem unless you are using a stream. In my
parser’s case, it is looking ahead to at least the end of the next
line (huge performance thing with files). The confusing effect is with
the stream input:

Testing:service Critical Normal Testing:service Critical Normal

The close of the first SChange element isn’t reported until the next
line is read, which happens to include the start of the next element.
This is a delayed effect that is maybe not the best for a stream
input. If you add a blank line between the events the problem goes
away (but it’ll read the blank line before reporting which shouldn’t
be a problem).

It is possible that this is affecting your testing.

Cheers,
Bob




Bob H.
Recursive Design Inc.
http://www.recursive.ca/
weblog: http://www.recursive.ca/hutch


#10

Bob H. wrote:

Hi,

On 10-Dec-08, at 8:35 AM, Sebastian (syepes) wrote:

Mmm, so is there a way to "parse" each line of the XML independently,
and is it possible with the PullParser library?

Having written a pull parser, I’d have to say: No.

The parser is going to be looking for ‘events’, and it is going to
want to deal with well-formedness issues if it is an actual xml parser.

What are you trying to do, maybe that’s a better place to start.

Cheers,
Bob



Ok, this is the problem I am trying to solve:
I need to parse XML that comes from the stdout of a Unix program. The
program sends an xml* stream when it detects a change, and the IO.popen
pipe stays open until the next change.

The real problem is that the function ex_listener processes the XML
“line by line”, because I can’t detect an EOF from IO.popen and it
will always be waiting (open) for the next change “xml stream”.

I have tried using “lines = ios.readlines”, but it does not work because
there’s no EOF. Is there some other way of doing this?

I would appreciate any suggestions on how to solve this problem.

*xml: Sent when a change is detected

Testing:service Critical Normal

Ruby

UNIX_PROG = "/bin/xml_stream"

def ex_connect
  ios = IO.popen(UNIX_PROG, "w+")
  ios.sync = true

  line = ios.gets
  if line =~ /xml/
    puts "INF: Connected OK (XML)"
    ios.puts ""
    return ios
  else
    puts "ERR: Cannot connect"
    exit 1
  end
end

def ex_listener(ios)
  while (line = ios.gets)
    line.chomp!
    # %r{} avoids having to escape the slash in the closing tag
    if line =~ %r{</Events>}
      puts "INF: END of program"
      exit 0
    end

    puts "INF: #{line} - #{line.size}"
    parse_line_of_xml(line)
  end
end

ios = ex_connect
ex_listener(ios) # Processes the XML stream

Regards,


#11

If I understand correctly, you want to keep an IO stream open, and
react to certain elements as they appear? That’s a textbook SAX case,
not pull-parsing. Register a SAX handler for your SChange events, and
point your IO stream at it.

I’d use libxml-ruby, but REXML has a stream parser that is SAX-like.
You’d use it something like this (untested):

require "rexml/document"
require "rexml/streamlistener"
include REXML

class Handler
  include StreamListener

  def tag_start(name, attrs)
    if name == "SChange"
      # do something
      puts attrs
    end
  end
end

Document.parse_stream(your_io_stream, Handler.new)

– Mark.
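
As a follow-up to the listener above, here is a hedged sketch that also uses the text and tag_end callbacks of REXML::StreamListener to capture element content. ModeCollector and collect_modes are hypothetical names, the XML shape is an assumption, and StringIO stands in for the real IO stream.

```ruby
require "stringio"
require "rexml/document"
require "rexml/streamlistener"

# Hypothetical listener: records the text found inside each Mode element.
class ModeCollector
  include REXML::StreamListener

  attr_reader :modes

  def initialize
    @modes = []
    @in_mode = false
  end

  # Called for every opening tag
  def tag_start(name, _attrs)
    @in_mode = true if name == "Mode"
  end

  # Called for character data; only kept while inside a Mode element
  def text(data)
    @modes << data if @in_mode
  end

  # Called for every closing tag
  def tag_end(name)
    @in_mode = false if name == "Mode"
  end
end

# Hypothetical helper: feed an IO-like object to the stream parser.
def collect_modes(io)
  listener = ModeCollector.new
  REXML::Document.parse_stream(io, listener)
  listener.modes
end

# The XML shape is an assumption; the original markup is not known.
p collect_modes(StringIO.new("<SChange><Mode>Normal</Mode></SChange>"))
```

The callbacks fire as the parser reads, so the handler can react to each element without the caller splitting the input into lines.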