Forum: Ruby Problem parsing an XML - PullParser

Sebastian YEPES (syepes)
on 2008-12-09 21:24
Hi all,

I need to parse an XML file "line by line" because of an application
limitation, so I am trying to build a stream/pull XML parser with the
REXML library, but I can't get it to work.

 - Does anyone know what could be causing this error? -> Missing end tag for
''
 - This error happens even with a simple XML document like this one:
psudo_xml = <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<SChange>
</SChange>
EOF



Error:
----------
DBG: event_type: text
  TXT Normal
DBG: event_type: end_element
  END Mode
/opt/local/lib/ruby/1.8/rexml/parsers/baseparser.rb:330:in `pull':
Missing end tag for '' (got "SChange") (REXML::ParseException)
Line:
Position:
Last 80 unconsumed characters:
  from /opt/local/lib/ruby/1.8/rexml/parsers/pullparser.rb:68:in `pull'
  from text2.rb:13:in `parse'
  from text2.rb:32:in `line_process'
  from text2.rb:47



Ruby code
----------
require "stringio"
require 'rexml/parsers/pullparser'

class BaseParser
  def initialize
    @parser = nil
  end

  def parse(raw_xml)
    @parser = REXML::Parsers::PullParser.new(raw_xml)

    while @parser.has_next?
      pull_event = @parser.pull
      puts "DBG: event_type: #{pull_event.event_type}"

      if pull_event.error?
        puts "\tERR #{pull_event[0]} - #{pull_event[0]}"
      elsif pull_event.start_element?
        puts "\tSTART #{pull_event[0]}"
      elsif pull_event.end_element?
        puts "\tEND #{pull_event[0]}"
      elsif pull_event.text?
        puts "\tTXT #{pull_event[0]}"
      end
    end
  end
end

def line_process(ios,myparser)
  while (line = ios.gets)
    line.chomp!
    myparser.parse(line)
  end
end


psudo_xml = <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<SChange>
  <Service>Testing:service</Service>
  <Status>Critical</Status>
  <Mode>Normal</Mode>
</SChange>
EOF

psudo_xml_io = StringIO.new(psudo_xml)
line_process(psudo_xml_io,BaseParser.new)
----------


Thanks for any help in advance.
Luc Heinrich (Guest)
on 2008-12-10 08:46
(Received via mailing list)
On 9 déc. 08, at 21:17, Sebastian (syepes) wrote:

>    @parser = REXML::Parsers::PullParser.new(raw_xml)

You instantiate a *new* pull parser for *each* line, so the state is
obviously lost after each line and when you feed the last parser with
</SChange> it naturally complains because it doesn't know what you're
talking about :)
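
A minimal sketch of that point, assuming a REXML version that still ships `REXML::Parsers::PullParser` (the helper name `element_events` and the shortened sample document are made up for illustration). One parser that sees the whole document can match `</SChange>` against the tag it opened earlier; a fresh parser that is handed only the closing line has nothing to match it with and raises `REXML::ParseException`.

```ruby
require "rexml/parsers/pullparser"

# Hypothetical helper: pull every event from the given XML string and
# record the start/end element events in order.
def element_events(xml)
  parser = REXML::Parsers::PullParser.new(xml)
  names = []
  while parser.has_next?
    event = parser.pull
    names << "start:#{event[0]}" if event.start_element?
    names << "end:#{event[0]}" if event.end_element?
  end
  names
end

whole = "<SChange>\n<Mode>Normal</Mode>\n</SChange>\n"

# One parser for the whole document remembers which tags are open.
whole_events = element_events(whole)
puts whole_events.inspect

# A fresh parser that only sees the closing line has no open tag to match.
lone_line_failed = false
begin
  element_events("</SChange>")
rescue REXML::ParseException => e
  lone_line_failed = true
  puts "ParseException: #{e.message.lines.first}"
end
```

The exact wording of the exception message depends on the REXML version, so the sketch only relies on the exception class.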
Sebastian YEPES (syepes)
on 2008-12-10 14:41
Luc Heinrich wrote:
> On 9 déc. 08, at 21:17, Sebastian (syepes) wrote:
>
>>    @parser = REXML::Parsers::PullParser.new(raw_xml)
>
> You instantiate a *new* pull parser for *each* line, so the state is
> obviously lost after each line and when you feed the last parser with
> </SChange> it naturally complains because it doesn't know what you're
> talking about :)

Mmm, so is there a way to *parse* each line of the XML independently,
and is it possible with the PullParser library?

--
while (line = ios.gets)
  *parse_line_of_xml(line)*
end
--

Regards,
Luc Heinrich (Guest)
on 2008-12-10 15:14
(Received via mailing list)
On 10 déc. 08, at 14:35, Sebastian (syepes) wrote:

> Mmm, so is there a way to *parse* each line of the XML independently

Why do you want to do that exactly? If you don't have the whole XML
file at once and only have an IO-like object, you can pass this object
directly to the pull parser, which should simply block until enough
data is available to produce each event.
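
As a rough sketch of this suggestion (not tested against the original program): the pipe and the producer thread below are stand-ins for the external process, and the shortened document is made up for illustration. The reading end of the pipe is handed straight to the pull parser, and `pull` simply waits whenever the next piece of markup has not been written yet.

```ruby
require "rexml/parsers/pullparser"

reader, writer = IO.pipe

# Hypothetical producer standing in for the external program: it writes the
# document one line at a time, with a short pause, and then closes its end.
producer = Thread.new do
  ["<SChange>", "  <Status>Critical</Status>", "</SChange>"].each do |line|
    writer.puts line
    sleep 0.01
  end
  writer.close
end

# The parser reads from the IO object itself; no manual line handling.
parser = REXML::Parsers::PullParser.new(reader)
seen = []
while parser.has_next?
  event = parser.pull
  seen << "start:#{event[0]}" if event.start_element?
  seen << "end:#{event[0]}" if event.end_element?
end

producer.join
reader.close
puts seen.inspect
```

Note that this loop ends only because the producer closes the pipe. With a pipe that stays open, `has_next?` would keep waiting after the last closing tag.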
Bob Hutchison (Guest)
on 2008-12-10 16:16
(Received via mailing list)
Hi,

On 10-Dec-08, at 8:35 AM, Sebastian (syepes) wrote:

> Mmm, so is there a way to *parse* each line of the XML independently,
> and is it possible with the PullParser library?


Having written a pull parser, I'd have to say: No.

The parser is going to be looking for 'events', and it is going to
want to deal with well-formedness issues if it is an actual xml parser.

What are you trying to do? Maybe that's a better place to start.

Cheers,
Bob

----
Bob Hutchison
Recursive Design Inc.
http://www.recursive.ca/
weblog: http://www.recursive.ca/hutch
Sebastian YEPES (syepes)
on 2008-12-11 11:33
Bob Hutchison wrote:
> Hi,
>
> On 10-Dec-08, at 8:35 AM, Sebastian (syepes) wrote:
>
>> Mmm, so is there a way to *parse* each line of the XML independently,
>> and is it possible with the PullParser library?
>
>
> Having written a pull parser, I'd have to say: No.
>
> The parser is going to be looking for 'events', and it is going to
> want to deal with well-formedness issues if it is an actual xml parser.
>
> What are you trying to do? Maybe that's a better place to start.
>
> Cheers,
> Bob
>
> ----
> Bob Hutchison
> Recursive Design Inc.
> http://www.recursive.ca/
> weblog: http://www.recursive.ca/hutch


OK, this is the problem I am trying to solve:
I need to parse XML that comes from the stdout of a Unix program. The
program sends an XML* stream when it detects a change, and the IO.popen
pipe stays open until the next change.

The real problem is that the function ex_listener processes the XML
"line by line", because I can't detect an EOF from the IO.popen pipe; it
is always waiting (open) for the next change ("XML stream").


I have tried using "lines = ios.readlines", but it does not work because
there's no EOF. Is there some other way of doing this?

I would appreciate any suggestions on how to solve this problem.


*XML: sent when a change is detected
---
<SChange>
  <Service>Testing:service</Service>
  <Status>Critical</Status>
  <Mode>Normal</Mode>
</SChange>

Ruby
---------
UNIX_PROG = "/bin/xml_stream"

def ex_connect
  ios = IO.popen(UNIX_PROG,"w+")
  ios.sync = true

  line = ios.gets
  if line =~ /xml/
    puts "INF: Connected OK (XML)"
    ios.puts "<Events>"
    return ios
  else
    puts "ERR: Cannot connect"
    exit 1
  end
end

def ex_listener(ios)
  while (line = ios.gets)
    line.chomp!
    if line =~ /<\/Events>/
      puts "INF: END of program"
      exit 0
    end

    puts "INF: #{line} - #{line.size}"
    *parse_line_of_xml(line)*
  end
end

ios = ex_connect
ex_listener(ios) # Processes the XML stream
---------


Regards,
Luc Heinrich (Guest)
on 2008-12-11 14:08
(Received via mailing list)
On 11 déc. 08, at 11:26, Sebastian (syepes) wrote:

> The real problem is that the function: ex_listener, processes the XML
> "line by line" because i can't detect a EOF from the IO.popen and it
> will always be waiting (open) for the next change "xml stream".

Right, but since you control the parser state you know exactly when
and where the document starts and when and where it ends, so you
should be able to close the connection by yourself.
Sebastian YEPES (syepes)
on 2008-12-11 18:34
Luc Heinrich wrote:
> On 11 déc. 08, at 11:26, Sebastian (syepes) wrote:
>
>> The real problem is that the function: ex_listener, processes the XML
>> "line by line" because i can't detect a EOF from the IO.popen and it
>> will always be waiting (open) for the next change "xml stream".
>
> Right, but since you control the parser state you know exactly when
> and where the document starts and when and where it ends, so you
> should be able to close the connection by yourself.

OK, I get the point, but I don't see how to detect the EOF (without using
some ugly code) and pass the whole *XML to the parser.
Any examples, please?


Thanks for the help.
Luc Heinrich (Guest)
on 2008-12-11 18:38
(Received via mailing list)
On 11 déc. 08, at 18:27, Sebastian (syepes) wrote:

> OK, I get the point, but I don't see how to detect the EOF (without using
> some ugly code) and pass the whole *XML to the parser.

I'm still not exactly sure of your exact context, but you don't have
to detect the EOF, just parse and when you reach the end of the
document close the pipe yourself on your end.
Bob Hutchison (Guest)
on 2008-12-12 15:18
(Received via mailing list)
Hi,

On 11-Dec-08, at 12:31 PM, Luc Heinrich wrote:

> On 11 déc. 08, at 18:27, Sebastian (syepes) wrote:
>
>> OK, I get the point, but I don't see how to detect the EOF (without using
>> some ugly code) and pass the whole *XML to the parser.
>
> I'm still not exactly sure of your exact context, but you don't have
> to detect the EOF, just parse and when you reach the end of the
> document close the pipe yourself on your end.

Just for fun, I tried hacking something together using the pull parser
that I wrote. This pointed out one possible issue that is confusing,
I'll get to that in a second.

How to avoid waiting for an EOF? Count events. Crudely, if you
increment the count on a start element event, and decrement on an end
element, when the count goes to zero, you've got what you are looking
for. This means you are letting the pull parser read the input, you
don't do it for the parser.
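
The counting idea can be sketched with REXML's pull parser rather than the parser discussed in this post (so this is an assumption about how the approach carries over, and `read_one_document` is a hypothetical helper name). The depth goes up on each start element and down on each end element; when it returns to zero the document is complete and the loop stops, without ever asking the input for an EOF.

```ruby
require "stringio"
require "rexml/parsers/pullparser"

# Hypothetical helper: pull events from +io+ until the element that opened
# the document is closed again. Returns the element names seen on the way.
def read_one_document(io)
  parser = REXML::Parsers::PullParser.new(io)
  depth = 0
  names = []
  loop do
    event = parser.pull
    break if event.event_type == :end_document # parser reports end of input
    if event.start_element?
      depth += 1
      names << event[0]
    elsif event.end_element?
      depth -= 1
      break if depth.zero? # root element closed: the document is complete
    end
  end
  names
end

# Stand-in for the open pipe: one complete document followed by the start
# of the next one, which this call never needs to look at.
stream = StringIO.new(<<EOF)
<SChange>
  <Service>Testing:service</Service>
  <Status>Critical</Status>
  <Mode>Normal</Mode>
</SChange>
<SChange>
EOF

names = read_one_document(stream)
puts names.inspect
```

With a real pipe, the same call would return as soon as the closing `</SChange>` has arrived. Whether the next document should be read with the same parser or a fresh one is worth checking against the REXML version in use.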

The issue I mentioned... In my pull parser I'm assuming a file or
string input, not an IO stream. I take advantage of that by looking
ahead a bit. This isn't a problem unless you are using a stream. In my
parser's case, it is looking ahead to at least the end of the next
line (huge performance thing with files). The confusing effect is with
the stream input:

<SChange>
<Service>Testing:service</Service>
<Status>Critical</Status>
<Mode>Normal</Mode>
</SChange>
<SChange>
<Service>Testing:service</Service>
<Status>Critical</Status>
<Mode>Normal</Mode>
</SChange>

The close of the first SChange element isn't reported until the next
line is read, which happens to include the start of the next element.
This is a delayed effect that is maybe not the best for a stream
input. If you add a blank line between the events the problem goes
away (but it'll read the blank line before reporting which shouldn't
be a problem).

It is possible that this is affecting your testing.

Cheers,
Bob




----
Bob Hutchison
Recursive Design Inc.
http://www.recursive.ca/
weblog: http://www.recursive.ca/hutch
Mark Thomas (Guest)
on 2008-12-12 16:25
(Received via mailing list)
If I understand correctly, you want to keep an IO stream open, and
react to certain elements as they appear? That's a textbook SAX case,
not pull-parsing. Register a SAX handler for your SChange events, and
point your IO stream at it.

I'd use libxml-ruby, but REXML has a stream parser that is SAX-like.
You'd use it something like this (untested):

require "rexml/document"
require "rexml/streamlistener"
include REXML

class Handler
  include StreamListener
  def tag_start name, attrs
    if name=="SChange"
      #do something
      puts attrs
    end
  end
end

Document.parse_stream(your_io_stream, Handler.new)
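
A slightly fuller variant of the handler above, as a sketch: the sample document has no attributes, so the values of interest arrive through the `text` callback instead of `attrs`. The class name `SChangeHandler` and the idea of collecting each change into a hash are assumptions for illustration, and a `StringIO` stands in for the real stream.

```ruby
require "stringio"
require "rexml/document"
require "rexml/streamlistener"

# Hypothetical listener: collects the child elements of each <SChange>
# into a hash and stores it once the closing tag is seen.
class SChangeHandler
  include REXML::StreamListener
  attr_reader :changes

  def initialize
    @changes = []
    @current = nil # hash for the <SChange> being read, if any
    @field = nil   # name of the child element being read, if any
  end

  def tag_start(name, _attrs)
    if name == "SChange"
      @current = {}
    elsif @current
      @field = name
      @current[@field] = String.new
    end
  end

  def text(data)
    # Ignore whitespace between elements; keep text inside a child element.
    @current[@field] << data if @current && @field
  end

  def tag_end(name)
    if name == "SChange"
      @changes << @current # a complete change is available here
      @current = nil
    else
      @field = nil
    end
  end
end

xml = <<EOF
<SChange>
  <Service>Testing:service</Service>
  <Status>Critical</Status>
  <Mode>Normal</Mode>
</SChange>
EOF

handler = SChangeHandler.new
REXML::Document.parse_stream(StringIO.new(xml), handler)
puts handler.changes.inspect
```

In a long-running listener, the place to react to a change would be the `tag_end` branch for `SChange`, since that callback fires as soon as the closing tag has been read.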

-- Mark.