Bypassing XML inconsistencies with REXML::StreamListener

Hello folks.
I am trying to build a simple XML parser to extract data from IBM
translation manager memories. Here is a sample of such a memory file:

0000000001 00012200000001178876638English(U.S.)ITALIANIBMIDDOCBB1CTMST. 000BB1CTmst.idd 0000000002 00000300000001178876638English(U.S.)ITALIANIBMIDDOCCONFIGUR. 000Configuration_PDSG.IDE Configuration information and guidelines Informazioni e istruzioni per la configurazione etc...

These memory files look a lot like XML, but I suspect they actually
conform to another standard. In fact, they often contain unclosed
tags. This happens because they store translation segments: when the
translation comes from a website or an SGML document, the original
HTML or SGML markup may be split across two or more segments. So I
often encounter faulty segments, and each unclosed tag makes REXML
raise an error.
My code is quite simple:

require 'rexml/document'
require 'rexml/streamlistener'
include REXML

class Listener
  include StreamListener
  $segment = ""
  $result = ""
  $is_there = false
  def tag_start(name, attributes)
    if name == "Source"
      $segment << "EN:"
    end
    if name == "Target"
      $segment << "IT:"
    end
  end
  def tag_end(name)
    if name == "Target"
      if $is_there
        $result << $segment
      end
      $segment = ""
      $is_there = false
    end
    if name == "NTMMemoryDb"
      puts $result
    end
  end
  def text(text)
    $segment << text
    if text =~ /blade/
      $is_there = true
    end
  end
end

listener = Listener.new
parser = Parsers::StreamParser.new(File.new("bch01aad006_MEMORIA.EXP"), listener)
parser.parse

I need to bypass mistakes, and tell StreamListener: “when you
encounter a faulty segment, don’t bother!”
How do I achieve this?
Thanks in advance,
Davide

nutsmuggler wrote:

00012200000001178876638English(U.S.)ITALIANIBMIDDOCBB1CTMST.

These memory files look a lot like XML, but I suspect they actually
conform to another standard. In fact, they often contain unclosed
tags. This happens because they store translation segments: when the
translation comes from a website or an SGML document, the original
HTML or SGML markup may be split across two or more segments. So I
often encounter faulty segments, and each unclosed tag makes REXML
raise an error.

It might be worth trying HTML Tidy in XML mode. I can’t remember off
the top of my head how it’ll react to missing close tags, but it’s worth
a shot…

nutsmuggler wrote:

Hello folks.
I am trying to build a simple XML parser to extract data from IBM
translation manager memories. Here is a sample of such a memory file:

I need to bypass mistakes, and tell StreamListener: “when you
encounter a faulty segment, don’t bother!”
How do I achieve this?

Don’t use an XML parser to handle non-XML?
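
One way to read that suggestion is to skip the XML parser entirely and pull the pairs out with a regular expression. The following is only a rough sketch: it assumes that every Source element is immediately followed by its Target element and that the Source and Target tags themselves are never nested or split. The sample data and the extract_pairs helper are made up for illustration, they are not taken from a real export file.

```ruby
# Matches one Source/Target pair. The content is captured non-greedily,
# so unclosed inline tags such as <b> inside a segment do not matter.
SEGMENT_PAIR = %r{<Source>(.*?)</Source>\s*<Target>(.*?)</Target>}m

# Hypothetical helper: returns [source, target] pairs whose source
# text contains the given keyword.
def extract_pairs(text, keyword)
  text.scan(SEGMENT_PAIR).select { |source, _target| source.include?(keyword) }
end

# Invented sample; the first pair contains an unclosed <b> tag.
sample = <<~EXP
  <Source>Insert the <b>blade</Source>
  <Target>Inserire il <b>blade</Target>
  <Source>Configuration information</Source>
  <Target>Informazioni sulla configurazione</Target>
EXP

extract_pairs(sample, "blade").each do |source, target|
  puts "EN: #{source}"
  puts "IT: #{target}"
end
```

Because nothing checks well-formedness here, the segment with the dangling tag is kept rather than rejected, which may or may not be what you want.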

Alternatively, have you tried the REXML pull parser? A bit more work in
that you have to explicitly pop items off the tag stack, but it may have
better options for recovering from bad markup.

However, the underlying parser may still barf in trying to segment the
source into tags and such.
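
If the parser does choke, one possible workaround is to cut the file into segments first and parse each one on its own, so that a parse error only costs you that one segment. This is a sketch under assumptions: it supposes each record is wrapped in a Segment element (the original post only confirms Source, Target and NTMMemoryDb), and PairListener and parse_segments are names invented for this example.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Collects the text found inside Source and Target elements.
class PairListener
  include REXML::StreamListener
  attr_reader :source, :target

  def initialize
    @source = +""
    @target = +""
    @current = nil
  end

  def tag_start(name, _attributes)
    @current = @source if name == "Source"
    @current = @target if name == "Target"
  end

  def tag_end(name)
    @current = nil if name == "Source" || name == "Target"
  end

  def text(text)
    @current << text if @current
  end
end

# Parses every <Segment>...</Segment> chunk separately (assumed layout).
# Returns the pairs that parsed cleanly and the number of skipped chunks.
def parse_segments(data)
  pairs = []
  skipped = 0
  data.scan(%r{<Segment>.*?</Segment>}m) do |chunk|
    listener = PairListener.new
    begin
      REXML::Parsers::StreamParser.new(chunk, listener).parse
      pairs << [listener.source, listener.target]
    rescue REXML::ParseException
      skipped += 1 # faulty segment: don't bother
    end
  end
  [pairs, skipped]
end

# Invented sample; the first segment has an unclosed <b> tag.
sample = <<~EXP
  <Segment>0000000001
  <Source>Insert the <b>blade</Source>
  <Target>Inserire il <b>blade</Target>
  </Segment>
  <Segment>0000000002
  <Source>Configuration information</Source>
  <Target>Informazioni sulla configurazione</Target>
  </Segment>
EXP

pairs, skipped = parse_segments(sample)
pairs.each { |source, target| puts "EN: #{source}", "IT: #{target}" }
puts "skipped #{skipped} faulty segment(s)"
```

Note that the faulty segments are dropped completely, so any translations they contain are lost; that is the price of staying with a strict XML parser.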

Also, I don’t know if Hpricot is happy with non-HTML, but it’s worth a
shot to see if it can read and “fix” the source before you pass it to
another parser. You’ll want to check that any modifications made to the
input do not change the essential semantics.

(Or perhaps you could just use Hpricot and extract data with XPath)


James B.

“In Ruby, no one cares who your parents were, all they care
about is if you know what you are talking about.”

  • Logan C.

Hpricot is my man :)
Being an HTML parser, it’s much easier to please.
Here is the basic code I am using:

require 'rubygems'
require 'hpricot'

doc = Hpricot.XML(open("bch01aad006_MEMORIA.EXP"))
doc.search("Source").each do |item|
  if item.innerHTML =~ /firmware/
    puts "EN: #{item}"
    puts "IT: #{item.next_sibling}"
  end
end

The principle is quite simple, and the code is much more concise than
the REXML solution.
Thanks a million for the tip.
Davide