Hello folks.
I am trying to build a simple XML parser to extract data from IBM
Translation Manager memories. Here is a sample of such a memory file:
0000000001
00012200000001178876638English(U.S.)ITALIANIBMIDDOCBB1CTMST.
000BB1CTmst.idd
0000000002
00000300000001178876638English(U.S.)ITALIANIBMIDDOCCONFIGUR.
000Configuration_PDSG.IDE
Configuration information and guidelines
Informazioni e istruzioni per la configurazione
etc...
These memory files are quite similar to XML files, but I suspect they
actually conform to another standard. In fact, they often include
unclosed (“opened”) tags. This is because they store translation
segments: when the translation comes from a website or an SGML
document, the original HTML or SGML may be split into two or more
parts. So I often encounter faulty segments, and the unclosed tags make
REXML raise an error.
My code is quite simple:
require 'rexml/document'
require 'rexml/streamlistener'
include REXML

class Listener
  include StreamListener

  $segment = ""
  $result = ""
  $is_there = false

  def tag_start(name, attributes)
    if name == "Source"
      $segment << "EN:"
    end
    if name == "Target"
      $segment << "IT:"
    end
  end

  def tag_end(name)
    if name == "Target"
      if $is_there
        $result << $segment
      end
      $segment = ""
      $is_there = false
    end
    if name == "NTMMemoryDb"
      puts $result
    end
  end

  def text(text)
    $segment << text
    if text =~ /blade/
      $is_there = true
    end
  end
end

listener = Listener.new
parser = Parsers::StreamParser.new(File.new("bch01aad006_MEMORIA.EXP"), listener)
parser.parse
I need to bypass mistakes, and tell StreamListener: “when you
encounter a faulty segment, don’t bother!”
How do I achieve this?
Thanks in advance,
Davide
nutsmuggler wrote:
> Hello folks.
> I am trying to build a simple XML parser to extract data from IBM
> Translation Manager memories. Here is a sample of such a memory file:
> …
> I need to bypass mistakes, and tell StreamListener: “when you
> encounter a faulty segment, don’t bother!”
> How do I achieve this?
Don’t use an XML parser to handle non-XML?
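To illustrate the "no XML parser" route, here is a minimal sketch that pulls Source/Target pairs out with a regular expression, so stray unclosed tags inside a segment are simply carried along as text. The element names Source, Target and NTMMemoryDb come from the code in the original post; the sample data, the constant names and the extract_pairs helper are made up for illustration, and the real export format may well differ.

```ruby
# Hypothetical sample: assumes each segment wraps its text in
# <Source>...</Source> followed by <Target>...</Target>.
SAMPLE = <<~DATA
  <NTMMemoryDb>
  <Source>Configuration information and guidelines</Source>
  <Target>Informazioni e istruzioni per la configurazione</Target>
  <Source>Insert the <b>blade</Source>
  <Target>Inserire il <b>blade</Target>
  </NTMMemoryDb>
DATA

# Non-greedy match of one Source/Target pair; /m lets "." span newlines.
PAIR = %r{<Source>(.*?)</Source>\s*<Target>(.*?)</Target>}m

# Returns [source, target] pairs, optionally only those containing keyword.
def extract_pairs(text, keyword = nil)
  pairs = text.scan(PAIR)
  return pairs unless keyword
  pairs.select { |source, target| source.include?(keyword) || target.include?(keyword) }
end

extract_pairs(SAMPLE, "blade").each do |source, target|
  puts "EN: #{source}"
  puts "IT: #{target}"
end
```

The unclosed <b> in the second segment never reaches a parser, so nothing raises; the trade-off is that the regex knows nothing about nesting, entities or attributes.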
Alternatively, have you tried the REXML pull parser? A bit more work in
that you have to explicitly pop items off the tag stack, but it may have
better options for recovering from bad markup.
However, the underlying parser may still barf in trying to segment the
source into tags and such.
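A rough sketch of the pull-parser idea, with the caveat above in mind: it collects pairs event by event and rescues REXML::ParseException, so at least everything read before the first bad segment is kept. Whether parsing can sensibly resume after the error is not something this sketch attempts. The pull_pairs helper and the sample data are hypothetical; only the element names come from the original post.

```ruby
require "rexml/parsers/pullparser"

# Hypothetical input: the second segment contains an unclosed <b> tag.
BROKEN = <<~XML
  <NTMMemoryDb>
  <Source>Configuration information</Source>
  <Target>Informazioni per la configurazione</Target>
  <Source>Insert the <b>blade</Source>
  <Target>Inserire il <b>blade</Target>
  </NTMMemoryDb>
XML

# Collects [source, target] pairs until the parser gives up.
# Returns the pairs gathered so far and the error (nil if none occurred).
def pull_pairs(xml)
  parser  = REXML::Parsers::PullParser.new(xml)
  pairs   = []
  buffer  = {}
  current = nil # name of the element whose text is being collected
  error   = nil

  begin
    while parser.has_next?
      event = parser.pull
      if event.start_element? && %w[Source Target].include?(event[0])
        current = event[0]
        buffer[current] = +""
      elsif event.text? && current
        buffer[current] << event[0]
      elsif event.end_element? && event[0] == current
        pairs << [buffer["Source"], buffer["Target"]] if current == "Target"
        current = nil
      end
    end
  rescue REXML::ParseException => e
    error = e # mismatched or unclosed tag: stop, keep what we have
  end

  [pairs, error]
end

pairs, error = pull_pairs(BROKEN)
pairs.each { |source, target| puts "EN: #{source}", "IT: #{target}" }
warn "stopped early: #{error.message.lines.first}" if error
```

This only salvages the segments before the first fault, which is why cleaning the input first, or using a more forgiving parser, may be the more practical route.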
Also, I don’t know if Hpricot is happy with non-HTML, but it’s worth a
shot to see if it can read and “fix” the source before you pass it to
another parser. You’ll want to check that any modifications made to the
input do not change the essential semantics.
(Or perhaps you could just use Hpricot and extract data with XPath.)
–
James B.
“In Ruby, no one cares who your parents were, all they care
about is if you know what you are talking about.”
Hpricot is my man.
Being an HTML parser, it’s much easier to please.
Here is the basic code I am using:
require 'rubygems'
require 'hpricot'

doc = Hpricot.XML(open("bch01aad006_MEMORIA.EXP"))
doc.search("Source").each do |item|
  if item.innerHTML =~ /firmware/
    puts "EN: #{item}"
    puts "IT: #{item.next_sibling}"
  end
end
The principle is quite simple, and the code is much more concise than
the REXML solution.
Thanks a million for the tip.
Davide