XML parser; maybe ruby is too slow?

Hello folks.
I managed to write an SGML parser with the Hpricot library. As I
explained in a previous thread, I just need to compare source and
target tags of translation memory files from IBM Translation Manager.
The script now runs correctly, but I realised that it cannot cope
with large files; I tried to process a TM file larger than 1 MB and the
script took ages to generate the output. Should I switch to a compiled
language for this specific task?
At any rate, here is the script. It's very basic; please let me know
if I did something wrong or if its slowness is a necessary drawback of
Ruby being interpreted. Cheers,
Davide

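#!/usr/local/bin/ruby
require 'rubygems'
require 'hpricot'

# Term to search for in the source segments
$pattern = "server"

# Send everything written with puts to result.html
result = File.new("result.html", "w")
$stdout = result

# HTML header with the styles for source, target and highlighted text
puts "<!DOCTYPE HTML PUBLIC '-//W3C//DTD HTML 4.01//EN' 'http://www.w3.org/TR/html4/strict.dtd'>\n
<html>\n<head>\n<title>Search for '#{$pattern}'</title>\n<style type='text/css'>
body { }
p { margin: 0px; }
p.source { background: #FFFFCC; padding: 10px 5px 10px 5px; }
p.target { background: #F8A271; padding: 10px 5px 10px 5px; }
span.pattern { background: #B6B6B6; }
</style>\n</head>\n<body>\n"

# To read from standard input instead:
# doc = Hpricot.XML(STDIN)

doc = Hpricot.XML(File.open("bch01aad006_MEMORIA.EXP"))
doc.search("Source").each do |item|
  if item.innerHTML =~ /#{$pattern}/
    # Wrap every occurrence of the pattern in a highlighting span
    highlightedSource = item.innerHTML.gsub(/#{$pattern}/, "<span class='pattern'>#{$pattern}</span>")
    puts "<p class='source'>EN: #{highlightedSource}</p>\n"
    # The element following each Source holds the translation
    puts "<p class='target'>IT: #{item.next_sibling.html}</p>\n<br />\n"
  end
end
puts "</body>\n</html>"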

On 9/15/07, nutsmuggler wrote:

> if I did something wrong or if its slowness is a necessary drawback of
> ruby being interpreted. Cheers,
> Davide

Davide,

This is not necessarily a result of Ruby being slow. Hpricot is a DOM
parser and also (by default) tries to fix up tags. It parses the
entire file into memory and builds an internal tree structure out of it.
The alternative is to use a SAX-based or streaming parser. This is what
happened in a part of Merb. The streaming parser, REXML, was much faster
than Hpricot for the same job because it is, well, a streaming parser
instead of a DOM one.

It is my understanding that a streaming parser is best for large files,
so use one if you can.
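
To make the streaming idea concrete, here is a minimal sketch using REXML's stream listener interface. It assumes well-formed XML input, which REXML requires. The element name "Target", the wrapper elements "Memory" and "Segment", and the helper names SegmentListener and find_segments are illustrative assumptions, not taken from the real EXP format. Only "Source" comes from the script above.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'
require 'stringio'

# Collects [source, target] pairs whose source text matches the pattern.
# Nothing is kept in memory except the segment currently being read
# and the matches found so far.
class SegmentListener
  include REXML::StreamListener

  attr_reader :matches

  def initialize(pattern)
    @pattern = pattern
    @matches = []
    @current = nil          # name of the element being collected, if any
    @buffer = String.new
    @pending_source = nil   # last matching Source text awaiting its Target
  end

  def tag_start(name, _attrs)
    return unless %w[Source Target].include?(name)
    @current = name
    @buffer = String.new
  end

  def text(data)
    @buffer << data if @current
  end

  def tag_end(name)
    return unless name == @current
    if name == 'Source'
      @pending_source = @buffer =~ @pattern ? @buffer : nil
    elsif @pending_source
      @matches << [@pending_source, @buffer]
      @pending_source = nil
    end
    @current = nil
  end
end

# Hypothetical helper: streams the IO through the listener.
def find_segments(io, pattern)
  listener = SegmentListener.new(pattern)
  REXML::Document.parse_stream(io, listener)
  listener.matches
end

sample = <<~XML
  <Memory>
    <Segment><Source>Start the server</Source><Target>Avviare il server</Target></Segment>
    <Segment><Source>Close the file</Source><Target>Chiudere il file</Target></Segment>
  </Memory>
XML

find_segments(StringIO.new(sample), /server/).each do |source, target|
  puts "EN: #{source}"
  puts "IT: #{target}"
end
```

For a real file, the StringIO could be replaced with File.open on the path, so the parser reads from disk as it goes instead of building a tree first.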

Cheers
Daniel

On Sep 15, 7:49 am, nutsmuggler wrote:

> [...] if its slowness is a necessary drawback of
> ruby being interpreted. Cheers,

I haven’t done any comparison testing, but if your *.EXP files are
truly XML, Ruby libxml might be a better choice as it’s just a wrapper
around the libxml2 library (see http://libxml.rubyforge.org/).

Jeremy

On 16 Sep, 05:56, Jeremy wrote:

> [...] if your *.EXP files are
> truly XML, Ruby libxml might be a better choice as it's just a wrapper
> around the libxml2 library (see http://libxml.rubyforge.org/).
>
> Jeremy

The problem is that the EXP files are actually SGML; I could not parse them
with REXML precisely because they are not well-formed XML: they
contain unclosed tags, which are apparently valid in some SGML formats, but
not in XML. That is why I had to use Hpricot, which is less picky.
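
The difference can be illustrated with a small sketch that hands two snippets to REXML: one with every tag closed, and one where the tags are left open in the SGML style described above. The element names and the helper well_formed? are hypothetical, chosen only for the illustration; the real EXP markup may look different.

```ruby
require 'rexml/document'

# Hypothetical helper: true if REXML can parse the markup as XML.
def well_formed?(markup)
  REXML::Document.new(markup)
  true
rescue REXML::ParseException
  false
end

# Every opening tag has a matching closing tag.
xml_style = '<Segment><Source>Start the server</Source>' \
            '<Target>Avviare il server</Target></Segment>'

# Source and Target are never closed, as some SGML formats allow.
sgml_style = '<Segment><Source>Start the server' \
             '<Target>Avviare il server</Segment>'

puts "XML-style snippet parses:  #{well_formed?(xml_style)}"
puts "SGML-style snippet parses: #{well_formed?(sgml_style)}"
```

A strict XML parser rejects the second snippet outright, whereas a forgiving parser such as Hpricot tries to repair the structure, which is convenient here but adds to the work done per file.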
Cheers,
Davide
