Ruby, Unicode, and HTML Entities Problem

mrpeepers · September 26, 2010, 5:54pm

Hi all,

I’m using Ruby (and REXML) to parse a directory full of HTML files that
are mostly English with some Spanish mixed in.

For the most part, Ruby can correctly parse the HTML. That’s all fine
and dandy, BUT, when there’s a Unicode character NEAR an HTML tag,
the parser bombs with:

Missing end tag for ‘em’ (got “p”)
Or
Missing end tag for ‘em’ (got “em”)
Or
Missing end tag for ‘em’ (got “body”)
etc, etc., etc…

The errors all seem to depend on how/where the Unicode characters are
placed throughout the documents.

Here’s an example of what the parser does NOT like and throws errors
like what I posted above:
Ministerio Público de la Federación

But just for fun, I changed the string to be this:
Ministerio Público de la Federaciónnn

…and it loads/parses, no issue.

What I have noticed is that when there’s a Unicode character SO CLOSE to
a begin/end tag, the parser bombs. But if I move the Unicode character
around FURTHER inside the tags, the parser loads the data without error.

What’s going on?

I can’t really post all the code due to company privacy. But here’s a
list of the requires, if it helps at all:
require 'rubygems'
require 'sqlite3'
require 'rexml/document'
require 'rexml/streamlistener'
include REXML
require 'zlib'
require 'cgi'
require 'osx/cocoa'
include OSX

And here’s some of the parsing, if it helps at all. I know, I know,
probably not…:
@title << CGI::unescapeHTML(content.strip)
title.gsub('&amp;','&').gsub('&lt;','<').gsub('&gt;','>').gsub('&apos;',"'").gsub('&quot;','"').gsub('&sect;','§')

@body << " #{attr_name}=\"#{attr_value}\""
@body << "</#{name}>"
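
For what it’s worth, the entity handling above could probably be folded into one helper. This is only a sketch: `decode_entities` and the `EXTRA_ENTITIES` table are made-up names, and it assumes `&sect;` is the only named entity beyond the ones `CGI.unescapeHTML` already decodes (`&amp;`, `&lt;`, `&gt;`, `&quot;`, `&apos;` and numeric references).

```ruby
require 'cgi'

# Hypothetical table of named entities that CGI.unescapeHTML leaves alone.
# Only &sect; appears in the thread; extend it as needed.
EXTRA_ENTITIES = { '&sect;' => "\u00A7" }.freeze

# Hypothetical helper: decode the extra entities first, then the standard
# ones, so that "&amp;sect;" ends up as the literal text "&sect;".
def decode_entities(text)
  replaced = text.gsub(Regexp.union(EXTRA_ENTITIES.keys), EXTRA_ENTITIES)
  CGI.unescapeHTML(replaced)
end

puts decode_entities('&lt;em&gt; &amp; &sect;')
```

Doing the extra entities before the standard pass matters: the other way round, an escaped ampersand followed by "sect;" would be decoded twice.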

mrpeepers · September 26, 2010, 8:52pm

On 26.09.2010 17:54, Mr Peepers wrote:

> Here’s an example of what the parser does NOT like and throws errors
> like what I posted above:
> Ministerio Público de la Federación
>
> But just for fun, I changed the string to be this:
> Ministerio Público de la Federaciónnn
>
> …and it loads/parses, no issue.
>
> I can’t really post all the code due to company privacy.

No one wants your company data, but above you pasted a problematic
snippet. Can’t you create a minimal test case from it?

I was having similar problems with Nokogiri until I figured out that I
needed to set the encoding manually, because the automatic detection
failed for some reason. But it’s premature to suggest anything without a
smaller, real test case.
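
Setting the encoding manually can be tried with plain Ruby before any parser is involved. A minimal sketch, assuming the files are really ISO-8859-1 (nothing in the thread confirms that); `read_as_utf8` is a made-up helper name:

```ruby
require 'tempfile'

# Hypothetical helper: read a file whose bytes are in +source_encoding+
# and return its contents transcoded to UTF-8.
# The ISO-8859-1 default is an assumption about the data.
def read_as_utf8(path, source_encoding = 'ISO-8859-1')
  # "external:internal" tells Ruby to transcode while reading.
  File.read(path, encoding: "#{source_encoding}:UTF-8")
end

# Demo with a throwaway file containing the Latin-1 byte 0xF3 ("ó").
Tempfile.create('sample') do |f|
  f.binmode
  f.write("<em>Federaci\xF3n</em>".b)
  f.flush

  text = read_as_utf8(f.path)
  puts text.encoding
  puts text.valid_encoding?
end
```

The resulting string can then be handed to the parser instead of the raw File object.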

Markus

mrpeepers · September 26, 2010, 9:08pm

OK, you can get the Ruby script and a sample of the data I’m trying to
load here: www.khourys.com/Archive.rar.

Take 150340.html as just one example.

I’ll walk you through my troubleshooting quickly.

If I replace this text (where it fails):
Ministerio Público de la Federación

with this:
Ministerio Público de la Federaciónnn

FILE LOADS!

So I’m thinking, did this one o-acute get corrupted? Nope. But, if
you remove this sentence:
Whenever the Federal Public Ministry (Ministerio Público de la
Federación) investigates the activities of organized crime members
that deal with goods of illicit origin, the investigation is carried out
with the assistance of the Secretariat of Finance and Public Credit
(Secretaría de la Hacienda y Crédito Público).

FILE LOADS!

I replaced this:
<em>Ministerio Público de la Federación</em>
with this
Ministerio Público de la Federación
(removed the <em>’s)

FILE LOADS!

Now I’m thinking, hhhmmmm… (actually, WTF) Could the Unicode
character so close to the end HTML tag be causing the issue,
potentially?

To test my theory, I’ll modify these lines together in the file:

Before:
Ministerio Público de la Federación
Secretaría de la Hacienda y Crédito Público

After
Ministerio Público de la Federaciónnnn ← this to prove it’s
not JUST this line
Secretaría de la Hacienda y Crédito Pú ← this to prove that
when a Unicode character is so close to an HTML entity - we’re screwed.

FILE BOMBED!

Think we have the culprit?

mrpeepers · September 26, 2010, 10:11pm

Hi,

On 26.09.2010 21:08, Mr Peepers wrote:

> Ok, get the Ruby and a sample of the data attempting to be loaded here,
> www.khourys.com/Archive.rar.
>
> Regarding 150340.html, as just an example.

A 12 kB load.rb is not exactly a small test case. I was thinking more
along the lines of a simple script that shows a reproducible error.

Anyway, I simply fired up REXML to do the most basic thing, and it bombs
immediately. Disclaimer: I have never used REXML before:

$ ruby -rrexml/document -e 'REXML::Document.new File.new("150340.html")'
/home/mfischer/.rvm/rubies/ruby-1.9.2-p0/lib/ruby/1.9.1/rexml/parseexception.rb:31:in `gsub': invalid byte sequence in UTF-8 (ArgumentError)
from /home/mfischer/.rvm/rubies/ruby-1.9.2-p0/lib/ruby/1.9.1/rexml/parseexception.rb:31:in `to_s'
from /home/mfischer/.rvm/rubies/ruby-1.9.2-p0/lib/ruby/1.9.1/rexml/parsers/treeparser.rb:95:in `message'
from /home/mfischer/.rvm/rubies/ruby-1.9.2-p0/lib/ruby/1.9.1/rexml/parsers/treeparser.rb:95:in `rescue in parse'
from /home/mfischer/.rvm/rubies/ruby-1.9.2-p0/lib/ruby/1.9.1/rexml/parsers/treeparser.rb:20:in `parse'
from /home/mfischer/.rvm/rubies/ruby-1.9.2-p0/lib/ruby/1.9.1/rexml/document.rb:230:in `build'
from /home/mfischer/.rvm/rubies/ruby-1.9.2-p0/lib/ruby/1.9.1/rexml/document.rb:43:in `initialize'
from -e:1:in `new'
from -e:1:in `<main>'

Anyway, the same using Nokogiri works for me:

$ ruby -rnokogiri -e 'puts Nokogiri::HTML( File.open("150340.html") )/"//p/em[3]"'
Ministerio Público de la Federación

Does that help you? If not, you should provide your small rexml test
case which bombs.

HTH,

Markus

mrpeepers · September 26, 2010, 10:14pm

Not really… I think we figured out the problem. UTF-8 is a
variable-length encoding. We seem to have a byte in there that signals
that more bytes of the same character follow, but they don’t, so the
parser pukes. Somewhere along the line the encoding of the actual file
got messed up.

Need to close this topic.
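
For anyone landing here later, the diagnosis above can be checked at the byte level. This sketch assumes the file was saved as ISO-8859-1 (Latin-1) rather than UTF-8, which the thread never confirms, and `to_utf8` is a made-up helper name. In Latin-1, "ó" is the single byte 0xF3; read as UTF-8, 0xF3 announces a 4-byte character. A lenient decoder that swallows the next three bytes would eat the start of the closing tag, which would be consistent with the padded "nnn" version loading fine.

```ruby
# Hypothetical helper (not from the thread): return +bytes+ as a UTF-8
# string, falling back to a guessed legacy encoding when the bytes are
# not valid UTF-8. ISO-8859-1 as the fallback is an assumption.
def to_utf8(bytes, fallback_encoding = 'ISO-8859-1')
  utf8 = bytes.dup.force_encoding('UTF-8')
  return utf8 if utf8.valid_encoding?

  bytes.dup.force_encoding(fallback_encoding).encode('UTF-8')
end

# 0xF3 is "ó" in Latin-1. In UTF-8 it is a lead byte that must be
# followed by three continuation bytes (0x80..0xBF); "n</" are not.
raw = "<em>Federaci\xF3n</em>".b

puts raw.dup.force_encoding('UTF-8').valid_encoding?  # prints "false"
puts to_utf8(raw).valid_encoding?                     # prints "true"
puts to_utf8(raw)                                     # prints "<em>Federación</em>"
```

Running a check like `valid_encoding?` over every file in the directory before parsing would at least show which files need transcoding.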