Recovering from invalid UTF-8 in the ruby-libxml parser

I am parsing XML streams with ruby-libxml using the XML::Reader class.
Several have invalid UTF-8 characters. I need a tutorial or at least some
hints on how to recover and continue the parsing.

TIA,
Jeffrey

Jeffrey L. Taylor wrote:

I am parsing XML streams with ruby-libxml using the XML::Reader class.
Several have invalid UTF-8 characters. I need a tutorial or at least some
hints on how to recover and continue the parsing.

Why not scrub them with Ruby’s built-in iconv first?

And what are they doing to you and ruby-libxml? I have found libxml2
suspiciously forgiving, so far…
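
For what the scrubbing suggestion could look like when the whole document fits in memory, here is a minimal sketch. It assumes a Ruby new enough to have String#scrub (2.1 or later) and swaps that in for iconv, which newer Rubies no longer bundle. The helper name scrub_utf8 is made up for illustration and is not part of ruby-libxml.

```ruby
# Hypothetical helper, not part of ruby-libxml: returns a copy of +text+
# that is valid UTF-8, with each invalid byte replaced by +replacement+.
def scrub_utf8(text, replacement = "?")
  text.dup.force_encoding(Encoding::UTF_8).scrub(replacement)
end

# 0xE9 is "é" in Latin-1; on its own it is not a valid UTF-8 sequence.
dirty = "<name>Jos\xE9</name>".b
puts scrub_utf8(dirty)
```

The cleaned string can then be handed to the parser as usual. This does nothing for files too large to hold in memory.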

Quoting P. [email protected]:

It throws an exception. It took a bunch of digging to find that line 835,
character 418 is truly not a UTF-8 character (octal 240, maybe a Latin-1
character?). I’d like to delete it or replace it with a question mark and
continue parsing. It is a rather large file, so I’d rather not read the
whole thing into memory to correct it. I suppose I could wrap the read
function in a cleanup function, but it is messy trying to keep UTF-8 state
across partial reads.

I was hoping for something better.

Jeffrey
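
The cleanup wrapper described above might be sketched as follows. The only state it keeps between reads is the tail of a chunk that could be the start of an unfinished multi-byte character (at most three bytes). The class name Utf8ChunkScrubber, its parameters, and the use of String#scrub (Ruby 2.1 or later) are assumptions made for illustration; none of this is ruby-libxml API. For reference, octal 240 is 0xA0, which is a no-break space in Latin-1 and a stray continuation byte in UTF-8.

```ruby
require "stringio"

# Hypothetical helper: reads an IO-like object in fixed-size chunks and
# yields valid UTF-8 strings, replacing each invalid byte with "?".
class Utf8ChunkScrubber
  def initialize(io, chunk_size = 64 * 1024, replacement = "?")
    @io = io
    @chunk_size = chunk_size
    @replacement = replacement
    @carry = "".b # bytes of a possibly unfinished trailing character
  end

  def each_chunk
    while (raw = @io.read(@chunk_size))
      data = @carry + raw.b
      keep = complete_prefix_length(data)
      @carry = data.byteslice(keep, data.bytesize - keep)
      yield scrub(data.byteslice(0, keep)) unless keep.zero?
    end
    # At end of input, whatever is still held back is a truncated character.
    yield scrub(@carry) unless @carry.empty?
    @carry = "".b
  end

  private

  def scrub(bytes)
    bytes.dup.force_encoding(Encoding::UTF_8).scrub(@replacement)
  end

  # Length of the leading part of +data+ that does not end in the middle
  # of a multi-byte sequence. Looks back at most three bytes for a lead byte.
  def complete_prefix_length(data)
    size = data.bytesize
    1.upto([3, size].min) do |back|
      byte = data.getbyte(size - back)
      next if byte & 0xC0 == 0x80 # continuation byte, keep looking back
      need = if byte >= 0xF0 then 4
             elsif byte >= 0xE0 then 3
             elsif byte >= 0xC0 then 2
             else 1
             end
      return need > back ? size - back : size
    end
    size
  end
end

# Usage: a tiny chunk size forces characters to straddle read boundaries.
io = StringIO.new("caf\xC3\xA9 a\xA0b \xE2\x82\xAC".b)
chunks = []
Utf8ChunkScrubber.new(io, 4).each_chunk { |chunk| chunks << chunk }
puts chunks.join
```

Feeding the cleaned chunks to XML::Reader is left out here, since how the reader pulls data from a custom IO-like object would need checking against the ruby-libxml documentation.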
