Recovering in Ruby-libxml parser from invalid UTF8 code

Jeffrey L. Taylor (Guest)
on 2009-04-03 03:47
(Received via mailing list)
I am parsing XML streams with ruby-libxml using the XML::Reader class.
Several have invalid UTF-8 characters.  I need a tutorial or at least
some hints on how to recover and continue the parsing.

TIA,
  Jeffrey
Phlip (Guest)
on 2009-04-03 04:14
(Received via mailing list)
Jeffrey L. Taylor wrote:
> I am parsing XML streams with ruby-libxml using the XML::Reader class.
> Several have invalid UTF-8 characters.  I need a tutorial or at least some
> hints on how to recover and continue the parsing.

Why not scrub them with Ruby's built-in iconv first?

And what are they doing to you and ruby-libxml? I have found libxml2
suspiciously forgiving, so far...
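
For readers on a current Ruby, a minimal sketch of the "scrub first" idea follows. It makes one substitution that should be stated plainly: iconv is no longer bundled with Ruby (it left the standard library in 2.0), so the sketch uses String#scrub (Ruby 2.1+) instead. The helper name scrub_utf8 is made up for illustration, and the sketch assumes the whole input fits in memory as one string.

```ruby
# Sketch only: replace every invalid UTF-8 byte sequence with "?".
# scrub_utf8 is a hypothetical helper name, not part of ruby-libxml.
def scrub_utf8(raw, replacement = "?")
  # dup so the caller's string is left alone, then reinterpret the bytes
  # as UTF-8 and let String#scrub replace whatever does not decode.
  raw.dup.force_encoding(Encoding::UTF_8).scrub(replacement)
end

# "\xA0" is a Latin-1 non-breaking space; on its own it is not valid UTF-8.
dirty = "caf\xC3\xA9 \xA0 ok".b
clean = scrub_utf8(dirty)
puts clean.valid_encoding?
```

Valid multi-byte characters pass through untouched; only the bytes that do not decode are replaced.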
Jeffrey L. Taylor (Guest)
on 2009-04-03 04:26
(Received via mailing list)
Quoting Phlip <phlip2005@gmail.com>:
>
It throws an exception.  It took a bunch of digging to find that line 835,
character 418 is truly not a UTF-8 character (octal 240, maybe a Latin-1
character?).  I'd like to delete it or replace it with a question mark and
continue parsing.  It is a rather large file, so I'd rather not read the
whole thing into memory to correct it.  I suppose I could wrap the read
function in a cleanup function, but it is messy trying to keep UTF-8 state
across partial reads.

I was hoping for something better.

Jeffrey
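
The "wrap the read function in a cleanup function" idea could look roughly like the sketch below. It is not a ruby-libxml feature, and the names each_scrubbed_chunk and incomplete_tail_length are invented for illustration. It assumes Ruby 2.1+ for String#scrub. The only state kept between reads is the (at most three) trailing bytes that might be the start of a multi-byte character split across two chunks.

```ruby
require "stringio"

# How many bytes at the end of +bytes+ form the start of a UTF-8 sequence
# that is not yet complete (0 if the buffer ends on a character boundary,
# or ends in bytes that can never become valid).
def incomplete_tail_length(bytes)
  n = bytes.bytesize
  1.upto([3, n].min) do |back|
    b = bytes.getbyte(n - back)
    next if b & 0xC0 == 0x80            # continuation byte, look further back
    need = case b                       # total length promised by the lead byte
           when 0xC2..0xDF then 2
           when 0xE0..0xEF then 3
           when 0xF0..0xF4 then 4
           else 1                       # ASCII or a byte that is never a valid lead
           end
    return need > back ? back : 0
  end
  0
end

# Read +io+ in fixed-size binary chunks and yield each chunk as valid UTF-8,
# with invalid bytes replaced. A possibly split character is carried over
# to the next read instead of being scrubbed too early.
def each_scrubbed_chunk(io, chunk_size: 64 * 1024, replacement: "?")
  unless block_given?
    return enum_for(__method__, io, chunk_size: chunk_size, replacement: replacement)
  end

  carry = "".b
  while (chunk = io.read(chunk_size))
    buf   = carry + chunk.b
    tail  = incomplete_tail_length(buf)
    carry = buf.byteslice(buf.bytesize - tail, tail)
    body  = buf.byteslice(0, buf.bytesize - tail)
    yield body.force_encoding(Encoding::UTF_8).scrub(replacement) unless body.empty?
  end
  # Whatever is still carried at end of input is a truncated character.
  yield carry.force_encoding(Encoding::UTF_8).scrub(replacement) unless carry.empty?
end

# Usage: a deliberately tiny chunk size forces characters to straddle reads.
source = StringIO.new("h\u00E9llo ".b + "\xA0".b + " w\u20ACrld".b)
puts each_scrubbed_chunk(source, chunk_size: 2).to_a.join
```

Memory use stays bounded by the chunk size. Handing the cleaned chunks to XML::Reader is left open here; since the reader wants an IO-like source, some adapter around this loop would still be needed.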