Hi, I’m having some encoding problems while parsing HTML with Nokogiri.
I was first getting errors on non-breaking space characters (code
160), but managed to resolve this by setting the encoding at the top
of my script file ('# coding: utf-8').
However, now I’m trying to do a simple string substitution with gsub()
and am getting this error:
    invalid byte sequence in UTF-8
An example of where this is bombing is the word “PROT\xC9G” as parsed
by Nokogiri. Removing the encoding setting from my script causes the
original problems, so I seem to be stuck.
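
For reference, here is a minimal sketch of what the string seems to look like. It assumes the page is really ISO-8859-1 (where byte 0xC9 is "É") and that the string is only labelled UTF-8; I have not confirmed that. The literal stands in for the value Nokogiri returned, and the names broken and fixed are just placeholders.

```ruby
# coding: utf-8

# Stand-in for the string Nokogiri returned (assumption: the real value
# comes from the parsed document and carries a UTF-8 label).
broken = "PROT\xC9G"

puts broken.encoding          # the label on the string
puts broken.valid_encoding?   # false: 0xC9 alone is not valid UTF-8

# Assumption: the bytes are really ISO-8859-1. Relabel them as such,
# then transcode to UTF-8 so regex-based methods like gsub can work.
fixed = broken.dup.force_encoding("ISO-8859-1").encode("UTF-8")

puts fixed.valid_encoding?
result = fixed.gsub(/G\z/, "g")
puts result
```

If that assumption holds, telling the parser the encoding up front, e.g. Nokogiri::HTML(html, nil, 'ISO-8859-1'), might avoid the problem at the source, but I have not verified that for this document.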
Has anybody worked through these issues successfully? Google turns up
a number of discussions without many solutions.