I never tested it (I had already coded a working chardet + Iconv
implementation).
From tidy’s documentation, it seems it would request an encoding, but
not force it, so buggy servers will still crash your code.
In fact, I had to use a begin Iconv.iconv('utf-8', 'utf-8') rescue …
end to make absolutely sure the results really are UTF-8 (you don’t want
badly encoded data trying to enter a database set to use UTF-8…)
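For anyone reading this on a newer Ruby: Iconv has not been bundled with Ruby since 2.0 (it lives on as a gem), so here is a minimal sketch of the same "make absolutely sure it is UTF-8" check using String#valid_encoding? and String#scrub (Ruby 2.1+). This swaps in a different technique from the Iconv round-trip above. ensure_utf8 is a hypothetical helper name, and replacing bad bytes with U+FFFD instead of raising is an assumption about what you would want.

```ruby
# Hypothetical helper: return a String that is guaranteed to be valid UTF-8.
# Assumption: invalid bytes are replaced rather than causing an exception.
def ensure_utf8(raw)
  # Work on a copy so the caller's string is left untouched.
  text = raw.dup.force_encoding(Encoding::UTF_8)
  return text if text.valid_encoding?

  # Replace each invalid byte sequence with U+FFFD (the replacement character).
  text.scrub("\uFFFD")
end
```

Running every string through something like this just before it goes to the database means a misbehaving server can no longer hand you bytes that the UTF-8 column rejects.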
Thanks for the responses, but could you elaborate a little, please?
At the moment I have:
Hpricot(open(uri))
(with a require 'open-uri' at the top)
What do I need to do, and where?
open(uri) gives you a String in an unknown encoding. Hpricot expects
UTF-8, so you must make sure that the String you get is converted to
UTF-8. To do so, use the Iconv library, but note that it expects you to
know which encoding the source is in. The chardet library can guess
the original encoding.
For the details, look up the documentation of chardet and Iconv. Iconv
is in the standard library; chardet is a separate download.
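A rough sketch of the conversion step described above, assuming you already have an encoding name from the detector. It uses String#encode as a stand-in for Iconv (a swapped-in technique, since Iconv is not bundled with current Rubies). to_utf8 is a hypothetical helper name, and the encoding names passed in are only examples of what a detector might return.

```ruby
# Hypothetical helper: convert raw bytes from a known (or guessed) encoding
# to UTF-8. `source_encoding` is the name reported by the detector,
# for example "ISO-8859-1" or "Windows-1252".
def to_utf8(raw, source_encoding)
  # Label the bytes with their real encoding first, then transcode.
  labelled = raw.dup.force_encoding(source_encoding)

  # Replace anything that cannot be converted instead of raising.
  labelled.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end
```

The converted String is what you would then hand to Hpricot. If the detector returns a name Ruby does not know, force_encoding raises ArgumentError, so you may want to rescue that and fall back to a default.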
open(uri) returns a file-like object (a Tempfile or StringIO) rather
than a String. After playing with various options for detecting the
encoding, I found that this object has a charset method, which returns
the encoding declared in the response (I think this is only on objects
returned by open-uri).
This was handy, as chardet seemed pretty crap at detecting the encoding
correctly. It was slightly better when I tried doing it a line at a
time, but for the whole file it just sucked. I still have a fallback
to chardet if the file object doesn’t respond to 'charset'.
I should note that I was using rchardet as I couldn’t get the chardet
gem to play ball at all.
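The "use charset, fall back to a detector" approach could look roughly like this. It is only a sketch: encoding_for is a hypothetical helper name, and the detector is passed in as a callable so that rchardet (or anything else) can be wrapped behind it without assuming a particular API.

```ruby
require "stringio"

# Hypothetical helper: pick an encoding name for a fetched document.
# `io` is the object open-uri returned, `body` is its content, and `detector`
# is any callable that guesses an encoding name from the raw bytes.
def encoding_for(io, body, detector)
  # Only objects returned by open-uri are expected to respond to charset.
  declared = io.charset if io.respond_to?(:charset)
  return declared unless declared.nil? || declared.empty?

  # Nothing declared: fall back to guessing from the content.
  detector.call(body)
end

# Usage sketch (needs network access, so it is left commented out):
#   require "open-uri"
#   io   = URI.open(uri)   # on older Rubies this was plain open(uri)
#   body = io.read
#   name = encoding_for(io, body, ->(bytes) { "ISO-8859-1" })  # stand-in detector
```

One caveat, as far as I understand the open-uri documentation: when an HTTP response has a text content type but no charset parameter, charset returns "iso-8859-1" as a default instead of nil, so a declared value is not always something the server actually said.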