Simone C. wrote:
If I’m right, both ISO-8859-1 and ISO-8859-15 belong to Latin1, so I
can convert them in the same way using Iconv.iconv('UTF-8', 'LATIN1',
You’ll probably lose the € (euro) sign from ISO-8859-15 sources, as
LATIN1 is an alias for ISO-8859-1.
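The difference is concrete: ISO-8859-15 maps € to byte 0xA4, the slot ISO-8859-1 uses for ¤. A quick check in modern Ruby (Iconv was removed in Ruby 2.0; String#encode is its replacement here):

```ruby
euro = "€"

# ISO-8859-15 has a euro sign, at byte 0xA4 (164).
latin15 = euro.encode("ISO-8859-15")
latin15.bytes  # => [164]

# Read those bytes back through the wrong charset and the euro is lost:
# 0xA4 in ISO-8859-1 is the generic currency sign.
latin15.force_encoding("ISO-8859-1").encode("UTF-8")  # => "¤"

# ISO-8859-1 simply has no euro sign at all, so encoding to it fails.
begin
  euro.encode("ISO-8859-1")
rescue Encoding::UndefinedConversionError
  # expected
end
```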
My goal is not to detect each individual charset but to convert all
strings from an input into UTF-8.
In fact it’s the same problem: if you don’t know the original charset,
you can’t convert properly to UTF-8.
In the meantime I was reading the code of rFeedParser, the Ruby
implementation of Python FeedParser.
I just discovered it depends on a project called https://rubyforge.org/projects/rchardet/
I gave it a look and it seems to do exactly what I was looking for.
Is anyone using this library?
I use chardet 0.9.0. I believe they work more or less the same.
I use it as a fallback mechanism when I can’t reliably get the original
charset from feeds. Some feeds claim to be UTF-8 encoded but contain
invalid byte sequences (your database isn’t happy when you try to feed
it something like that…). It becomes a real mess when you find out that
each item in a feed may use a different charset, because people
aggregate different sources without checking their charsets themselves…
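The “claims UTF-8 but contains invalid bytes” case is easy to reproduce. In modern Ruby, String#valid_encoding? gives the same check that a utf-8-to-utf-8 Iconv pass did by raising:

```ruby
# A Latin-1 byte (0xE9, "é") smuggled into a string labelled UTF-8,
# as happens when a feed advertises UTF-8 but an item isn't.
claimed_utf8 = "caf\xE9".force_encoding("UTF-8")

claimed_utf8.valid_encoding?  # => false: the label lies
"café".valid_encoding?        # => true
```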
The behavior I’m using is:
1/ Try the advertised charset with Iconv.iconv('utf-8', charset, str), even if
charset =~ /^utf-?8$/i
succeeds? -> END
fails? (Exception) -> continue
2/ Use chardet to guess the charset,
3/ Iconv.iconv('utf-8', chardet_charset, str).
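Those three steps can be sketched in modern Ruby (Iconv is gone since Ruby 2.0, so this uses String#encode; the detector argument is a stand-in for whatever chardet/rchardet returns, not part of the original post):

```ruby
# Convert raw bytes to UTF-8, trusting the advertised charset first and
# falling back to a guessed charset when that fails.
# `detector` is any callable returning a charset name, e.g. with the
# rchardet gem something like ->(s) { CharDet.detect(s)['encoding'] }.
def to_utf8(raw, advertised, detector)
  begin
    candidate = raw.dup.force_encoding(advertised)
    # Step 1: validate even when the feed claims UTF-8.
    return candidate.encode("UTF-8") if candidate.valid_encoding?
  rescue EncodingError, ArgumentError
    # Unknown or bogus charset name: fall through to detection.
  end
  guessed = detector.call(raw)                     # step 2: guess
  raw.dup.force_encoding(guessed).encode("UTF-8")  # step 3: convert
end

# Hypothetical detector for illustration only.
latin1_detector = ->(_bytes) { "ISO-8859-1" }
to_utf8("caf\xE9".b, "UTF-8", latin1_detector)  # => "café"
```

The valid_encoding? check is what makes step 1 meaningful for feeds that advertise UTF-8 with invalid bytes: a same-charset encode alone would pass them through untouched.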
Good luck, you’re in for a lot of pain…