How to detect string charset

overclokkato · February 25, 2008, 2:26pm

Hi list,
I ran a deep search through this group and other online resources, but
I have been unable to find out whether there is a way to guess the
charset of a string in Ruby 1.8.6.

I need to ensure a string is always UTF-8 encoded, but Iconv requires
the developer to specify both the input and the output charset.
On the other hand, Kconv provides a #guess() method, but it doesn’t
support Latin or Western encodings.

Any suggestions?

overclokkato · February 25, 2008, 2:33pm

On Feb 25, 2008, at 14:25, Simone C. wrote:

> Hi list,
> I ran a deep search through this group and other online resources, but
> I have been unable to find out whether there is a way to guess the
> charset of a string in Ruby 1.8.6.
>
> I need to ensure a string is always UTF-8 encoded, but Iconv requires
> the developer to specify both the input and the output charset.
> On the other hand, Kconv provides a #guess() method, but it doesn’t
> support Latin or Western encodings.

The best way is to be aware of the charset at every point where data
enters or leaves the program, and do the necessary housekeeping.

If that’s not possible, for example when working on arbitrary text
files, the best approximation that I am aware of in Ruby is the
charguess library.

– fxn

overclokkato · February 25, 2008, 2:36pm

On Mon, Feb 25, 2008 at 8:25 AM, Simone C. wrote:

> I ran a deep search through this group and other online resources, but
> I have been unable to find out whether there is a way to guess the
> charset of a string in Ruby 1.8.6.
>
> I need to ensure a string is always UTF-8 encoded, but Iconv requires
> the developer to specify both the input and the output charset.
> On the other hand, Kconv provides a #guess() method, but it doesn’t
> support Latin or Western encodings.
>
> Any suggestions?

Kconv can guess because the encodings for the Asian written languages
are distinctive (they don’t share much with the Latin character set).
What you want is nearly impossible without a large body of text for
analysis, and even then the best commercial programs are taking stabs
at probabilities. (Here’s an example: how do you tell the difference
between ISO-8859-1 and ISO-8859-15 programmatically? IIRC, they differ
in only a handful of positions, the best known being the Euro symbol,
which -15 puts where -1 has a different symbol.)
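
To make the ISO-8859-1 versus ISO-8859-15 point concrete, here is a minimal sketch. It assumes a modern Ruby (2.0 or later) and uses String#encode rather than the Iconv of Ruby 1.8.6. Byte 0xA4 is one of the few positions where the two charsets disagree, and nothing in the byte itself says which reading is the intended one.

```ruby
# One raw byte, tagged as binary so Ruby makes no assumption about it.
raw = "\xA4".b

# Interpret the very same byte under each charset, then convert to UTF-8.
as_latin1 = raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")
as_latin9 = raw.dup.force_encoding("ISO-8859-15").encode("UTF-8")

puts as_latin1  # generic currency sign (U+00A4)
puts as_latin9  # euro sign (U+20AC)
```

Both conversions succeed without any error, which is why a converter alone cannot tell you that you picked the wrong source charset.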

You’re better off seeking a slightly different approach.

-austin

overclokkato · February 25, 2008, 3:32pm

Simone C. wrote:

> If I’m right, both ISO-8859-1 and ISO-8859-15 belong to Latin1, thus I
> can convert them in the same way using Iconv.iconv('UTF-8', 'LATIN1',
> 'a string').join.

You’ll probably lose the € (euro) sign from ISO-8859-15 sources, as
LATIN1 is probably equivalent to ISO-8859-1.

> My goal is not to be able to detect every single charset but
> to convert all strings from an input into UTF-8.

In fact, it’s the same problem: if you don’t know the original charset,
you can’t convert properly to UTF-8.

> In the meantime I was reading the code of rFeedParser, the Ruby
> implementation of Python FeedParser.
> I just discovered it depends on a project called https://rubyforge.org/projects/rchardet/
>
> I gave it a look and it seems to do exactly what I was looking for.
>
> Is anyone using this library?

I use chardet 0.9.0. I believe the two work more or less the same.

I use it as a fallback mechanism when I can’t reliably get the original
charset from feeds. Some feeds claim to be UTF-8 encoded but contain
invalid code points (your database isn’t happy when you try to feed it
something like that…). This becomes a mess when you find out that each
item in a feed may use a different charset, because people aggregate
different sources without checking their charsets themselves…

The behavior I’m using is:
1/ Try the advertised charset with Iconv('utf-8', charset), even if
   charset =~ /^utf-?8$/i
   succeeds? → END
   fails? (Exception) → continue
2/ Use chardet to guess the charset,
3/ Iconv('utf-8', chardet_charset).
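
The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: it targets a modern Ruby (2.0 or later) and uses String#encode in place of Iconv, and `guesser` is a hypothetical callable standing in for a chardet-style detector. Its interface is made up for the example and is not the API of chardet or rchardet.

```ruby
# Convert raw bytes from the named charset to UTF-8, raising on bad input.
def convert(bytes, charset)
  tagged = bytes.dup.force_encoding(charset)
  # String#encode does nothing for UTF-8 -> UTF-8, so validate explicitly.
  # This covers the "even if the charset is already UTF-8" part of step 1.
  raise ArgumentError, "invalid #{charset} data" unless tagged.valid_encoding?
  tagged.encode("UTF-8")
end

# Step 1: trust the advertised charset. Steps 2 and 3: fall back to a guess.
# guesser is a hypothetical callable: it takes bytes, returns a charset name.
def to_utf8(bytes, advertised, guesser)
  convert(bytes, advertised)
rescue ArgumentError, EncodingError
  convert(bytes, guesser.call(bytes))
end

# Stand-in detector that always answers ISO-8859-1 (illustration only).
guess_latin1 = ->(_bytes) { "ISO-8859-1" }

# Latin-1 bytes wrongly advertised as UTF-8: step 1 fails, the guess is used.
mislabeled = "caf\xE9".b
puts to_utf8(mislabeled, "UTF-8", guess_latin1)
```

If the guessed charset is also wrong or unknown, the second `convert` raises, so the caller still has to decide what to do with data nobody can identify.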

Good luck, you’re in for a lot of pain…

Lionel

overclokkato · February 25, 2008, 9:10pm

> I use it as a fallback mechanism when I can’t reliably get the original
> charset from feeds.

That’s a great example, thank you.
Unfortunately I don’t have a real charset header to check. I must
rely only on the input string.

On Feb 25, 3:32 pm, Lionel B. wrote:

> Good luck, you’re in for a lot of pain…
>
> Lionel

Thanks, Lionel!

overclokkato · February 25, 2008, 2:45pm

On Feb 25, 2:35 pm, Austin Z. wrote:

>> Any suggestions?
> You’re better off seeking a slightly different approach.
>
> -austin

If I’m right, both ISO-8859-1 and ISO-8859-15 belong to Latin1, thus I
can convert them in the same way using Iconv.iconv('UTF-8', 'LATIN1',
'a string').join.

My goal is not to be able to detect every single charset but
to convert all strings from an input into UTF-8.

In the meantime I was reading the code of rFeedParser, the Ruby
implementation of Python FeedParser.
I just discovered it depends on a project called
https://rubyforge.org/projects/rchardet/

I gave it a look and it seems to do exactly what I was looking for.

Is anyone using this library?

overclokkato · February 26, 2008, 3:37am

On 25/02/2008, Simone C. wrote:

>> I use it as a fallback mechanism when I can’t reliably get the original
>> charset from feeds.
>
> That’s a great example, thank you.
> Unfortunately I don’t have a real charset header to check. I must
> rely only on the input string.

You can ask a crystal ball as well.

The multibyte encodings can often be distinguished by their structure:
UTF-8, perhaps UTF-16, the Asian encodings. If something passes for
a valid string in a multibyte encoding, it very likely is a string in
that encoding.
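
As a rough illustration of the structural check for UTF-8, assuming a modern Ruby (2.0 or later, not the 1.8.6 from the original question), one can tag the bytes as UTF-8 and ask whether they are well formed. The helper name `looks_like_utf8?` is made up for this sketch.

```ruby
# True when the raw bytes form a well-formed UTF-8 sequence.
def looks_like_utf8?(bytes)
  bytes.dup.force_encoding("UTF-8").valid_encoding?
end

puts looks_like_utf8?("na\xC3\xAFve".b)  # the UTF-8 bytes for "naive" with a diaeresis
puts looks_like_utf8?("na\xEFve".b)      # the same word as ISO-8859-1 bytes
```

Plain 7-bit ASCII also passes this check, which is harmless because ASCII text is already valid UTF-8.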

However, the Latin 8-bit encodings are all the same: 7-bit ASCII with
some mess attached in the upper 128 characters. By converting from any
of these you get perfectly valid UTF-8 but different gibberish each
time. You can sometimes tell the ISO variant from the Windows variant
because some control characters are at different positions, and these
should not appear in text. But that does not help you at all: you
still don’t know which of the Latin encodings you got.
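
A small sketch of both observations, assuming a modern Ruby (2.0 or later) with String#encode. The helper names `decode` and `c1_bytes?` are invented for the example. The first part shows one byte string converting cleanly under two Latin charsets with different results; the second checks for bytes in the 0x80..0x9F range, which are control codes in the ISO-8859 family but mostly printable characters in Windows-1252.

```ruby
# Decode raw bytes under a named single-byte charset and return UTF-8 text.
def decode(bytes, charset)
  bytes.dup.force_encoding(charset).encode("UTF-8")
end

# True if any byte falls in 0x80..0x9F. Such bytes hint at a Windows code
# page, because in ISO-8859-x they would be control characters.
def c1_bytes?(bytes)
  bytes.each_byte.any? { |b| b.between?(0x80, 0x9F) }
end

raw = "t\xF5ke".b
%w[ISO-8859-1 ISO-8859-2].each do |name|
  # Both conversions succeed, yet byte 0xF5 becomes a different letter.
  puts "#{name}: #{decode(raw, name)}"
end

puts c1_bytes?("\x93quoted\x94".b)  # Windows-1252 curly quotes live here
puts c1_bytes?(raw)
```

As the post says, this only separates the ISO family from the Windows family at best; it does not say which Latin charset within the family was used.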

If you know the language (and it’s one of the few supported) you can
use enca. If the language is not supported you can build the filter
yourself. Basically, you collect the set of accented characters in
your language (those encoded with the 8th bit set) and encode them in
the different candidate encodings (the DOS and Windows code pages, the
ISO encoding, any other legacy encodings). You get sets of bytes that
usually overlap but contain some unique bytes. When you see one of
those unique bytes, you know which encoding to use.
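
The filter described above might look something like this. It is a sketch under assumptions not found in the thread: a modern Ruby (2.0 or later), Polish lowercase letters as the example language, and only two candidate encodings (ISO-8859-2 and Windows-1250). The names `letters`, `candidates`, and `guess_encoding` are made up for illustration.

```ruby
# Accented letters of the language, written as escapes so the file stays
# ASCII: a-ogonek, c-acute, e-ogonek, l-stroke, n-acute, o-acute, s-acute,
# z-acute, z-dot (Polish, chosen only as an example).
letters    = "\u0105\u0107\u0119\u0142\u0144\u00F3\u015B\u017A\u017C"
candidates = %w[ISO-8859-2 Windows-1250]

# For each candidate encoding, the high bytes its letters produce.
byte_sets = candidates.map do |name|
  [name, letters.encode(name).bytes.select { |b| b >= 0x80 }.uniq]
end.to_h

# Keep only the bytes that no other candidate encoding uses.
unique = byte_sets.map do |name, bytes|
  others = byte_sets.reject { |n, _| n == name }.values.flatten
  [name, bytes - others]
end.to_h

# Return the first encoding whose unique bytes occur in raw, or nil.
def guess_encoding(raw, unique)
  hit = unique.find { |_, bytes| raw.each_byte.any? { |b| bytes.include?(b) } }
  hit && hit.first
end

sample = "g\u0119\u015Bl\u0105".encode("Windows-1250").b
puts guess_encoding(sample, unique)
```

A text that only contains bytes shared by all candidates yields nil, so the result is a hint that improves with longer input rather than a guarantee.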

Good luck

Michal