Charset detection

Miquel_O · November 8, 2006, 10:52pm

Hi

I’m writing planet software in Ruby, similar to Planet Planet, but it
inserts all RSS data into a database (MySQL for now) and shows the
entries from the database.

I don’t know how to detect the charset of an RSS feed. Can you help me?

Thanks in advance

Kind regards

–

Miquel (a.k.a. Ton)
Linux User #286784
GPG Key : 4D91EF7F
Debian GNU/Linux (Linux Wolverine 2.6.14)

Welcome to the jungle, we got fun and games
Guns n’ Roses


Miquel_O · November 9, 2006, 1:40pm

Miquel O. wrote:

Hi

I’m writing planet software in Ruby, similar to Planet Planet, but it
inserts all RSS data into a database (MySQL for now) and shows the
entries from the database.

I don’t know how to detect the charset of an RSS feed. Can you help me?

Look for the XML header? That one should list encoding.

If it doesn’t, bitch, whine, and moan at the feed author until it does.
Charset detection is unavoidably a hack and shouldn’t be necessary by
now if interoperating apps are coded sanely. (And if someone doesn’t code
an RSS feed provider with interop in mind, there is no God anymore.)
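
For what it's worth, reading the encoding out of the XML declaration can be sketched roughly like this. The helper name xml_declared_encoding is made up for illustration, and the sketch assumes an ASCII-compatible feed (it will not find the declaration in a UTF-16 document).

```ruby
# Hypothetical helper (not from any library): returns the encoding named in
# the XML declaration, or nil when there is no declaration or it names none.
# Assumption: the feed is in an ASCII-compatible encoding, so the declaration
# itself can be read byte by byte.
def xml_declared_encoding(raw)
  # Only the start of the document matters; treat it as raw bytes so that
  # invalid characters cannot break the match.
  head = raw.byteslice(0, 200).force_encoding(Encoding::BINARY)

  # Skip a UTF-8 byte order mark if one precedes the declaration.
  bom = "\xEF\xBB\xBF".b
  head = head.byteslice(3..-1) if head.start_with?(bom)

  match = head.match(/\A<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']/)
  match && match[1]
end
```

A nil result means the feed did not declare anything, in which case the XML rules say to assume UTF-8 (or UTF-16 when a matching byte order mark is present).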

David V.

Miquel_O · November 9, 2006, 2:36pm

On Nov 9, 2006, at 6:38 AM, David V. wrote:

Look for the XML header? That one should list encoding.

If it doesn’t, bitch, whine, and moan at the feed author to do so,
charset detection is unavoidably a hack and shouldn’t have to be done by
now if interoperating apps are coded sanely.

Right, cause XML encoding headers never lie.

If you have the header it’s probably best to trust it. If not,
libcharguess is quite accurate, even if David labels it a “hack.”
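
A rough sketch of that "trust the header, otherwise guess" order in plain Ruby follows. The helper name pick_encoding is hypothetical, and the sketch deliberately stops short of calling libcharguess, since the exact API of its Ruby binding is not shown in this thread; it returns nil at the point where a real detector would take over.

```ruby
# Hypothetical helper: choose an encoding for raw feed bytes.
# declared is the name from the XML declaration (or HTTP header), or nil.
def pick_encoding(raw, declared = nil)
  if declared
    begin
      enc = Encoding.find(declared)
      # Trust the header, but only if the bytes are actually valid in it.
      return enc if raw.dup.force_encoding(enc).valid_encoding?
    rescue ArgumentError
      # Ruby does not know this encoding name; fall through to guessing.
    end
  end

  # Undeclared XML is supposed to be UTF-8, so test that next.
  return Encoding::UTF_8 if raw.dup.force_encoding(Encoding::UTF_8).valid_encoding?

  nil # nothing fits: hand the bytes to a statistical detector
end
```

Note that the validity check only catches headers that lie in one direction. Every byte sequence is valid ISO-8859-1, so a UTF-8 feed mislabelled as ISO-8859-1 still passes.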

James Edward G. II

Miquel_O · November 10, 2006, 3:18pm


Miquel O., 11/08/2006 10:50 PM:

I don’t know how to detect the charset of an RSS feed. Can you help me?

The only way is to write an artificial intelligence that a) can
understand any language present in the feeds and b) tries out all
possible encodings until it understands the text.

Given the state and evolution speed of current implementations of
artificial intelligence, it can be expected that such software will be
available soon - as early as next millennium or so.

Jupp

Miquel_O · November 15, 2006, 1:41pm

On 11/9/06, David V. wrote:

Look for the XML header? That one should list encoding.

Shouldn’t you be looking at the HTTP header instead/also?
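
For the HTTP side, the charset lives in the Content-Type header, e.g. "application/rss+xml; charset=ISO-8859-1". A minimal sketch of pulling it out, with http_charset as a made-up helper name:

```ruby
# Hypothetical helper: extract the charset parameter from a Content-Type
# header value. Returns nil when the header is absent or has no charset.
def http_charset(content_type)
  return nil if content_type.nil?

  # The parameter may be quoted and may appear in any letter case.
  match = content_type.match(/;\s*charset\s*=\s*"?([^\s;"]+)"?/i)
  match && match[1]
end
```

With Net::HTTP the header value would come from something like response["Content-Type"]. As far as I know, RFC 3023 gave the HTTP charset precedence over the XML declaration for XML served over HTTP, so it seems reasonable to check it first and fall back to the declaration.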

Or just default to UTF-8, which should cover you anyway for 8859-1-loving
anglophiles and most of the rest of the world. Japan, though, can be a
bit native-charset-centric, especially the further you get from
well-resourced web sites (hobby sites etc.). There was a larger UTF-8
burden of effort there, whereas in the West being non-UTF-8 is just pure
laziness.
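
One caveat on defaulting: UTF-8 and ISO-8859-1 only agree on the ASCII range, so Latin-1 bytes above 0x7F are invalid when read as UTF-8. Since everything ends up in one database anyway, a sketch of normalising each feed to UTF-8 before storing it might look like this. The helper name to_utf8 and the "?" replacement character are my own choices, not anything from the thread.

```ruby
# Hypothetical helper: convert raw feed bytes to valid UTF-8 for storage.
# source is the encoding the bytes are believed to be in; UTF-8 is the
# default when nothing better is known.
def to_utf8(raw, source = Encoding::UTF_8)
  text = raw.dup.force_encoding(source)
  if text.encoding == Encoding::UTF_8
    # Already tagged UTF-8: just replace any invalid byte sequences.
    text.scrub("?")
  else
    # Transcode, replacing anything that cannot be converted.
    text.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
  end
end
```

For example, to_utf8(bytes, "ISO-8859-1") transcodes a Latin-1 feed, while to_utf8(bytes) keeps well-formed UTF-8 as it is and marks stray bytes instead of letting them reach the database.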