Text encodings

Hello,

Is there any way to detect a text's encoding?
For example, whether it is in UTF-8, Windows-1251, or something else?

Thank you.

On 10/07/06, xTRiM wrote:

is there any way, to detect text encoding?
For example, is it in utf8, or in win1251, or something else.

You can’t detect one-byte-per-character encodings easily (i.e. without
statistical analysis) but you can easily tell if something’s UTF-8 or
not:

    class String
      # True if the bytes decode as UTF-8; unpack raises ArgumentError otherwise.
      def is_utf8?
        unpack('U*')
        true
      rescue ArgumentError
        false
      end
    end

    "foo".is_utf8?     #=> true
    "foo\303".is_utf8? #=> false

Not the most efficient way, necessarily, but probably the easiest.

Paul.
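
For readers on Ruby 1.9 or later (an assumption, since the thread predates it), the same check can probably be done without unpacking every codepoint, using String#force_encoding and String#valid_encoding?. This is only a sketch; the helper name utf8_bytes? is made up for illustration and is not from the thread.

```ruby
# Hypothetical helper, not from the thread. Assumes Ruby 1.9+.
def utf8_bytes?(bytes)
  # dup first so the caller's string keeps its original encoding tag
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

p utf8_bytes?("foo")      #=> true  (plain ASCII is also valid UTF-8)
p utf8_bytes?("foo\xC3")  #=> false (lead byte with no continuation byte)
```

Like the unpack version, this only says whether the bytes could be UTF-8. It says nothing about which single-byte encoding a non-UTF-8 string is in.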

On Jul 10, 2006, at 4:47 AM, Takashi Sano wrote:

is there any way, to detect text encoding?
For example, is it in utf8, or in win1251, or something else.

You can use the standard lib NKF’s guess or guess2 (ruby 1.8.2 or
later) method for that. Look up the NKF section in
RDoc Documentation.

In the general case, there’s no safe way to do this unless the
data is XML or comes with an HTTP header from a reliable server (ha
ha ha, I’m sure there must be one somewhere). Probably the best
auto-detector is Mark Pilgrim’s, but it’s in Python:
http://chardet.feedparser.org/

-Tim
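
When the data does declare its encoding, reading the declaration is more reliable than guessing. The following is a rough sketch of pulling the charset out of a Content-Type header value and out of an XML declaration. Both helper names are hypothetical, the regular expressions are deliberately simplistic, and the XML helper assumes the document starts with an ASCII-compatible declaration (no BOM, no UTF-16).

```ruby
# Hypothetical helpers, not from the thread: read a declared encoding
# instead of guessing one.

# Return the charset parameter of a Content-Type header value, or nil.
def charset_from_content_type(value)
  match = value.match(/;\s*charset\s*=\s*"?([^\s;"]+)"?/i)
  match && match[1]
end

# Return the encoding named in a leading XML declaration, or nil.
def encoding_from_xml_declaration(xml)
  # Match on a binary copy so bytes that are invalid in the string's
  # tagged encoding cannot make the regexp match raise.
  match = xml.b.match(/\A<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']/)
  match && match[1]
end

p charset_from_content_type("text/html; charset=windows-1251")
p encoding_from_xml_declaration('<?xml version="1.0" encoding="UTF-8"?><doc/>')
```

A declared encoding is still only a claim, so it may be worth validating the bytes against it before trusting it.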

Nice pointer, Tim. I’ll have to check that out. I did a quick web search
and found a Ruby port incidentally (I have not evaluated it in any way
though):
http://rubyforge.org/projects/chardet/ by Hui Zheng
gem name is “chardet”

Jake

Hi,

2006/7/10, xTRiM wrote:

Hello,

is there any way, to detect text encoding?
For example, is it in utf8, or in win1251, or something else.

You can use the standard lib NKF’s guess or guess2 (ruby 1.8.2 or
later) method for that. Look up the NKF section in
RDoc Documentation.

Takashi Sano
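
A minimal sketch of the NKF suggestion, under a few assumptions: the nkf library can be loaded (recent Ruby releases ship it as a gem rather than as part of the standard library), and NKF.guess returns an Encoding object, as it does on Ruby 1.9 and later (older versions returned NKF constants instead). The wrapper name guess_japanese_encoding is made up for illustration. Note that NKF is aimed at Japanese encodings such as ISO-2022-JP, Shift_JIS, EUC-JP and UTF-8, so it is not expected to identify win1251.

```ruby
require 'nkf'

# Hypothetical wrapper, not from the thread: name NKF's best guess.
def guess_japanese_encoding(bytes)
  NKF.guess(bytes).name
end

# A short Japanese phrase written with \u escapes, stored as UTF-8.
sample = "\u65E5\u672C\u8A9E\u306E\u30C6\u30AD\u30B9\u30C8"
puts guess_japanese_encoding(sample)
```

Very short inputs can be ambiguous between encodings, so the guess is probably more trustworthy on longer text.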