Hi list,

I ran a deep search through this group and other online resources, but I have been unable to find out whether there is a way to guess the charset of a string in Ruby 1.8.6.

I need to ensure a string is always UTF-8 encoded, but Iconv requires the developer to specify both the input and the output charset. On the other hand, Kconv provides a #guess() method but doesn't support Latin or Western encodings.

Any suggestions?
on 25.02.2008 14:26
on 25.02.2008 14:33
On Feb 25, 2008, at 14:25, Simone Carletti wrote:

> Hi list,
> I run a deep search through this group and other resources online but
> I have been unable to find whether is there a way to guess the charset
> of a string in Ruby 1.8.6.
>
> I need to ensure a string is always UTF-8 encoded but Iconv requires
> the developer to specify both in and out charset.
> On the other side, Kconv provides a #guess() method but doesn't
> support Latin or Western encodings.

The best way is to be aware of the charset at every point of data I/O and do the necessary housekeeping. If that's not possible, for example when working on arbitrary text files, the best approximation that I am aware of in Ruby is the charguess library.

-- fxn
on 25.02.2008 14:36
On Mon, Feb 25, 2008 at 8:25 AM, Simone Carletti <weppos@gmail.com> wrote:

> I run a deep search through this group and other resources online but
> I have been unable to find whether is there a way to guess the charset
> of a string in Ruby 1.8.6.
>
> I need to ensure a string is always UTF-8 encoded but Iconv requires
> the developer to specify both in and out charset.
> On the other side, Kconv provides a #guess() method but doesn't
> support Latin or Western encodings.
>
> Any suggestion?

Kconv can guess because the encodings for the set of Asian written languages are distinctive (they don't share much with the Latin character set). What you want is nearly impossible without a large body of text for analysis, and even then the best commercial programs are only taking stabs at probabilities.

(Here's an example: how do you tell the difference between ISO-8859-1 and ISO-8859-15 programmatically? IIRC, the only difference between them is that -15 supports the Euro symbol, replacing a different symbol from -1.)

You're better off seeking a slightly different approach.

-austin
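To make the ISO-8859-1 versus ISO-8859-15 point concrete, here is a minimal sketch that lists the byte values the two charsets decode differently. It assumes a newer Ruby (1.9 or later) with String#encode and String#force_encoding instead of the 1.8.6 Iconv API discussed in this thread, and the helper name `differing_bytes` is made up for illustration.

```ruby
# Sketch only: assumes Ruby 1.9+ transcoding, not Ruby 1.8.6 with Iconv.
# `differing_bytes` is a hypothetical helper name, not part of any library.

# Return the byte values in 0xA0..0xFF that decode to different characters
# under the two given single-byte charsets.
def differing_bytes(charset_a, charset_b)
  (0xA0..0xFF).select do |byte|
    raw = [byte].pack('C') # one-byte binary string
    raw.dup.force_encoding(charset_a).encode('UTF-8') !=
      raw.dup.force_encoding(charset_b).encode('UTF-8')
  end
end

# Print each differing byte with its reading under both charsets.
differing_bytes('ISO-8859-1', 'ISO-8859-15').each do |byte|
  raw = [byte].pack('C')
  latin1 = raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')
  latin9 = raw.dup.force_encoding('ISO-8859-15').encode('UTF-8')
  puts format('0x%02X  ISO-8859-1: %s  ISO-8859-15: %s', byte, latin1, latin9)
end
```

The euro sign at 0xA4 is the best-known difference; the standards differ at eight positions in total. Every one of those bytes is valid in both charsets, so the bytes alone cannot tell you which charset was meant.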
on 25.02.2008 14:45
On Feb 25, 2:35 pm, Austin Ziegler <halosta...@gmail.com> wrote:

> > Any suggestion?
> You're better off seeking a slightly different approach.
>
> -austin
> --
> Austin Ziegler * halosta...@gmail.com * http://www.halostatue.ca/
> * aus...@halostatue.ca * http://www.halostatue.ca/feed/
> * aus...@zieglers.ca

If I'm right, both ISO-8859-1 and ISO-8859-15 belong to Latin1, so I can convert them in the same way using Iconv.iconv('UTF-8', 'LATIN1', 'a string').join.

My goal is not to detect every single charset but to convert all strings from an input into UTF-8.

In the meantime I was reading the code of rFeedParser, the Ruby implementation of the Python FeedParser. I just discovered it depends on a project called rchardet: https://rubyforge.org/projects/rchardet/

I gave it a look and it seems to do exactly what I was looking for.

Is anyone using this library?
on 25.02.2008 15:32
Simone Carletti wrote:

> If I'm right both ISO-8859-1 and ISO-8859-15 belongs to Latin1 thus I
> can convert them in the same way using Iconv.iconv('UTF-8', 'LATIN1',
> 'a string').join.

You'll probably lose the € (euro) sign from ISO-8859-15 sources, as LATIN1 is probably equivalent to ISO-8859-1.

> My goal is not to be able to detect each single different charset but
> to convert all string from an input into UTF-8.

In fact, it's the same problem: if you don't know the original charset, you can't convert properly to UTF-8.

> In the meantime I was reading the code of rFeedParser, the Ruby
> implementation of Python FeedParser.
> I just discovered it depends on a project called https://rubyforge.org/projects/rchardet/
>
> I gave it a look and it seems to do exactly what I was looking for.
>
> Anyone is using this library?

I use chardet 0.9.0. I believe they work more or less the same.

I use it as a fallback mechanism when I can't reliably get the original charset from feeds. Some feeds claim to be UTF-8 encoded but contain invalid code points (your database isn't happy when you try to feed it something like that...). This becomes a mess when you find out that each item in a feed may use a different charset, because people aggregate different sources without checking the charsets themselves...

The behavior I'm using is:

1/ Try the advertised charset with Iconv('utf-8', charset), even if charset =~ /^utf-?8$/i
   succeeds? -> END
   fails? (Exception) -> continue
2/ Use chardet to guess the charset.
3/ Iconv('utf-8', chardet_charset).

Good luck, you're in for a lot of pain...

Lionel
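The three steps above can be sketched roughly as follows. This is not Lionel's actual code: it assumes a newer Ruby (1.9 or later) where String#encode stands in for Iconv, and `guesser` is a hypothetical callable standing in for whatever chardet-style library you use. The names `to_utf8` and `try_charset` are made up for illustration.

```ruby
# Sketch only: assumes Ruby 1.9+ (String#encode) instead of Iconv on 1.8.6.
# `guesser` is a hypothetical callable that returns a charset name or nil.

# Try to read `raw` as `charset` and return a UTF-8 copy, or nil on failure.
def try_charset(raw, charset)
  return nil if charset.nil?
  candidate = raw.dup.force_encoding(charset)
  # Validate even when the charset is already UTF-8 (step 1 in the post).
  return nil unless candidate.valid_encoding?
  candidate.encode('UTF-8')
rescue ArgumentError, EncodingError
  # Unknown charset name or a failed conversion: treat as "did not work".
  nil
end

# Step 1: advertised charset. Steps 2 and 3: fall back to the guessed one.
def to_utf8(raw, advertised, guesser)
  converted = try_charset(raw, advertised)
  return converted if converted
  try_charset(raw, guesser.call(raw))
end
```

A caller would pass the feed's declared charset plus a lambda wrapping the detection library, for example `to_utf8(body, 'UTF-8', lambda { |s| my_detector(s) })`, where `my_detector` is again a placeholder. A nil result means neither charset produced a valid string.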
on 25.02.2008 21:10
> I use it as a fallback mechanism when I can't reliably get the original charset from feeds.

That's a great example, thank you. Unfortunately I don't have a real charset header to check. :( I must rely only on the input string.

On Feb 25, 3:32 pm, Lionel Bouton <lionel-subscript...@bouton.name> wrote:

> Good luck, you're in for a lot of pain...
>
> Lionel

Thanks, Lionel! :D
on 26.02.2008 03:37
On 25/02/2008, Simone Carletti <weppos@gmail.com> wrote:

> > I use it as a fallback mechanism when I can't reliably get the original
> > charset from feeds.
>
> That's a great example, thank you.
> Unfortunately I don't have a real charset header to check. :( I must
> rely only on input string.

You can ask a crystal ball as well.

The multibyte encodings can often be distinguished by their structure - UTF-8, perhaps UTF-16, the Asian encodings. If something passes for a valid string in a multibyte encoding, it very likely is a string in that encoding.

However, the Latin 8-bit encodings are all the same - 7-bit ASCII with some mess attached in the upper 128 characters. By converting from any of these you get perfectly valid UTF-8, but different gibberish each time.

You can sometimes tell the ISO variant from the Windows variant because some control characters are at different positions - and these should not appear in text. But that does not help you at all - you still don't know which of the Latin encodings you got.

If you know the language (and it's one of the few supported) you can use enca.

If the language is not supported you can build the filter yourself - basically you collect the set of accented characters (those with the 8th bit set) in your language, and encode them in the different encodings (the DOS and Windows codepages, the ISO encoding, any other legacy encodings). You get sets of bytes that usually overlap but also contain some unique bytes. When you see one of those bytes, you know which encoding you should use.

Good luck :-)

Michal
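A minimal sketch of the do-it-yourself filter described above, using Polish as the example language. It assumes a newer Ruby (1.9 or later) with String#encode, and a UTF-8 source file. The method names, the letter list and the choice of two candidate encodings are illustrative assumptions, not something from this thread.

```ruby
# Sketch only: assumes Ruby 1.9+ and a UTF-8 source file.
# All names below are hypothetical; Polish is just an example language.

POLISH_ACCENTED = 'ąćęłńóśźż'
CANDIDATES = ['ISO-8859-2', 'Windows-1250']

# For each encoding, keep the bytes that only this encoding uses for the
# given accented letters.
def unique_bytes(accented, encodings)
  byte_sets = {}
  encodings.each { |enc| byte_sets[enc] = accented.encode(enc).bytes.uniq }
  unique = {}
  byte_sets.each do |enc, bytes|
    others = byte_sets.reject { |other, _| other == enc }.values.flatten
    unique[enc] = bytes - others
  end
  unique
end

# Structure check first (valid, non-ASCII UTF-8), then the first unique byte.
def guess_encoding(raw, unique)
  utf8 = raw.dup.force_encoding('UTF-8')
  return 'UTF-8' if utf8.valid_encoding? && !utf8.ascii_only?
  raw.each_byte do |byte|
    unique.each { |enc, bytes| return enc if bytes.include?(byte) }
  end
  nil # pure ASCII, or only bytes shared by every candidate
end
```

With a table built by `unique_bytes(POLISH_ACCENTED, CANDIDATES)`, `guess_encoding` returns nil when the text contains only letters that both encodings place at the same byte, which is exactly the ambiguous case. The approach only works if the language guess is right, since the "unique" bytes are unique only relative to that language's alphabet.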