Ruby Forum > Ruby > Same character shows different codes on two machines

Posted by Ryan Smith (sunraise2005)
on 07.02.2010 18:37
One Chinese character shows up as a different byte sequence on two different machines.

machine A:  \243\244
machine B:  \302\245

So I have to use a different pattern on each machine, like this:
machine A:    text.split("\243\244")
machine B:    text.split("\302\245")

I know this is not the proper way, but could someone tell me:
  What is the root cause?
  What is different between machine A and B?
  What is the proper way to handle this?

thanks very much!

-ryan
Posted by Paul Harrington (pharrington)
on 07.02.2010 19:57
Are the two machines getting the text from the same source?
What is the source?
What is the encoding of the source?
What locale is Ruby running under?
What version of Ruby are you running?
What encoding is the relevant Ruby source file set to?
How are you retrieving the text from the source?
What encoding is the string once it's finally retrieved?
Posted by Ryan Smith (sunraise2005)
on 08.02.2010 01:50
> Are the two machines getting the text from the same source?
Yes, from the same public web page, e.g. http://www.abc.com/xyz.htm

> What is the source?
As above, the source is a web page; I use Watir to parse it.

> What is the encoding of the source?
text/html; charset=utf-8

> What locale is Ruby running under?
Do you mean the Windows system locale?
Machine A is "Win XP Pro (English version)", system locale is "PRC".
Machine B is "Win XP Pro (Chinese version)", system locale is "中文(中国)".
So as far as I can tell, A and B have identical system locales.

> What version of Ruby are you running?
Machines A and B both run
ruby 1.8.6 (2008-08-11 patchlevel 287) [i386-mswin32]

> What encoding is the relevant Ruby source file set to?
A and B use the same set of Ruby scripts, which are UTF-8 encoded. Each
has this line at the top of the script:
  # -*- coding: utf-8 -*-

> How are you retrieving the text from the source?
Using Watir to parse the text in the web page.

> What encoding is the string once it's finally retrieved?
I have no idea. How do I find this out?
Posted by Walton Hoops (vyper)
on 08.02.2010 05:34
(Received via mailing list)
On 2/7/2010 5:51 PM, Ryan Smith wrote:
> What encoding is the string once its finally retrieved?
> I have no idea with this, How to get this?
>    

irb(main):001:0> "Hello".encoding
=> #<Encoding:IBM437>

Walton
Posted by Ryan Smith (sunraise2005)
on 08.02.2010 07:26
Thanks, Walton,

Do I need to require something?

irb(main):006:0> "Hello".encoding
NoMethodError: undefined method `encoding' for "Hello":String
        from (irb):6
Posted by Brian Candler (candlerb)
on 08.02.2010 11:20
Ryan Smith wrote:
> one chinese character show different code in two different machine.
> 
> machine A:  \243\244
> machine B:  \302\245

In hex those are: \xa3\xa4
                  \xc2\xa5

The first is not valid UTF-8. I suppose it might be UTF-16: U+A3A4 or 
U+A4A3 depending on little or big-endian. Or it could be some older 
proprietary Asian encoding.

The second of these could be UTF-8. If so it would be codepoint 165, the 
'yen' symbol. Or it could be U+C2A5 in UTF-16.
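
The byte analysis above can be checked from Ruby itself. What follows is a minimal sketch, not a diagnosis: `describe_bytes` is a hypothetical helper, and GBK is only a guess at the "older Asian encoding" (it is the legacy code page usually associated with a PRC system locale). It needs Ruby 1.9 or later, since it relies on `force_encoding` and `valid_encoding?`.

```ruby
# Report the raw bytes of +str+ in hex and, if they are well-formed in
# +encoding_name+, the Unicode code points they decode to.
# Hypothetical helper for illustration; requires Ruby 1.9 or later.
def describe_bytes(str, encoding_name)
  labeled = str.dup.force_encoding(encoding_name)
  info = { :hex => str.unpack("H*").first, :valid => labeled.valid_encoding? }
  # Only transcode when the bytes are valid in the candidate encoding.
  info[:codepoints] = labeled.encode("UTF-8").codepoints.to_a if info[:valid]
  info
end

machine_a = "\243\244"  # bytes reported on machine A
machine_b = "\302\245"  # bytes reported on machine B

p describe_bytes(machine_a, "UTF-8")  # not valid UTF-8, so no code points
p describe_bytes(machine_a, "GBK")    # assumption: GBK; decodes to U+FFE5
p describe_bytes(machine_b, "UTF-8")  # decodes to U+00A5 (code point 165)
```

If the GBK guess is right, machine A is returning the fullwidth yen sign (U+FFE5) in the legacy code page while machine B is returning the ordinary yen sign (U+00A5) in UTF-8, so the two machines may not even be handing back the same character.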
Posted by Marnen Laibow-Koser (marnen)
on 08.02.2010 15:20
Ryan Smith wrote:
> Thanks, Walton,
> 
> need include something?
> 
> irb(main):006:0> "Hello".encoding
> NoMethodError: undefined method `encoding' for "Hello":String
>         from (irb):6

No. String#encoding was added in Ruby 1.9, so that method does not exist in 1.8.
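
A script that has to run on both 1.8 and 1.9 can check for the method before calling it. This is only a sketch: `encoding_report` is a made-up helper name, and on 1.8 all it can do is show the raw bytes, because strings there carry no encoding label.

```ruby
# Describe a string's encoding where Ruby can report one (1.9 and later),
# and fall back to a plain hex dump of its bytes on Ruby 1.8.
# Hypothetical helper for illustration.
def encoding_report(str)
  hex = str.unpack("C*").map { |b| "%02x" % b }.join(" ")
  if str.respond_to?(:encoding)
    "#{str.encoding.name}: #{hex}"
  else
    "no encoding info (Ruby 1.8): #{hex}"
  end
end

puts encoding_report("Hello")
```

The encoding name printed on 1.9 depends on the environment (source file encoding, console code page), which is why the irb session quoted earlier shows IBM437.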

Best,
-- 
Marnen Laibow-Koser
http://www.marnen.org
marnen@marnen.org
Posted by Ryan Smith (sunraise2005)
on 08.02.2010 17:02
Thanks Brian, see my inline comments.


> The first is not valid UTF-8. I suppose it might be UTF-16: U+A3A4 or 
> U+A4A3 depending on little or big-endian. Or it could be some older 
> proprietary Asian encoding.

[Ryan] How can I convert this to UTF-8? It is an English XP Pro with PRC 
as the system locale.

> 
> The second of these could be UTF-8. If so it would be codepoint 165, the 
> 'yen' symbol. Or it could be U+C2A5 in UTF-16.


[Ryan] Yes, it is the Chinese currency (CNY) 'yen' symbol.
Posted by Brian Candler (candlerb)
on 08.02.2010 18:52
Ryan Smith wrote:
>> The first is not valid UTF-8. I suppose it might be UTF-16: U+A3A4 or 
>> U+A4A3 depending on little or big-endian. Or it could be some older 
>> proprietary Asian encoding.
> 
> [Ryan] How to correct this (to UTF-8), it is a English XP Pro with PRC 
> as system locale.

Sorry, I have no idea. Are you sure that \xa3\xa4 corresponds exactly to 
that one character? Is the rest of the encoding variable length or fixed 
length? (e.g. are all characters two bytes long, even a western letter 
"A"?)

Questions about Microsoft operating systems and what encodings they use 
really belong in a Microsoft users' forum, as it's not anything to do 
with Ruby.
Posted by Ryan Smith (sunraise2005)
on 08.02.2010 18:57
I have no idea either, but I will upgrade to Ruby 1.9 to make use of the 
String#encoding feature. Thank you.
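
For reference, once on Ruby 1.9 the per-machine patterns could be replaced by normalizing the text to UTF-8 first and splitting once. The following is a sketch under stated assumptions, not a confirmed fix: `to_utf8`, `split_on_yen`, `FALLBACK_ENCODING` and `YEN` are hypothetical names, the fallback encoding is guessed to be GBK from the PRC locale, and the "price...100" strings are invented test input.

```ruby
# Assumption: text that is not valid UTF-8 is in the legacy code page of a
# PRC-locale Windows machine, taken here to be GBK.
FALLBACK_ENCODING = "GBK"

# Return +raw+ as a valid UTF-8 string: keep it if the bytes already form
# valid UTF-8, otherwise reinterpret them as +fallback+ and transcode.
def to_utf8(raw, fallback = FALLBACK_ENCODING)
  utf8 = raw.dup.force_encoding("UTF-8")
  return utf8 if utf8.valid_encoding?
  raw.dup.force_encoding(fallback).encode("UTF-8")
end

# Match both U+00A5 (yen sign) and U+FFE5 (fullwidth yen sign), since the
# two machines may be returning different characters.
YEN = /[\u00A5\uFFE5]/

def split_on_yen(raw)
  to_utf8(raw).split(YEN)
end

p split_on_yen("price\243\244100")  # bytes as seen on machine A
p split_on_yen("price\302\245100")  # bytes as seen on machine B
```

The validity check is a heuristic: some GBK byte sequences also happen to be valid UTF-8, so it is safer to find out which encoding Watir actually returns on each machine and convert explicitly.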