Forum: Ruby-core [ruby-trunk - Bug #7156][Open] Invalid byte sequence in US-ASCII when using URI from std lib

Posted by t0d0r (Todor Dragnev) (Guest)
on 2012-10-14 01:53
(Received via mailing list)
Issue #7156 has been reported by t0d0r (Todor Dragnev).

----------------------------------------
Bug #7156: Invalid byte sequence in US-ASCII when using URI from std lib
https://bugs.ruby-lang.org/issues/7156

Author: t0d0r (Todor Dragnev)
Status: Open
Priority: Normal
Assignee:
Category: lib
Target version:
ruby -v: 1.9.3


Invalid byte sequence in US-ASCII on ruby 1.9.3

I get this error when trying to open a URL that contains Bulgarian text (UTF-8: 
"История"). The problem seems to be in uri/common.rb from the Ruby 
standard library...

Adding str.force_encoding(Encoding::BINARY) to the following method fixes the 
problem:

class URI::Parser
  def escape(str, unsafe = @regexp[:UNSAFE])
    unless unsafe.kind_of?(Regexp)
      # perhaps unsafe is String object
      unsafe = Regexp.new("[#{Regexp.quote(unsafe)}]", false)
    end
    str.force_encoding(Encoding::BINARY) # FIX
    str.gsub(unsafe) do
      us = $&
      tmp = ''
      us.each_byte do |uc|
        tmp << sprintf('%%%02X', uc)
      end
      tmp
    end.force_encoding(Encoding::US_ASCII)
  end
end

One more suggestion: maybe US_ASCII should be replaced with 
Encoding::BINARY too?
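
The report does not say how the failing string was produced. One plausible trigger, assumed here and not confirmed anywhere in the thread, is a string that holds UTF-8 bytes but is tagged US-ASCII (for example, input read under a C locale). The sketch below reproduces the error message with such a string and shows that working on a binary copy avoids it. `percent_encode` and `UNSAFE_BYTES` are hypothetical names for illustration, not part of the uri library.

```ruby
# Hypothetical helper, not the uri library's implementation.
# Matches every character outside the RFC 3986 "unreserved" set.
UNSAFE_BYTES = /[^a-zA-Z0-9\-._~]/

def percent_encode(str, binary: true)
  work = str.dup                                  # leave the caller's string untouched
  work.force_encoding(Encoding::BINARY) if binary # treat the data as raw bytes
  escaped = work.gsub(UNSAFE_BYTES) do |match|
    match.each_byte.map { |byte| format('%%%02X', byte) }.join
  end
  escaped.force_encoding(Encoding::US_ASCII)      # the result is pure ASCII
end

# Assumption: UTF-8 bytes that carry the wrong encoding tag.
mislabeled = "История".dup.force_encoding(Encoding::US_ASCII)

begin
  percent_encode(mislabeled, binary: false)
rescue ArgumentError => e
  puts e.message  # invalid byte sequence in US-ASCII
end

puts percent_encode(mislabeled)
```

Unlike the patch above, this sketch calls `dup` first, so the caller's string keeps its original encoding tag.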
Posted by meta (mathew murphy) (Guest)
on 2012-10-15 21:15
(Received via mailing list)
Issue #7156 has been updated by meta (mathew murphy).


Which part of the URL contains the UTF-8 characters?

If it's the domain, you need to encode it as Punycode before 
passing it to Ruby.

If it's in the path, Ruby ought to handle it for IRI compliance, but 
probably doesn't right now...

http://www.w3.org/International/articles/idn-and-iri/
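
For the path case, a caller can percent-encode the non-ASCII bytes before handing the URL to the uri library. This is only a sketch of that workaround: `escape_path` is a hypothetical helper, and the host `example.com` and the path are made-up values.

```ruby
require "uri"

# Hypothetical helper: percent-encode every byte of a path except
# unreserved ASCII characters and the "/" separator.
def escape_path(path)
  path.b.gsub(/[^a-zA-Z0-9\-._~\/]/) { |byte| format('%%%02X', byte.ord) }
      .force_encoding(Encoding::US_ASCII)
end

# Placeholder URL for illustration only.
uri = URI.parse("http://example.com" + escape_path("/wiki/История"))
puts uri.path
```

Note that a path which is already percent-encoded would be encoded twice, because `%` itself is escaped.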
----------------------------------------
Bug #7156: Invalid byte sequence in US-ASCII when using URI from std lib
https://bugs.ruby-lang.org/issues/7156#change-30788

Author: t0d0r (Todor Dragnev)
Status: Open
Priority: Normal
Assignee:
Category: lib
Target version:
ruby -v: 1.9.3


Posted by mame (Yusuke Endoh) (Guest)
on 2012-11-06 12:46
(Received via mailing list)
Issue #7156 has been updated by mame (Yusuke Endoh).

File bulgarian.rb added
Status changed from Open to Feedback
Target version set to 2.0.0

I'm not sure what you want.  I cannot reproduce this issue with the 
following code.

    $ cat bulgarian.rb
    # coding: UTF-8
    require "uri"
    p URI.escape("История")

    $ ruby bulgarian.rb
    "%D0%98%D1%81%D1%82%D0%BE%D1%80%D0%B8%D1%8F"

Could you please give us example code, the expected result, and the actual one?

--
Yusuke Endoh <mame@tsg.ne.jp>
----------------------------------------
Bug #7156: Invalid byte sequence in US-ASCII when using URI from std lib
https://bugs.ruby-lang.org/issues/7156#change-32489

Author: t0d0r (Todor Dragnev)
Status: Feedback
Priority: Normal
Assignee:
Category: lib
Target version: 2.0.0
ruby -v: 1.9.3


Posted by ko1 (Koichi Sasada) (Guest)
on 2013-02-17 06:02
(Received via mailing list)
Issue #7156 has been updated by ko1 (Koichi Sasada).

Target version changed from 2.0.0 to next minor

No feedback.

----------------------------------------
Bug #7156: Invalid byte sequence in US-ASCII when using URI from std lib
https://bugs.ruby-lang.org/issues/7156#change-36365

Author: t0d0r (Todor Dragnev)
Status: Feedback
Priority: Normal
Assignee:
Category: lib
Target version: next minor
ruby -v: 1.9.3


Posted by ko1 (Koichi Sasada) (Guest)
on 2013-02-18 01:14
(Received via mailing list)
Issue #7156 has been updated by ko1 (Koichi Sasada).

Assignee set to naruse (Yui NARUSE)


----------------------------------------
Bug #7156: Invalid byte sequence in US-ASCII when using URI from std lib
https://bugs.ruby-lang.org/issues/7156#change-36469

Author: t0d0r (Todor Dragnev)
Status: Feedback
Priority: Normal
Assignee: naruse (Yui NARUSE)
Category: lib
Target version: next minor
ruby -v: 1.9.3

