Awesome! I must have overlooked the each_char method in the jcode library.
In the meantime, I modified Julik's unicode_hacks so that it does
not overload the existing String methods. The modified source
code is attached at the end of this email. The UTF-8 compatible
equivalents are prefixed with ‘u_’. For example, the length of a UTF-8
string is returned by ‘u_length’, a substring of a UTF-8 string is
returned by ‘u_slice’, etc. This hack requires the unicode gem.
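
To make the ‘u_’ naming concrete, here is a minimal stand-in sketch, not the attached hack itself. It assumes Ruby 1.9 or later, skips the unicode gem's normalization and the $KCODE check, and only shows how prefixed methods can count and slice by codepoint while leaving the built-in String methods alone. The method bodies below are illustrative assumptions.

```ruby
# Stand-in sketch only: no unicode gem, no normalization, no $KCODE check.
class String
  # Number of codepoints rather than bytes.
  def u_length
    unpack("U*").size
  end

  # Substring addressed by codepoint offset and length.
  def u_slice(start, len)
    piece = unpack("U*").slice(start, len)
    piece && piece.pack("U*")
  end
end

word = "caf\u00e9 au lait"   # U+00E9 takes two bytes in UTF-8
puts word.bytesize           # byte count: 13
puts word.u_length           # codepoint count: 12
puts word.u_slice(0, 4)      # the first four codepoints
```

Because the new methods carry a prefix, code that relies on the byte-oriented behaviour of the standard methods keeps working unchanged.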
Thank you, Alex, for the tip. I will use it for simpler needs.
daesan
PS: the original unicode_hacks is available at
http://julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/
On Mar 22, 2006, at 2:22 PM, Alex Zhukov wrote:
each_char do
solve the problem, but it also seems to change the methods of
Rails mailing list
[email protected]
http://lists.rubyonrails.org/mailman/listinfo/rails
This is a modified version of unicode_hacks so that regular String
methods are not overloaded. Instead, UTF-8 compatible equivalent
methods are prefixed with “u_”.
begin
require 'unicode'
# Do some SUBSTANTIAL rewiring of the String class. This doesn't solve all
# of the problems, but it does solve some. And it will work in a UTF-8
# context only, so we step aside if $KCODE is not UTF-8 (Japanese people
# prefer JIS, right?)
#
# Following the tradition - I am grateful to Yoshida MASATO for the
# Unicode gem.
#
# The core capabilities of String are changed by this module only when
# $KCODE is set to 'UTF8'. Strings start to properly trim, properly strip
# and size, and do many other nice things they have been supposed to do
# for ages.
#
# All "old" byte-oriented methods of Strings are still available with the
# "byte_" prefix (e.g. "byte_reverse", "byte_slice")
class String
end
unless defined?(String::UNICODE_REWIRED) # rewire only once even if it's reloaded
String.class_eval do
UNICODE_REWIRED = true
class <<self
# Returns a regular expression pattern that matches the passed Unicode codepoints
def codepoints_to_pattern(array_of_codepoints)
array_of_codepoints.collect{ |e| [e].pack "U*" }.join('|')
end
end
UNICODE_WHITESPACE = [
(0x0009..0x000D).to_a, # White_Space # Cc [5] …
0x0020, # White_Space # Zs SPACE
0x0085, # White_Space # Cc
0x00A0, # White_Space # Zs NO-BREAK SPACE
0x1680, # White_Space # Zs OGHAM SPACE MARK
0x180E, # White_Space # Zs MONGOLIAN VOWEL SEPARATOR
(0x2000..0x200A).to_a, # White_Space # Zs [11] EN QUAD..HAIR SPACE
0x2028, # White_Space # Zl LINE SEPARATOR
0x2029, # White_Space # Zp PARAGRAPH SEPARATOR
0x202F, # White_Space # Zs NARROW NO-BREAK SPACE
0x205F, # White_Space # Zs MEDIUM MATHEMATICAL SPACE
0x3000, # White_Space # Zs IDEOGRAPHIC SPACE
].flatten
UNICODE_LEADERS_AND_TRAILERS = UNICODE_WHITESPACE + [65279] # ZERO-WIDTH NO-BREAK SPACE aka BOM
# Borrowed from the Kconv library by Shinji KONO - (also as seen on the W3C site)
UTF8_PAT = /\A(?:
[\x00-\x7f] |
[\xc2-\xdf] [\x80-\xbf] |
\xe0 [\xa0-\xbf] [\x80-\xbf] |
[\xe1-\xef] [\x80-\xbf] [\x80-\xbf] |
\xf0 [\x90-\xbf] [\x80-\xbf] [\x80-\xbf] |
[\xf1-\xf3] [\x80-\xbf] [\x80-\xbf] [\x80-\xbf] |
\xf4 [\x80-\x8f] [\x80-\xbf] [\x80-\xbf]
)*\z/xn
# \A and \z anchor to the whole string; ^ and $ would also match at every line break
UNICODE_TRAILERS_PAT = /(#{codepoints_to_pattern(UNICODE_LEADERS_AND_TRAILERS)})+\z/
UNICODE_LEADERS_PAT = /\A(#{codepoints_to_pattern(UNICODE_LEADERS_AND_TRAILERS)})+/
# Performs Unicode-aware conversion to lowercase
def u_downcase
return downcase unless utf8_pragma?
Unicode::downcase(Unicode::normalize_KC(self))
end
def u_downcase! #:nodoc:
self.replace u_downcase
end
# Performs Unicode-aware conversion to UPPERCASE
def u_upcase
return upcase unless utf8_pragma?
Unicode::upcase(Unicode::normalize_KC(self))
end
def u_upcase! #:nodoc:
self.replace u_upcase
end
# Performs Unicode-aware Capitalization
def u_capitalize
return capitalize unless utf8_pragma?
Unicode::capitalize(Unicode::normalize_KC(self))
end
def u_capitalize! #:nodoc:
self.replace u_capitalize
end
# Instead of fetching bytes will fetch the string composed of codepoints at the specified offsets.
# The call with a single integer as argument will still return a byte.
# If the string is not a valid UTF-8 sequence bytes will be returned
def u_slice(*args)
return slice(*args) unless utf8_pragma?
if (args.size == 2 && args.first.is_a?(Range))
raise TypeError, 'cannot convert Range into Integer' # Do as if we were native
elsif (args.first.is_a?(Range) or args.size == 2)
# normalize to KC so that all combined glyphs are spliced together and ligatures split, and then...
Unicode::normalize_KC(self).unpack("U*").send(:slice, *args).pack("U*")
else
slice(*args)
end
end
def u_index(*args)
if (args.first.is_a?(String) and !args.first.has_utf8_semantics?) or !utf8_pragma?
return index(*args)
end
bidx = index(*args)
return nil unless bidx
return self.slice(0...bidx).unpack("U*").size
end
# Replacement for the strip routine. Removes all Unicode whitespace,
# including line breaks and nonbreaking spaces, from both ends of the string
def u_strip
return strip unless utf8_pragma?
u_lstrip.u_rstrip
end
# Replacement for the lstrip routine. Removes all leading Unicode whitespace,
# including line breaks and nonbreaking spaces
def u_lstrip
return lstrip unless utf8_pragma?
gsub(UNICODE_LEADERS_PAT, '')
end
# Replacement for the rstrip routine. Removes all trailing Unicode whitespace,
# including line breaks and nonbreaking spaces
def u_rstrip
return rstrip unless utf8_pragma?
gsub(UNICODE_TRAILERS_PAT, '')
end
def u_lstrip! #:nodoc:
self.replace u_lstrip
end
def u_rstrip! #:nodoc:
self.replace u_rstrip
end
def u_strip! #:nodoc:
self.replace u_strip
end
# Decomposes the string and returns the decomposed string
def decompose
Unicode::decompose(self)
end
# Normalizes the string to form KC and returns the result
def normalize_KC
Unicode::normalize_KC(self)
end
# Normalizes the string to form D and returns the result
def normalize_D
Unicode::normalize_D(self)
end
# Normalizes the string to form C and returns the result
def normalize_C
Unicode::normalize_C(self)
end
# Provides replacement for the size routine. Will first normalize to KC and then return the number
# of codepoints
def u_size
return size unless utf8_pragma?
# normalize to KC so that all combiner letters are spliced together, and then...
Unicode::normalize_KC(self).unpack("U*").size
end
def u_length #:nodoc:
u_size
end
# Provides replacement for the reverse routine. Will first normalize to KC and then reverse the resulting
# codepoints
def u_reverse
return reverse unless utf8_pragma?
Unicode::normalize_KC(self).unpack("U*").reverse.pack("U*")
end
# Inserts the string at codepoint offset specified in offset.
def u_insert(offset, fragment)
return insert(offset, fragment) unless utf8_pragma?
self.replace(unpack("U*").insert(offset, fragment.unpack("U*")).flatten.pack("U*"))
end
# Returns false or true depending on whether the string has UTF-8 semantics (a String used for purely
# byte resources is unlikely to have them).
def has_utf8_semantics?
!UTF8_PAT.match(self).nil?
end
private
def utf8_pragma?
($KCODE == 'UTF8') and (self.has_utf8_semantics?)
end
end
if defined?(RAILS_DEFAULT_LOGGER)
RAILS_DEFAULT_LOGGER.warn "Standard string functions have been overloaded with " +
"UTF8-aware versions"
end
end
rescue LoadError
if defined?(RAILS_DEFAULT_LOGGER)
RAILS_DEFAULT_LOGGER.error "You don't have the Unicode library installed, most string " +
"operations will stay single-byte"
end
end
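
Two ideas in the code above can be tried out in isolation: building a strip pattern from a list of codepoints, and normalizing before counting codepoints. The sketch below is an illustration under assumptions, not part of the attached source. It assumes Ruby 2.2 or later, uses only a subset of the whitespace table, and swaps the unicode gem's Unicode::normalize_KC for the built-in String#unicode_normalize(:nfkc). The names SAMPLE_SPACES, LEADING, TRAILING and sketch_strip are hypothetical.

```ruby
# Illustrative subset of the whitespace table, plus the BOM (U+FEFF).
SAMPLE_SPACES = [0x0009, 0x000A, 0x0020, 0x00A0, 0x2028, 0x3000, 0xFEFF]

# Same idea as codepoints_to_pattern: one UTF-8 string per codepoint,
# combined into an alternation. Regexp.union escapes each piece.
space = Regexp.union(SAMPLE_SPACES.map { |cp| [cp].pack("U") })

# Anchored to the very start and the very end of the string.
LEADING  = /\A(?:#{space})+/
TRAILING = /(?:#{space})+\z/

# Removes the sample whitespace from both ends of a string.
def sketch_strip(str)
  str.sub(LEADING, "").sub(TRAILING, "")
end

padded = "\u00a0\ufeff hello world\u3000\n"
p sketch_strip(padded)                      # => "hello world"

# "e" followed by a combining acute accent is two codepoints until it is
# normalized into the single precomposed codepoint U+00E9.
decomposed = "e\u0301"
p decomposed.unpack("U*").size                            # => 2
p decomposed.unicode_normalize(:nfkc).unpack("U*").size   # => 1
```

The second half shows why the size and reverse methods normalize first: without it, a letter and its combining mark would be counted, or reversed, as separate codepoints.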