How do I get a substring of a UTF-8 string?


#1

I’m trying to get a substring of a UTF-8 encoded string (say, the first
50 characters). String#[0..49] gives me the first 50 bytes, not the
first 50 characters.

I know there is the jcode library, but it only lets you count the number
of characters in a UTF-8 string.

The unicode gem doesn’t seem to help much. The unicode_hacks gem seems to
solve the problem, but it also seems to change the methods of the String
class directly, which may confuse Rails, since Rails expects String#[]
to give back bytes, not characters.

Can somebody point out which route I should take?
Should I implement the substring methods myself? Hasn’t someone
already solved this problem?

thanks,

daesan


#2

If you only need a substring you can fake it with the jcode library.
Use something like this:

    $KCODE = 'u'
    require 'jcode'

    class String
      # Returns characters a..b (1-based, inclusive) of a UTF-8 string.
      def usubstr(a, b)
        i = 0
        buff = ''
        each_char do |c|
          i += 1
          buff << c if i >= a
          return buff if i == b
        end
        buff # the string was shorter than b characters
      end
    end

    bla = "put here some unicode string"
    puts bla.usubstr(6, 10)

It works with Cyrillic UTF-8 text, should work for other languages too.
I hope this helps.
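
For what it's worth, the same character-by-character idea can be sketched without jcode by scanning the string with a UTF-8 regexp. This is only an illustration under assumptions: `u_substr` is a made-up helper name (not part of jcode or any gem), it takes a 0-based start and a length rather than two positions, and the sample string is built with `pack` just to keep the source ASCII-only.

```ruby
# Character-based substring via a UTF-8 regexp scan.
# u_substr is a hypothetical helper for illustration only.
def u_substr(str, start, length)
  # /u treats the string as UTF-8, /m lets "." match newlines too
  chars = str.scan(/./mu)
  # Array#[] returns nil when start is past the end, so fall back to []
  (chars[start, length] || []).join
end

sample = [0x43f, 0x440, 0x438, 0x432, 0x435, 0x442].pack("U*") # 6 Cyrillic letters, 12 bytes
first3 = u_substr(sample, 0, 3)
puts first3
```

Building the whole character array is wasteful for long strings, so the early-return loop above may still be the better choice when only a short prefix is needed.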


best regards,
Alex Zhukov
removed_email_address@domain.invalid


#3

Awesome! I must have overlooked each_char method in jcode library.

In the meantime, I modified unicode_hacks by Julik so that it would
not overload existing String methods. I attached the modified source
code at the end of this email. UTF-8 compatible equivalent methods
are prefixed by ‘u_’. For example, length of UTF-8 string is
returned by ‘u_length’ method, substring of UTF-8 string is returned
by ‘u_slice’, etc. This hack requires unicode gem.
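
To give a rough idea of what the length and slice methods do underneath, here is a minimal sketch that uses only the core `unpack("U*")` / `pack("U*")` calls. The names `codepoint_length` and `codepoint_slice` are made up for this example, and unlike `u_length` / `u_slice` they do not normalize the string first, so they do not need the unicode gem.

```ruby
# Codepoint-based length and slice using only core pack/unpack.
# codepoint_length and codepoint_slice are hypothetical names for illustration;
# they skip the KC normalization that the attached code performs.
def codepoint_length(str)
  str.unpack("U*").size
end

def codepoint_slice(str, start, length)
  # Array#[] returns nil when start is past the end, so fall back to []
  codepoints = str.unpack("U*")[start, length] || []
  codepoints.pack("U*")
end

sample = [0xd55c, 0xae00].pack("U*") + "abc" # two Hangul syllables followed by ASCII
puts codepoint_length(sample)                # counts codepoints, not bytes
puts codepoint_slice(sample, 0, 3)
```

Without normalization, a letter written as a base character plus a combining mark counts as two codepoints, which is the case the normalize_KC step in the attached code is meant to handle.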

Thank you, Alex, for the tip. I'll use it for simpler needs. :)

daesan

ps: original unicode_hacks is available at
http://julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/

On Mar 22, 2006, at 2:22 PM, Alex Zhukov wrote:

> each_char do

> solve the problem, but it also seems to change the methods of


Rails mailing list
removed_email_address@domain.invalid
http://lists.rubyonrails.org/mailman/listinfo/rails



# This is a modified version of unicode_hacks so that regular String
# methods are not overloaded. Instead, UTF-8 compatible equivalent
# methods are prefixed with "u_".

begin
require 'unicode'

# Do some SUBSTANTIAL rewiring of the String class. This doesn't
# solve all of the problems, but it does solve some. And it will work
# in UTF-8 context only, so we step aside if $KCODE is not UTF-8
# (Japanese people prefer JIS, right?)
#
# Following the tradition - I am grateful to Yoshida MASATO for
# the Unicode gem.
#
# The core capabilities of String are changed by this module only
# when $KCODE is set to 'UTF8'.
#
# Strings start to properly trim, properly strip and size, and do
# many other nice things they have been supposed to do for ages.
#
# All "old" byte-oriented methods of Strings are still available
# with "byte_" prefix (i.e. "byte_reverse", "byte_slice")
class String
end

unless defined?(String::UNICODE_REWIRED) # rewire only once even if it's reloaded
String.class_eval do

   UNICODE_REWIRED = true

   class << self
     # Returns a regular expression pattern that matches the
     # passed Unicode codepoints
     def codepoints_to_pattern(array_of_codepoints)
       array_of_codepoints.collect { |e| [e].pack("U*") }.join('|')
     end
   end

   UNICODE_WHITESPACE = [
     (0x0009..0x000D).to_a,  # White_Space # Cc   [5]
     0x0020,                 # White_Space # Zs       SPACE
     0x0085,                 # White_Space # Cc
     0x00A0,                 # White_Space # Zs       NO-BREAK SPACE
     0x1680,                 # White_Space # Zs       OGHAM SPACE MARK
     0x180E,                 # White_Space # Zs       MONGOLIAN VOWEL SEPARATOR
     (0x2000..0x200A).to_a,  # White_Space # Zs  [11] EN QUAD..HAIR SPACE
     0x2028,                 # White_Space # Zl       LINE SEPARATOR
     0x2029,                 # White_Space # Zp       PARAGRAPH SEPARATOR
     0x202F,                 # White_Space # Zs       NARROW NO-BREAK SPACE
     0x205F,                 # White_Space # Zs       MEDIUM MATHEMATICAL SPACE
     0x3000,                 # White_Space # Zs       IDEOGRAPHIC SPACE
   ].flatten

   UNICODE_LEADERS_AND_TRAILERS = UNICODE_WHITESPACE + [65279] # ZERO-WIDTH NO-BREAK SPACE aka BOM

   # Borrowed from the Kconv library by Shinji KONO - (also as
   # seen on the W3C site)
   UTF8_PAT = /\A(?:
                 [\x00-\x7f]                                     |
                 [\xc2-\xdf] [\x80-\xbf]                         |
                 \xe0        [\xa0-\xbf] [\x80-\xbf]             |
                 [\xe1-\xef] [\x80-\xbf] [\x80-\xbf]             |
                 \xf0        [\x90-\xbf] [\x80-\xbf] [\x80-\xbf] |
                 [\xf1-\xf3] [\x80-\xbf] [\x80-\xbf] [\x80-\xbf] |
                 \xf4        [\x80-\x8f] [\x80-\xbf] [\x80-\xbf]
                )*\z/xn

   UNICODE_TRAILERS_PAT = /(#{codepoints_to_pattern(UNICODE_LEADERS_AND_TRAILERS)})+$/
   UNICODE_LEADERS_PAT = /^(#{codepoints_to_pattern(UNICODE_LEADERS_AND_TRAILERS)})+/

   # Performs Unicode-aware conversion to lowercase
   def u_downcase
     return downcase unless utf8_pragma?

     Unicode::downcase(Unicode::normalize_KC(self))
   end

   def u_downcase! #:nodoc:
      self.replace u_downcase
   end

   # Performs Unicode-aware conversion to UPPERCASE
   def u_upcase
     return upcase unless utf8_pragma?

     Unicode::upcase(Unicode::normalize_KC(self))
   end

   def u_upcase! #:nodoc:
      self.replace u_upcase
   end

   # Performs Unicode-aware Capitalization
   def u_capitalize
     return capitalize unless utf8_pragma?

     Unicode::capitalize(Unicode::normalize_KC(self))
   end

   def u_capitalize! #:nodoc:
      self.replace u_capitalize
   end

   # Instead of fetching bytes will fetch the string composed of
   # codepoints at the specified offsets.
   # The call with a single integer as argument will still return a byte.
   # If the string is not a valid UTF-8 sequence bytes will be returned
   def u_slice(*args)
     return slice(*args) unless utf8_pragma?

     if (args.size == 2 && args.first.is_a?(Range))
       raise TypeError, 'cannot convert Range into Integer' # Do as if we were native
     elsif (args.first.is_a?(Range) or args.size == 2)
       # normalize to KC so that all combined glyphs are spliced
       # together and ligatures split, and then...
       Unicode::normalize_KC(self).unpack("U*").send(:slice, *args).pack("U*")
     else
       slice(*args)
     end
   end

   def u_index(*args)
     if (args.first.is_a?(String) and !args.first.has_utf8_semantics?) or !utf8_pragma?
       return index(*args)
     end

     bidx = index(*args)
     return nil unless bidx
     return self.slice(0...bidx).unpack("U*").size
   end

   # Replacement for the strip routine. Will first normalize the
   # string and then remove all Unicode whitespace,
   # including line breaks and nonbreaking spaces
   def u_strip
     return strip unless utf8_pragma?

     u_lstrip.u_rstrip
   end

   # Replacement for the lstrip routine. Will first normalize the
   # string and then remove all Unicode whitespace,
   # including line breaks and nonbreaking spaces
   def u_lstrip
     return lstrip unless utf8_pragma?

     gsub(UNICODE_LEADERS_PAT, '')
   end

   # Replacement for the rstrip routine. Will first normalize the
   # string and then remove all Unicode whitespace,
   # including line breaks and nonbreaking spaces
   def u_rstrip
     return rstrip unless utf8_pragma?

     gsub(UNICODE_TRAILERS_PAT, '')
   end

   def u_lstrip! #:nodoc:
     self.replace u_lstrip
   end

   def u_rstrip! #:nodoc:
     self.replace u_rstrip
   end

   def u_strip! #:nodoc:
     self.replace u_strip
   end

   # Decomposes the string and returns the decomposed string
   def decompose
     Unicode::decompose(self)
   end

   # Normalizes the string to form KC and returns the result
   def normalize_KC
     Unicode::normalize_KC(self)
   end

   # Normalizes the string to form D and returns the result
   def normalize_D
     Unicode::normalize_D(self)
   end

   # Normalizes the string to form C and returns the result
   def normalize_C
     Unicode::normalize_C(self)
   end

   # Provides replacement for the size routine. Will first
   # normalize to KC and then return the number of codepoints
   def u_size
     return size unless utf8_pragma?

     # normalize to KC so that all combiner letters are spliced
     # together, and then...
     Unicode::normalize_KC(self).unpack("U*").size
   end

   def u_length #:nodoc:
     u_size
   end


   # Provides replacement for the reverse routine. Will first
   # normalize to KC and then reverse the resulting codepoints
   def u_reverse
     return reverse unless utf8_pragma?

     Unicode::normalize_KC(self).unpack("U*").reverse.pack("U*")
   end

   # Inserts the string at codepoint offset specified in offset.
   def u_insert(offset, fragment)
     return insert(offset, fragment) unless utf8_pragma?

     self.replace(unpack("U*").insert(offset, fragment.unpack("U*")).flatten.pack("U*"))
   end

   # Returns false or true depending on whether the string has
   # UTF-8 semantics (a String used for purely
   # byte resources is unlikely to have them).
   def has_utf8_semantics?
     UTF8_PAT.match(self)
   end

   private
     def utf8_pragma?
       ($KCODE == 'UTF8') and (self.has_utf8_semantics?)
     end
 end

 if defined?(RAILS_DEFAULT_LOGGER)
   RAILS_DEFAULT_LOGGER.warn "Standard string functions have been overloaded with " +
                             "UTF8-aware versions"
 end
end
rescue LoadError
if defined?(RAILS_DEFAULT_LOGGER)
  RAILS_DEFAULT_LOGGER.error "You don't have the Unicode library installed, most string " +
                             "operations will stay single-byte"
end
end
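
As a side note on the whitespace handling above, here is a small standalone sketch of the same idea: build a pattern from a list of codepoints and strip it from both ends of a string. Everything in it is an assumption made for illustration: `u_strip_sketch` and the constant names are invented, and the codepoint list is a shortened subset of the table in the attached code.

```ruby
# Strip Unicode whitespace from both ends using a pattern built from codepoints.
# All names here are hypothetical; the list is a shortened subset for brevity.
WHITESPACE_CODEPOINTS = [(0x0009..0x000D).to_a, 0x0020, 0x00A0, 0x3000, 0xFEFF].flatten

# Turn each codepoint into a one-character UTF-8 string and OR them together
WHITESPACE = Regexp.union(*WHITESPACE_CODEPOINTS.map { |cp| [cp].pack("U*") })

LEADING_WHITESPACE  = /\A(?:#{WHITESPACE})+/
TRAILING_WHITESPACE = /(?:#{WHITESPACE})+\z/

def u_strip_sketch(str)
  str.sub(LEADING_WHITESPACE, '').sub(TRAILING_WHITESPACE, '')
end

padded = [0x3000, 0x20].pack("U*") + "abc" + [0xA0, 0x0A].pack("U*")
puts u_strip_sketch(padded)
```

The sketch anchors with `\A` and `\z` and calls `sub` once per end, so only the very start and end of the string are trimmed; with `^` and `$` the pattern would also match at line boundaries inside a multi-line string.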