Forum: Ruby on Rails How do I get substring of utf-8 string?

Dae San H. (Guest)
on 2006-03-21 04:51
(Received via mailing list)
I'm trying to get a substring of a UTF-8 encoded string (say, the first
50 characters).  String#[0..49] gives me the first 50 bytes, not the
first 50 characters.
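
To make the byte/character gap concrete, here is a minimal sketch that relies only on String#unpack and Array#pack with the "U*" (UTF-8 codepoint) directive. The helper name first_chars is made up for illustration, and the sketch counts codepoints, so combining sequences would still be split:

```ruby
# Hypothetical helper: the first n codepoints of a UTF-8 string.
def first_chars(str, n)
  str.unpack("U*")[0, n].pack("U*")
end

# A six-letter Cyrillic word, built from codepoints so the example
# does not depend on the encoding of the source file.
word = [0x43F, 0x440, 0x438, 0x432, 0x435, 0x442].pack("U*")

byte_count = word.unpack("C*").size   # number of bytes
char_count = word.unpack("U*").size   # number of codepoints
head       = first_chars(word, 3)     # first three letters, not three bytes
```

Each Cyrillic letter takes two bytes in UTF-8, which is why the two counts differ.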

I know there is the jcode library, but it only lets you count the number
of characters in a UTF-8 string.

The unicode gem doesn't seem to help much.  The unicode_hacks gem seems
to solve the problem, but it also changes the methods of the String
class directly, so it may confuse Rails, which expects String#[]
to give back bytes, not characters.

Can somebody point out which route I should take?
Should I implement the substring methods myself?  Hasn't someone
already solved this problem?

thanks,

daesan
Alex Zhukov (Guest)
on 2006-03-22 07:22
(Received via mailing list)
If you only need a substring you can fake it with the jcode library.
Use something like this:

$KCODE = 'u'
require 'jcode'

class String
  # Returns characters a..b (1-based, inclusive) of a UTF-8 string.
  def usubstr(a, b)
    i = 0
    buff = ''
    each_char do |c|
      i += 1
      buff << c if i >= a
      return buff if i == b
    end
    buff  # the string had fewer than b characters
  end
end

bla = "put here some unicode string"
puts bla.usubstr(6, 10)

It works with Cyrillic UTF-8 text and should work for other languages too.
I hope this helps.
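
For comparison, the same 1-based, inclusive lookup can be sketched without jcode by splitting the string into characters with a UTF-8 regular expression. The name usubstr_scan is hypothetical, and the sketch assumes a >= 1:

```ruby
# Hypothetical jcode-free variant: characters a..b, 1-based and inclusive.
def usubstr_scan(str, a, b)
  chars = str.scan(/./mu)                # one array element per UTF-8 character
  (chars[(a - 1)..(b - 1)] || []).join   # empty string when a is past the end
end

text  = "put here some unicode string"
piece = usubstr_scan(text, 6, 10)        # => "ere s"
```

This builds the whole character array first, so for very long strings the each_char loop above, which stops at position b, does less work.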

--
best regards,
Alex Zhukov
removed_email_address@domain.invalid
Dae San H. (Guest)
on 2006-03-22 08:29
(Received via mailing list)
Awesome!  I must have overlooked each_char method in jcode library.

In the meantime, I modified unicode_hacks by Julik so that it does
not overload existing String methods.  I attached the modified source
code at the end of this email.  The UTF-8 compatible equivalents
are prefixed with 'u_'.  For example, the length of a UTF-8 string is
returned by the 'u_length' method, a substring of a UTF-8 string is
returned by 'u_slice', etc.  This hack requires the unicode gem.
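
The naming idea can be sketched in a few lines. This is not the attached code: the module name UPrefixed is hypothetical, and the sketch skips the normalization step that the unicode gem provides, so it works on raw codepoints only:

```ruby
# Hypothetical module; the method names mirror the "u_" convention.
module UPrefixed
  # Number of codepoints rather than bytes.
  def u_length
    unpack("U*").size
  end

  # Codepoint-based slice; accepts the same arguments as Array#slice.
  def u_slice(*args)
    picked = unpack("U*").slice(*args)
    picked && [picked].flatten.pack("U*")   # nil when out of range
  end
end

class String
  include UPrefixed   # adds new methods only; String#[] and String#length stay as they are
end

cafe = [0x63, 0x61, 0x66, 0xE9].pack("U*")   # "caf" followed by U+00E9
```

Because only new names are added, code that relies on the built-in String methods keeps seeing their original behaviour.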

Thank you Alex for the tip.  I would use your tip for simpler needs. :-)

daesan

ps: original unicode_hacks is available at
http://julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/



# This is a modified version of unicode_hacks so that regular String
# methods are not overloaded.  Instead, UTF-8 compatible equivalent
# methods are prefixed with "u_".

begin
   require 'unicode'
   # Adds Unicode-aware methods to the String class. This doesn't solve all
   # of the problems, but it does solve some. It works in a UTF-8 context
   # only, so we step aside if $KCODE is not UTF-8 (Japanese people prefer
   # JIS, right?)
   #
   # Following the tradition - I am grateful to Yoshida MASATO for the
   # Unicode gem.
   #
   # The "u_" methods are Unicode-aware only when $KCODE is set to 'UTF8';
   # otherwise they fall back to the regular byte-oriented String methods.
   # With them, strings properly trim, strip and size, and do many other
   # nice things they have been supposed to do for ages.
   # The regular String methods are left untouched and stay byte-oriented.
   class String
   end

   # rewire only once even if it's reloaded
   unless defined?(String::UNICODE_REWIRED)
     String.class_eval do

       UNICODE_REWIRED = true

       class <<self
         # Returns a regular expression pattern that matches the passed
         # Unicode codepoints
         def codepoints_to_pattern(array_of_codepoints)
           array_of_codepoints.collect{ |e| [e].pack "U*" }.join('|')
         end
       end

       UNICODE_WHITESPACE = [
         (0x0009..0x000D).to_a, # White_Space # Cc   [5] <control-0009>..<control-000D>
         0x0020,          # White_Space # Zs       SPACE
         0x0085,          # White_Space # Cc       <control-0085>
         0x00A0,          # White_Space # Zs       NO-BREAK SPACE
         0x1680,          # White_Space # Zs       OGHAM SPACE MARK
         0x180E,          # White_Space # Zs       MONGOLIAN VOWEL SEPARATOR
         (0x2000..0x200A).to_a, # White_Space # Zs  [11] EN QUAD..HAIR SPACE
         0x2028,          # White_Space # Zl       LINE SEPARATOR
         0x2029,          # White_Space # Zp       PARAGRAPH SEPARATOR
         0x202F,          # White_Space # Zs       NARROW NO-BREAK SPACE
         0x205F,          # White_Space # Zs       MEDIUM MATHEMATICAL SPACE
         0x3000,          # White_Space # Zs       IDEOGRAPHIC SPACE
       ].flatten

       # 65279 (U+FEFF) is ZERO-WIDTH NO-BREAK SPACE aka BOM
       UNICODE_LEADERS_AND_TRAILERS = UNICODE_WHITESPACE + [65279]

       # Borrowed from the Kconv library by Shinji KONO
       # (also as seen on the W3C site)
       UTF8_PAT = /\A(?:
                     [\x00-\x7f]                                     |
                     [\xc2-\xdf] [\x80-\xbf]                         |
                     \xe0        [\xa0-\xbf] [\x80-\xbf]             |
                     [\xe1-\xef] [\x80-\xbf] [\x80-\xbf]             |
                     \xf0        [\x90-\xbf] [\x80-\xbf] [\x80-\xbf] |
                     [\xf1-\xf3] [\x80-\xbf] [\x80-\xbf] [\x80-\xbf] |
                     \xf4        [\x80-\x8f] [\x80-\xbf] [\x80-\xbf]
                    )*\z/xn


       # Anchored to the start and end of the whole string, like
       # String#lstrip and String#rstrip
       UNICODE_TRAILERS_PAT = /(#{codepoints_to_pattern(UNICODE_LEADERS_AND_TRAILERS)})+\z/
       UNICODE_LEADERS_PAT = /\A(#{codepoints_to_pattern(UNICODE_LEADERS_AND_TRAILERS)})+/

       # Performs Unicode-aware conversion to lowercase
       def u_downcase
         return downcase unless utf8_pragma?

         Unicode::downcase(Unicode::normalize_KC(self))
       end

       def u_downcase! #:nodoc:
          self.replace u_downcase
       end

       # Performs Unicode-aware conversion to UPPERCASE
       def u_upcase
         return upcase unless utf8_pragma?

         Unicode::upcase(Unicode::normalize_KC(self))
       end

       def u_upcase! #:nodoc:
          self.replace u_upcase
       end

       # Performs Unicode-aware capitalization
       def u_capitalize
         return capitalize unless utf8_pragma?

         Unicode::capitalize(Unicode::normalize_KC(self))
       end

       def u_capitalize! #:nodoc:
          self.replace u_capitalize
       end

       # Instead of fetching bytes, fetches the string composed of the
       # codepoints at the specified offsets.
       # A call with a single integer argument still returns a byte.
       # If the string is not a valid UTF-8 sequence, bytes are returned.
       def u_slice(*args)
         return slice(*args) unless utf8_pragma?

         if (args.size == 2 && args.first.is_a?(Range))
           raise TypeError, 'cannot convert Range into Integer' # Do as if we were native
         elsif (args.first.is_a?(Range) or args.size == 2)
           # normalize to KC so that all combined glyphs are spliced together
           # and ligatures split, and then slice by codepoint
           codepoints = Unicode::normalize_KC(self).unpack("U*").send(:slice, *args)
           codepoints && codepoints.pack("U*") # nil when out of range, like String#slice
         else
           slice(*args)
         end
       end

       def u_index(*args)
         if (args.first.is_a?(String) and !args.first.has_utf8_semantics?) or !utf8_pragma?
           return index(*args)
         end

         bidx = index(*args)
         return nil unless bidx
         return self.slice(0...bidx).unpack("U*").size
       end

       # Replacement for the strip routine. Removes all leading and trailing
       # Unicode whitespace, including line breaks and nonbreaking spaces
       def u_strip
         return strip unless utf8_pragma?

         u_lstrip.u_rstrip
       end

       # Replacement for the lstrip routine. Removes all leading Unicode
       # whitespace, including line breaks and nonbreaking spaces
       def u_lstrip
         return lstrip unless utf8_pragma?

         gsub(UNICODE_LEADERS_PAT, '')
       end

       # Replacement for the rstrip routine. Removes all trailing Unicode
       # whitespace, including line breaks and nonbreaking spaces
       def u_rstrip
         return rstrip unless utf8_pragma?

         gsub(UNICODE_TRAILERS_PAT, '')
       end

       def u_lstrip! #:nodoc:
         self.replace u_lstrip
       end

       def u_rstrip! #:nodoc:
         self.replace u_rstrip
       end

       def u_strip! #:nodoc:
         self.replace u_strip
       end

       # Decomposes the string and returns the decomposed string
       def decompose
         Unicode::decompose(self)
       end

       # Normalizes the string to form KC and returns the result
       def normalize_KC
         Unicode::normalize_KC(self)
       end

       # Normalizes the string to form D and returns the result
       def normalize_D
         Unicode::normalize_D(self)
       end

       # Normalizes the string to form C and returns the result
       def normalize_C
         Unicode::normalize_C(self)
       end

       # Provides replacement for the size routine. Will first normalize
       # to KC and then return the number of codepoints
       def u_size
         return size unless utf8_pragma?

         # normalize to KC so that all combiner letters are spliced
         # together, and then count the codepoints
         Unicode::normalize_KC(self).unpack("U*").size
       end

       def u_length #:nodoc:
         u_size
       end


       # Provides replacement for the reverse routine. Will first normalize
       # to KC and then reverse the resulting codepoints
       def u_reverse
         return reverse unless utf8_pragma?

         Unicode::normalize_KC(self).unpack("U*").reverse.pack("U*")
       end

       # Inserts the string at codepoint offset specified in offset.
       def u_insert(offset, fragment)
         return insert(offset, fragment) unless utf8_pragma?

         self.replace(unpack("U*").insert(offset, fragment.unpack("U*")).flatten.pack("U*"))
       end

       # Returns false or true depending on whether the string has UTF-8
       # semantics (a String used for purely byte resources is unlikely
       # to have them).
       def has_utf8_semantics?
         UTF8_PAT.match(self)
       end

       private
         def utf8_pragma?
           ($KCODE == 'UTF8') and (self.has_utf8_semantics?)
         end
     end

     if defined?(RAILS_DEFAULT_LOGGER)
       RAILS_DEFAULT_LOGGER.warn "UTF8-aware String methods have been " +
                                 "added with the u_ prefix"
     end
   end
rescue LoadError
   if defined?(RAILS_DEFAULT_LOGGER)
     RAILS_DEFAULT_LOGGER.error "You don't have the Unicode library installed, " +
                                "most string operations will stay single-byte"
   end
end
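
The stripping and the "is this really UTF-8?" check from the code above can be illustrated without the unicode gem. The following is a simplified sketch, not a replacement for the attached code: the names STRIP_CODEPOINTS, decodes_as_utf8? and u_strip_sketch are hypothetical, the codepoint list is only a subset of the one above, and data that does not decode is returned unchanged instead of being handed to the byte-oriented strip:

```ruby
# Subset of the whitespace codepoints listed above, plus the BOM (U+FEFF).
STRIP_CODEPOINTS = [0x0009, 0x000A, 0x000D, 0x0020, 0x00A0, 0x3000, 0xFEFF]

# Hypothetical helper: true when the bytes decode as UTF-8 codepoints.
def decodes_as_utf8?(str)
  str.unpack("U*")
  true
rescue ArgumentError   # String#unpack raises this on malformed UTF-8
  false
end

# Hypothetical helper: drops leading and trailing codepoints found in
# STRIP_CODEPOINTS; input that does not decode is returned as-is.
def u_strip_sketch(str)
  return str unless decodes_as_utf8?(str)

  cps = str.unpack("U*")
  cps.shift while STRIP_CODEPOINTS.include?(cps.first)
  cps.pop   while STRIP_CODEPOINTS.include?(cps.last)
  cps.pack("U*")
end

# NO-BREAK SPACE, IDEOGRAPHIC SPACE, "hi", SPACE, BOM
padded   = [0x00A0, 0x3000, 0x68, 0x69, 0x0020, 0xFEFF].pack("U*")
stripped = u_strip_sketch(padded)
```

Working on an array of codepoints avoids building a regular expression out of raw characters, at the cost of unpacking the whole string.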