Forum: Ruby Text Munger (#76): A solution

Stefano Taschini (Guest)
on 2006-04-23 15:03
(Received via mailing list)
First the solution and then the comments:


     1  #!/usr/bin/env ruby

     2  require 'unicode'

     3  class String

     4    Diacritic = Regexp.new("[\xcc\x80-\xcd\xaf]",nil,'u')
     5    Specials = "\xc3\x86\xc3\x90\xc3\x98\xc3\x9e\xc3\x9f\xc3\xa6\xc3\xb0\xc3\xb8\xc3\xbe"
     6    Letter = Regexp.new("[A-Za-z#{Specials}](?:#{Diacritic}*)",nil,'u')
     7    Word = Regexp.new("(#{Letter})(#{Letter}+)(?=#{Letter})",nil,'u')

     8    def scramble
     9      Unicode.compose(Unicode.decompose(self).gsub(Word) {
    10        m = $~
    11        m[1] + m[2].scan(Letter).sort_by{rand}.join})
    12    end

    13  end

    14  if __FILE__ == $0
    15    while gets
    16      puts $_.chomp.scramble
    17    end
    18  end

First of all, we want scramble to handle accented characters. For
this, line 2 requires the unicode package (available as a gem), whose
normalization functions decompose an accented character into a plain
Latin letter followed by a combining diacritic.
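
As a small illustration of that decomposition, here is a sketch that uses Ruby's built-in String#unicode_normalize instead of the unicode gem. That substitution is my assumption (it needs Ruby 2.2 or later); the :nfd and :nfc forms should correspond roughly to Unicode.decompose and Unicode.compose. The helper names decompose and compose are made up for this example.

```ruby
# Assumption: Ruby >= 2.2, where String#unicode_normalize is built in.
# The post itself uses the unicode gem; this is only a stand-in.

# Split precomposed characters into base letter + combining marks (NFD).
def decompose(str)
  str.unicode_normalize(:nfd)
end

# Recombine base letter + combining marks into precomposed characters (NFC).
def compose(str)
  str.unicode_normalize(:nfc)
end

word  = "caf\u00e9"               # "café" with a precomposed e-acute
parts = decompose(word)

# Print the code points of the decomposed string, one per character.
puts parts.codepoints.map { |c| format("U+%04X", c) }.join(" ")

# Composing the decomposed string should give back the original.
puts compose(parts) == word
```

After decomposition the accent is a separate code point, which is what lets a regular expression treat "letter plus diacritics" as one unit.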

The letters in ISO Latin-1 that cannot be decomposed into a plain Latin
letter + diacritic are: Thorn, Eth, AE, stroked O, sharp S. All of them
except the last come in both a capital and a small form, which gives
9 forms that must be treated as a "special case" in line 5.
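
A quick way to check that claim, again assuming Ruby 2.2 or later and the built-in String#unicode_normalize rather than the unicode gem. SPECIALS spells out the same nine characters as the bytes in line 5, written as \u escapes; decomposable? is a hypothetical helper name.

```ruby
# Assumption: Ruby >= 2.2 (String#unicode_normalize), not the unicode gem.
# The nine characters from line 5: AE, Eth, stroked O, Thorn (capital),
# sharp S, then ae, eth, stroked o, thorn (small).
SPECIALS = "\u00C6\u00D0\u00D8\u00DE\u00DF\u00E6\u00F0\u00F8\u00FE"

# True when canonical decomposition changes the character, i.e. when it
# splits into a base letter plus one or more combining marks.
def decomposable?(ch)
  ch.unicode_normalize(:nfd) != ch
end

SPECIALS.each_char do |ch|
  puts "#{ch} decomposable: #{decomposable?(ch)}"
end
```

None of the nine change under decomposition, so a character class built from A-Za-z alone would miss them.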

The regular expression in line 6 identifies a possibly accented letter.

If Ruby had zero-width positive look-behind assertions, i.e.,
Perl's /(?<=pattern)/, the scrambled part of a word could be matched as

Word = Regexp.new("(?<=#{Letter})(#{Letter}+)(?=#{Letter})",nil,'u')

Unfortunately, Ruby doesn't have (?<=...), so we are forced to capture
the first letter with the regular expression in line 7, and we have
to remember to put it back unchanged (m[1] in line 11).
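
For readers on a newer Ruby, here is a sketch of the same capture-and-restore idea. It rests on several assumptions that are not in the code above: Ruby 2.2 or later, String#unicode_normalize in place of the unicode gem, \p{L}\p{M}* standing in for the Letter pattern, and Array#shuffle in place of sort_by{rand}. The free-standing scramble(text) method and the constants LETTER and WORD are hypothetical names.

```ruby
# Assumptions: Ruby >= 2.2; \p{L}\p{M}* (a letter followed by any combining
# marks) replaces the hand-built Letter class from the post.
LETTER = /\p{L}\p{M}*/

# Group 1: first letter (kept). Group 2: interior letters (shuffled).
# The look-ahead leaves the last letter outside the match, so it is kept too.
WORD = /(#{LETTER})((?:#{LETTER})+)(?=#{LETTER})/

def scramble(text)
  decomposed = text.unicode_normalize(:nfd)
  scrambled = decomposed.gsub(WORD) do
    m = Regexp.last_match                    # save before scan runs its own matches
    m[1] + m[2].scan(LETTER).shuffle.join    # put the first letter back unchanged
  end
  scrambled.unicode_normalize(:nfc)
end

puts scramble("According to research, letter order hardly matters")
puts scramble("cr\u00e8me br\u00fbl\u00e9e")
```

Words of three letters or fewer come back unchanged, because group 2 needs at least one letter and the look-ahead needs another one after it.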

I might have written lines 10 and 11 together as

$1 + $2.scan(Letter).sort_by{rand}.join})

but you don't have to be a C programmer to understand that using
global variables together with functions that alter them within two
sequence points is a bad idea (See
http://www.parashift.com/c++-faq-lite/misc-technic...
).
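
To make the concern concrete, here is a minimal sketch (not part of the solution) showing that a scan inside a gsub block replaces the match data that $1 and $2 read from. The method name group_one_around_scan is made up for the example.

```ruby
# Returns [$1 read before the scan, $1 read after the scan], both taken
# inside the same gsub block.
def group_one_around_scan(text)
  seen = nil
  text.gsub(/(\w)(\w+)/) do
    before = $1        # first capture of the gsub match
    $2.scan(/\w/)      # scan performs its own matches and replaces $~
    seen = [before, $1]
    ""                 # replacement text is irrelevant here
  end
  seen
end

p group_one_around_scan("hello")
```

As far as evaluation order goes, the one-line version quoted above reads both $1 and $2 before scan runs, so it should still produce the same result; saving $~ in m simply removes the dependence on that ordering.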
ts (Guest)
on 2006-04-23 15:24
(Received via mailing list)
>>>>> "S" == Stefano Taschini <taschini.mlist@gmail.com> writes:

S> I might have written lines 10 and 11 together as

S> $1 + $2.scan(Letter).sort_by{rand}.join})

S> but you don't have to be a C programmer to understand that using
S> global variables together with functions that alter them within two
S> sequence points is a bad idea

 $~ is not a global variable: it's a local and thread-local variable
 (like $_).

 $1 ($2, ...) refer to the first (second, ...) substring matched.
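
 A minimal sketch of that scoping, with made-up method names (helper_match, caller_match): a match performed inside another method does not disturb the caller's $~.

```ruby
# Performs its own match; $~ here belongs to this method's frame only.
def helper_match
  "xyz" =~ /y/
  $~[0]
end

# Reads $~ before and after calling helper_match.
def caller_match
  "hello" =~ /l+/
  before = $~[0]
  inner  = helper_match
  [before, inner, $~[0]]
end

p caller_match
```

 The first and last entries are the same string, since the match made in helper_match stays in that method's frame. A block, by contrast, runs in the frame of the method that contains it, so a match made inside a block does replace that method's $~.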


Guy Decoux
Stefano Taschini (Guest)
on 2006-04-23 17:43
(Received via mailing list)
You are absolutely right, of course.

I hope that my mistake did not distract anybody from the point I was
trying to make.

  Stefano