Text Munger (#76): A solution


#1

First the solution and then the comments:

 1  #!/usr/bin/env ruby

 2  require 'unicode'

 3  class String

 4    Diacritic = Regexp.new("[\xcc\x80-\xcd\xaf]",nil,'u')
 5    Specials = "\xc3\x86\xc3\x90\xc3\x98\xc3\x9e\xc3\x9f\xc3\xa6\xc3\xb0\xc3\xb8\xc3\xbe"
 6    Letter = Regexp.new("[A-Za-z#{Specials}]#{Diacritic}*",nil,'u')
 7    Word = Regexp.new("(#{Letter})(#{Letter}+)(?=#{Letter})",nil,'u')

 8    def scramble
 9      Unicode.compose(Unicode.decompose(self).gsub(Word) {
10        m = $~
11        m[1] + m[2].scan(Letter).sort_by{rand}.join})
12    end

13  end

14  if __FILE__ == $0
15    while gets
16      puts $_.chomp.scramble
17    end
18  end

First of all, we want scramble to handle accented characters. For
this, line 2 requires the unicode package (available as a gem), whose
normalization functions decompose an accented character into a
standard Latin letter and a diacritic.
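
To see what decomposition does, here is a minimal sketch. It assumes a Ruby recent enough (2.2 or later) to provide the built-in String#unicode_normalize, which stands in here for the unicode gem's Unicode.decompose and Unicode.compose; the sample word is only an illustration.

```ruby
# Canonical decomposition (NFD) splits an accented character into a base
# letter followed by a combining diacritic; composition (NFC) reverses it.
# String#unicode_normalize is a stand-in for the unicode gem used above.
word = "caf\u00E9"                          # "cafe" with an accented e

decomposed = word.unicode_normalize(:nfd)
recomposed = decomposed.unicode_normalize(:nfc)

p word.codepoints.length                    # 4 code points
p decomposed.codepoints.length              # 5 code points: "e" + U+0301
p decomposed.codepoints.last == 0x0301      # true
p recomposed == word                        # true
```

After decomposition the base letters can be shuffled like any others, with each diacritic travelling along with its letter.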

The letters in ISO Latin-1 that cannot be decomposed into a plain Latin
letter plus a diacritic are Thorn, Eth, AE, stroked O, and sharp S. The
corresponding 9 forms (all but sharp S exist in both capital and small
variants) must be treated as special cases, which is what line 5 does.
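
The claim can be checked with a short sketch. It again assumes the built-in String#unicode_normalize (Ruby 2.2 or later) in place of the unicode gem, and writes the nine letters as \u escapes rather than the raw bytes of line 5.

```ruby
# The nine Latin-1 letters listed in line 5, written as code points:
# AE, Eth, stroked O, Thorn (capitals), sharp s, then ae, eth, o-stroke, thorn.
specials = "\u00C6\u00D0\u00D8\u00DE\u00DF\u00E6\u00F0\u00F8\u00FE"

# None of them has a canonical decomposition, so NFD leaves each unchanged.
unchanged = specials.chars.all? { |c| c.unicode_normalize(:nfd) == c }
p unchanged                                              # true

# By contrast, an accented letter such as A with ring (U+00C5) does decompose.
p "\u00C5".unicode_normalize(:nfd).codepoints.length     # 2
```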

The regular expression in line 6 identifies a possibly accented letter.
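
As an illustration, a pattern of this kind might look as follows in later Ruby syntax. The constant names SPECIAL_LETTERS and ACCENTED_LETTER are hypothetical, and the input is assumed to be already decomposed.

```ruby
# Base letters: ASCII plus the nine special Latin-1 letters.
SPECIAL_LETTERS = "\u00C6\u00D0\u00D8\u00DE\u00DF\u00E6\u00F0\u00F8\u00FE"

# One base letter followed by any number of combining diacritics
# (U+0300..U+036F, the range spelled out as bytes in line 4).
ACCENTED_LETTER = /[A-Za-z#{SPECIAL_LETTERS}][\u0300-\u036F]*/

# A three-letter word in decomposed form: two of its letters carry
# a combining acute accent (U+0301).
decomposed = "e\u0301te\u0301"
letters = decomposed.scan(ACCENTED_LETTER)

p letters.length          # 3 letters, although the string has 5 code points
p letters.map(&:length)   # [2, 1, 2]
```

Each element of letters is a base letter together with its diacritics, so shuffling the elements never separates an accent from its letter.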

If Ruby had zero-width positive look-behind assertions, i.e.,
Perl’s /(?<=pattern)/, the inner letters of a word could be matched as

Word = Regexp.new("(?<=#{Letter})(#{Letter}+)(?=#{Letter})",nil,'u')

Unfortunately, Ruby doesn’t have (?<=...), so we are forced to capture
the first character with the regular expression in line 7, and we have
to remember to put it back unchanged (m[1] in line 11).
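
The capture-and-restore idea can be sketched in isolation. This is a simplified, ASCII-only version: PLAIN_LETTER, PLAIN_WORD and scramble_inner are hypothetical names, and Array#shuffle is used in place of sort_by{rand}.

```ruby
# ASCII-only stand-ins for the Letter and Word patterns of lines 6 and 7.
PLAIN_LETTER = /[A-Za-z]/
PLAIN_WORD   = /(#{PLAIN_LETTER})(#{PLAIN_LETTER}+)(?=#{PLAIN_LETTER})/

def scramble_inner(text)
  text.gsub(PLAIN_WORD) do
    m = Regexp.last_match   # the same object as $~
    # Group 1 is the first letter, put back untouched; group 2 is the
    # middle, shuffled. The lookahead keeps the last letter out of the match.
    m[1] + m[2].chars.shuffle.join
  end
end

puts scramble_inner("scramble these words")
```

Words of one or two letters never match, since the pattern needs a first letter, at least one middle letter, and a following letter.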

I might have written lines 10 and 11 together as

$1 + $2.scan(Letter).sort_by{rand}.join})

but you don’t have to be a C programmer to understand that using
global variables together with functions that alter them within two
sequence points is a bad idea (see
http://www.parashift.com/c++-faq-lite/misc-technical-issues.html#faq-39.16).
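
The underlying hazard can be shown on its own: scan performs its own matches and so replaces the match data that $1 and $2 read from. A minimal sketch, with a hypothetical method name and a toy pattern:

```ruby
def match_vars_around_scan
  "hello" =~ /(h)(ello)/
  saved  = $~          # keep the MatchData object, as line 10 does
  before = $1          # "h"
  $2.scan(/l/)         # scan runs its own matches and overwrites $~
  after = $1           # nil: the last match of /l/ has no group 1
  [before, after, saved[1]]
end

p match_vars_around_scan   # ["h", nil, "h"]
```

The saved MatchData still answers saved[1] correctly after scan has run, which is why holding on to it is the more robust choice.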


#2

“S” == Stefano T. writes:

S> I might have written lines 10 and 11 together as

S> $1 + $2.scan(Letter).sort_by{rand}.join})

S> but you don’t have to be a C programmer to understand that using
S> global variables together with functions that alter them within two
S> sequence points is a bad idea

$~ is not a global variable: it’s a local and thread-local variable
(like $_).

$1 ($2, …) refer to the first (second, …) substring matched, i.e.
they are read from $~.
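
A small sketch of that locality, using hypothetical method names: a match made inside a called method does not disturb the caller's $~.

```ruby
def inner_match
  "abc" =~ /(b)/        # sets $~ for this method call only
  $1
end

def outer_match
  "xyz" =~ /(y)/
  inner = inner_match   # the callee's match does not leak into this frame
  [inner, $1]
end

p outer_match           # ["b", "y"]
```

A block, however, shares the frame of the method it is written in, so a scan called inside the gsub block does change the $~ that the block itself sees.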

Guy Decoux


#3

You are absolutely right, of course.

I hope that my mistake did not distract anybody from the point I was
trying to make.

Stefano