Text Munger (#76)

bbazzarrakk · April 22, 2006, 4:33pm

On Apr 22, 2006, at 5:28 AM, Ross B. wrote:

$ ./munger.rb test.txt
Attehcaed is my résmué.
$ ./munger.rb test.txt
Atthaceed is my réumsé.
$ ./munger.rb test.txt
Attacheed is my rémsué.
$ ./munger.rb test.txt
Attcaehed is my rémusé.
$ ./munger.rb test.txt
Attecahed is my rémsué.

Why are the e’s not moving?

James Edward G. II

bbazzarrakk · April 22, 2006, 1:09pm

On Sat, 2006-04-22 at 19:28 +0900, Ross B. wrote:

numbers have to be left alone? Or does your solution rearrange
both numbers and letters?

I’m still waiting for someone to show off their solution properly
handling the trivial (multi-byte) example I showed earlier…

$ ./munger.rb test.txt
Attehcaed is my résmué.

(Sorry for the noise) - the test text used there doesn’t go too well
with my solution, which limits how much of a word is rearranged. This is
a better example:

[rosco@jukebox text-munger-76]$ ./munger.rb test2.txt
La viiosn euroénepne strégiatque
[rosco@jukebox text-munger-76]$ ./munger.rb test2.txt
La vioisn eurpeénone strgaéitque
[rosco@jukebox text-munger-76]$ ./munger.rb test2.txt
La vioisn eurenopéne strtagiéque
[rosco@jukebox text-munger-76]$ ./munger.rb test2.txt
La vision eurpeéonne strégatque
[rosco@jukebox text-munger-76]$ ./munger.rb test2.txt
La vision euréonpene strgtéiaque
[rosco@jukebox text-munger-76]$ ./munger.rb test2.txt
La visoin eureéonpne striagtéque

(from La vision européenne stratégique)

bbazzarrakk · April 22, 2006, 5:53pm

On Sat, 2006-04-22 at 23:31 +0900, James Edward G. II wrote:

$ ./munger.rb test.txt
Attecahed is my rémsué.

Why are the e’s not moving?

My solution scrambles only part of the inside of the word, depending on
the word length, and favours keeping more from the start of the word. So
with this example it’s taking the six letter ‘rémsué’ and deciding to
scramble 3 letters, ‘msu’ (remains after we take two from the start, one
from the end). So with that input, the e’s wouldn’t be touched.

I just didn’t think about that before I posted - the second output I
posted showed some longer accented words with the e’s moving around
properly :).
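The partial scrambling described in this post can be sketched roughly as follows. This is not the poster's code: the method name partial_munge is hypothetical, and the proportions (keep about a quarter of the word at the start, scramble about half, leave the rest) are an assumption generalised from the six-letter example above, where two letters are kept at the start, three are scrambled and one is kept at the end.

```ruby
# Scramble only the middle stretch of a word.
# Assumed proportions: keep round(n/4) characters at the start,
# scramble the next round(n/2), and leave the remainder untouched.
def partial_munge(word)
  chars = word.chars
  n = chars.length
  return word if n < 4               # too short to have a scrambleable middle

  head_len = (n / 4.0).round
  mid_len  = (n / 2.0).round

  head = chars[0, head_len]
  mid  = chars[head_len, mid_len]
  tail = chars.drop(head_len + mid_len)

  (head + mid.shuffle + tail).join
end

puts partial_munge("résumé")
```

With a six-letter word only the three characters after the first two are shuffled, so both accented letters in "résumé" stay where they are, which matches the behaviour James asked about.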

bbazzarrakk · April 22, 2006, 5:59pm

Ross B. wrote:

[rosco@jukebox text-munger-76]$ ./munger.rb test2.txt
La vision euréonpene strgtéiaque
[rosco@jukebox text-munger-76]$ ./munger.rb test2.txt
La visoin eureéonpne striagtéque

(from La vision européenne stratégique)

The pattern, “eu???ne str???que” is constant in your results.

–

Ray

bbazzarrakk · April 22, 2006, 6:08pm

On Sun, 2006-04-23 at 00:56 +0900, Ray B. wrote:

[rosco@jukebox text-munger-76]$ ./munger.rb test2.txt

Well, that’s a question of more random, or more readable. A good point
was raised about longer words becoming unrecognisable when just randomly
scrambled…

OTOH If we’re doing random scrambling, leaving only first and last
letter I think I can get back down to two lines…

What’s everyone else doing?

bbazzarrakk · April 22, 2006, 6:25pm

unknown wrote:

But \w includes underscore. I think punctuation is supposed to remain
unscrambled, isn’t it? And numbers likewise.

Point taken.

perl -pe 's/(?<=[a-z])[a-z]+(?=[a-z])/join"",sort{rand>0.5}split"",$&/egi'

(69 chars)

bbazzarrakk · April 22, 2006, 6:40pm

Use negated word boundaries (\B) instead of the lookarounds to lose a
few characters.

bbazzarrakk · April 22, 2006, 6:24pm

"text".gsub(/\B(\w{2,})\B/) { |s| s.length.times { |i| r = rand(s.length); s[i], s[r] = s[r], s[i] }; s }

It is one line, but it does have a couple of semi-colons.

bbazzarrakk · April 22, 2006, 7:35pm

Alex Barrett wrote:

Use negated word boundaries (\B) instead of the lookarounds to lose a
few characters.

Thanks. And character twiddling rather than sort:

perl -pe 's/\B([a-z])([a-z])\B/rand>.5?$1.$2:$2.$1/egi'
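For readers following along in Ruby, the pair-swapping idea in the Perl one-liner above can be transliterated roughly like this. It is a sketch, not something posted in the thread; the method name twiddle is hypothetical, and like the Perl it only looks at ASCII letters.

```ruby
# Walk through each word's interior two letters at a time and, with
# probability one half, swap the pair. First and last letters are never
# part of a pair because \B refuses to match at a word edge.
def twiddle(text)
  text.gsub(/\B([a-z])([a-z])\B/i) { rand > 0.5 ? $1 + $2 : $2 + $1 }
end

puts twiddle("Scramble each word's center")
```

Because pairs never overlap, a letter moves at most one position, so the result is much less scrambled than a full shuffle of the interior.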

bbazzarrakk · April 22, 2006, 7:42pm

On 22-Apr-06, at 1:35 PM, PerlyGates wrote:

Alex Barrett wrote:

Use negated word boundaries (\B) instead of the lookarounds to lose a
few characters.

Doesn’t \B bring back the problems with _ ?

Mike

Thanks. And character twiddling rather than sort:
perl -pe 's/\B([a-z])([a-z])\B/rand>.5?$1.$2:$2.$1/egi'

–

Mike S.
http://www.stok.ca/~mike/

The “`Stok’ disclaimers” apply.
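On the question of \B and underscores raised above, a small Ruby check can make the difference concrete. This is only an illustrative sketch; the sample string is made up, and the two patterns are the [a-z] forms discussed earlier in the thread.

```ruby
sample = "foo_bar"

# \B only asks whether a position is NOT a word boundary, and "_" counts
# as a word character, so the letters touching the underscore look like
# the inside of one long word.
with_b = sample.scan(/\B[a-z]+\B/)

# Lookarounds that name the letter class explicitly treat "_" as an edge.
with_lookarounds = sample.scan(/(?<=[a-z])[a-z]+(?=[a-z])/)

p with_b            # => ["oo", "ba"]
p with_lookarounds  # => ["o", "a"]
```

So with \B the last letter of "foo" and the first letter of "bar" become eligible for scrambling, while the lookaround version keeps them fixed.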

bbazzarrakk · April 23, 2006, 12:44pm

Himadri C. wrote:

In order to see any performance benefit from the 3rd method I had to make up
some horrifically long words which aren’t terribly likely in the English
language (maybe I should have tried German :)).

Try Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz

bbazzarrakk · April 23, 2006, 11:48am

Here’s my solution.

Usage : scramble.rb <text_file>

I made 3 attempts.

print ARGF.read.gsub!(/\B[a-z]+\B/) {|x| x.split('').sort_by{rand}.join}

Here I use gsub to find all the words, split to convert the strings into
arrays, sort_by{rand} to scramble the arrays, and finally join to convert
each array back to a string.
I’m assuming that words don’t have upper case letters in the middle, so
that I can get away with [a-z].

print ARGF.read.gsub!(/\B[a-z]+\B/) {|x| x.unpack('c*').sort_by{rand}.pack('c*')}

I found this method of converting strings to and from arrays to be
faster. I’m not sure what the standard idiom for doing this is. But, I’m
sure I’ll learn after seeing other people’s solutions.

3. If sort_by{rand} does what I think it does, it probably has a bias
when the rand function returns the same value. So, this is my third
implementation:

print ARGF.read.gsub!(/\B[a-z]+\B/) {|x|
  x.length.times {|i|
    j = rand(i+1)
    x[j], x[i] = x[i], x[j]
  }
  x
}

Basically, this is an implementation of scrambling that uses swaps. I
remember this method for scrambling from way back, but I can’t seem to
find a good reference for it at the moment.
I also figured that this method would be faster since it is linear,
while the sorts are n log(n) (n = length of the word).

To my surprise, I found this method to actually be slower for any normal
text. One possible explanation is that when words are relatively short
you don’t gain much from the n vs. n log(n) difference, and you lose
because while this method always has n swaps, sorting may have fewer.

In order to see any performance benefit from the 3rd method I had to
make up some horrifically long words which aren’t terribly likely in the
English language (maybe I should have tried German :)).

Himadri

bbazzarrakk · April 23, 2006, 2:53pm

Seems like forty-eight hours are up now, so here are my solutions for
this quiz. It was good to get a quick one. I wrote a simple random
munging solution, and a slightly longer one that munges only part of the
words. I went for a different way on the latter one, just to play with
regexps a bit, but I expect its performance isn’t great…

Both support unicode properly, as long as the -Ku stays on the ruby
command line. I could have used the u modifier instead but wanted to
save on the repetition.

========= random munging

#!/usr/local/bin/ruby -Ku
$stdout << ARGF.read.gsub(/\B((?![\d_])\w{2,})\B/) do |w|
  $&.split(//).sort_by { rand }
end

(easily compresses to:)

#!/usr/local/bin/ruby -npKu
gsub(/\B((?![\d_])\w){2,}\B/){$&.split(//).sort_by{rand}}

========= slightly-less-random munging

#!/usr/local/bin/ruby -Ku
RX = Hash.new{|h,k|h[k]=/(.{#{(k/4.0).round}})#{'(.)'*(k/2.0).round}(.)/}
$stdout << ARGF.read.gsub(/((?![\d_])\w){4,}/) do |w|
  (caps = RX[w.split(//u).length].match(w).captures).first +
    caps[1..-2].sort_by { rand }.to_s + caps.last
end
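The scripts above rely on the -Ku switch for multi-byte handling. As a hedged sketch of the same random munging on a Ruby where strings carry their own encoding, one option is to match Unicode letters directly with \p{L}. The constant INTERIOR and the method munge are hypothetical names, and the sketch assumes UTF-8 input with precomposed accented letters.

```ruby
# Interior of a word: two or more letters with a letter on either side.
# \p{L} matches letters only, so digits and underscores never count as
# part of a word and are left alone.
INTERIOR = /(?<=\p{L})\p{L}{2,}(?=\p{L})/

def munge(text)
  text.gsub(INTERIOR) { |middle| middle.chars.shuffle.join }
end

puts munge("La vision européenne stratégique")
```

Shuffling characters rather than bytes is what keeps an "é" in one piece; the first and last letter of each word stay put because the lookbehind and lookahead consume nothing.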

bbazzarrakk · April 23, 2006, 4:53pm

On Apr 23, 2006, at 4:45 AM, Himadri C. wrote:

Here’s my solution.

Just a gentle reminder here folks, please remember that Ruby Q. has
a 48 hour no-spoiler period before solutions should be posted. I’m
not a big stickler on this, but I know some people do like the time.
It’s super easy to figure in your head, just look at the quiz date
and time and bump it forward two days. That’s when it’s OK to submit.

For the record, I do consider posting solutions in other languages
(like Perl) a spoiler.

Thank you.

James Edward G. II

bbazzarrakk · April 23, 2006, 2:59pm

Is the performance better if you skip swaps when i == j ?

Also, for a swap method to give random results doesn’t one need to
swap from a random position in the array which has not been passed
through yet? (See the Wikipedia article on shuffling, noting
Fisher-Yates shuffling.)

-a
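One way to examine the uniformity question is to enumerate every sequence of swap targets for a tiny array and count which orderings come out. The sketch below is an illustration, not code from the thread; the helper names apply_swaps and tally_outcomes are hypothetical. It compares drawing the target with rand(i + 1), as in the swap loop posted earlier, against drawing it with rand(n) from the whole array at every step.

```ruby
# Apply a fixed sequence of swap targets to a copy of `items`:
# at step i, swap position i with position targets[i].
def apply_swaps(items, targets)
  result = items.dup
  targets.each_with_index do |j, i|
    result[i], result[j] = result[j], result[i]
  end
  result
end

# ranges[i] lists every value the random index could take at step i.
# Each combination is equally likely, so counting outcomes over all
# combinations gives the exact distribution.
def tally_outcomes(items, ranges)
  first, *rest = ranges
  counts = Hash.new(0)
  first.product(*rest).each do |targets|
    counts[apply_swaps(items, targets).join] += 1
  end
  counts
end

letters = %w[a b c]
n = letters.length

# j = rand(i + 1): target drawn from positions 0..i
prefix_counts = tally_outcomes(letters, (0...n).map { |i| (0..i).to_a })
# r = rand(n): target drawn from the whole array at every step
naive_counts = tally_outcomes(letters, (0...n).map { (0...n).to_a })

puts "rand(i + 1): #{prefix_counts.sort.to_h}"
puts "rand(n):     #{naive_counts.sort.to_h}"
```

For three letters the rand(i + 1) loop has 6 equally likely swap sequences and reaches each of the 6 orderings exactly once, so it is an unbiased Fisher-Yates variant. The rand(n) loop has 27 sequences, which cannot be divided evenly among 6 orderings, so some orderings must be more likely than others.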

bbazzarrakk · April 23, 2006, 5:34pm

Your task for this quiz, then, is to take a text as input and output the
text in this fashion. Scramble each word’s center (leaving the first and
last letters of each word intact). Whitespace, punctuation, numbers –
anything that isn’t a word – should also remain unchanged.

solution one

 harp:~ > cat a.rb
 class String
   def scramble on = ''
     re = %r/( (?:\b \w \w{2,} \w \b) | \s+ | . )/iox
     scan(re){|words| on << words.first.scrambled}
     on
   end
   def scrambled
     self[1..-2] = self[1..-2].split(%r//).sort_by{rand}.to_s if size >= 4
     self
   end
 end
 ARGF.read.scramble STDOUT

 harp:~ > ruby a.rb < a.rb
 cslas Srntig
   def srbcamle on = ''
     re = %r/( (?:\b \w \w{2,} \w \b) | \s+ | . )/iox
     sacn(re){|wrods| on << wodrs.fisrt.salbercmd}
     on
   end
   def sclmaebrd
     slef[1..-2] = slef[1..-2].split(%r//).s_botry{rnad}.t_os if size >= 4
     slef
   end
 end
 ARGF.read.srcalbme SUDOTT

solution two (golfing)

 harp:~ > ruby -npae 'gsub!(/\b(\w)(\w{2,})(\w)\b/){_=$3;[$1,$2.split(//).sort_by{rand},_]}' a.rb
 calss Snrtig
   def sbcarlme on = ''
     re = %r/( (?:\b \w \w{2,} \w \b) | \s+ | . )/iox
     sacn(re){|wdros| on << wdros.first.slramcebd}
     on
   end
   def smlcbaerd
     self[1..-2] = slef[1..-2].siplt(%r//).srbt_oy{rand}.t_os if size >= 4
     self
   end
 end
 ARGF.read.smclarbe SUTODT

 harp:~ > wc -c
 gsub!(/\b(\w)(\w{2,})(\w)\b/){_=$3;[$1,$2.split(//).sort_by{rand},_]}
      70

thanks for the fun quiz!

-a

bbazzarrakk · April 23, 2006, 6:45pm

On Apr 23, 2006, at 4:45 AM, Himadri C. wrote:

to find
a good reference for it at the moment.

James Edward G. II

bbazzarrakk · April 23, 2006, 6:51pm

On 4/23/06, James Edward G. II wrote:

For the record, I do consider posting solutions in other languages
(like Perl) a spoiler.

How about COBOL or FORTRAN? Or is that a spoiler for a different
reason?

bbazzarrakk · April 23, 2006, 7:11pm

Well, the simple regex based one-liner seems to have gotten
plenty of airplay, so I decided to expand mine in an attempt
to improve the readability of the munged text. For example:

A naive munging:

Noumeurs idavilundis have dneoatrstmed the ieneascrd
dfifclutiy oinrcrucg wehn leihngter wdors are slipmy
reiondmazd. Raionizdnmg wiihtn hntyahoeipn buadoreins
offers smoe irnmoeemvpt.

A slightly more readable munging:

Nuemruos inididvuals hvae dnometrtsaed the insecraed
dfiifulcty ocucrinrg when lghetnier wrods are smiply
rnamodized. Randomzinig wihtin hyphenatoin bonduiares
offres some imvoepremnt.

Original text:

Numerous individuals have demonstrated the increased
difficulty occurring when lengthier words are simply
randomized. Randomizing within hyphenation boundaries
offers some improvement.

The hyphen-boundary randomizer:

require 'text/hyphen'
hyp = Text::Hyphen.new :left => 1, :right => 1
text = ARGF.read
text.gsub!(/[^\W\d_]+/) do |m|
  hyp.visualize(m).split(/(^\w|\w$)|-/).map{|t|
    t.split(//).sort_by{rand}.join
  }.join
end
puts text
__END__

cheers,
andrew
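To see what the split in the hyphen-boundary randomizer does without installing a hyphenation library, here is a small sketch that takes an already hyphenated word as input. The method name munge_hyphenated is hypothetical, and the break points in "hy-phen-a-tion" are hand-written for illustration rather than produced by any library.

```ruby
# `hyphenated` stands in for a hyphenation library's output:
# the word with "-" marking each break point.
def munge_hyphenated(hyphenated)
  # The capture group peels off the first and last letter as their own
  # one-character chunks; the bare "-" alternative splits the remainder
  # at the hyphenation points.
  chunks = hyphenated.split(/(^\w|\w$)|-/)
  # Shuffle within each chunk only; one-character chunks cannot change.
  chunks.map { |chunk| chunk.chars.shuffle.join }.join
end

puts munge_hyphenated("hy-phen-a-tion")
```

Here the chunks are "h", "y", "phen", "a", "tio" and "n", so letters only move inside "phen" and "tio", which is why the result stays closer to the original than a full interior shuffle.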

bbazzarrakk · April 23, 2006, 7:53pm

On Apr 21, 2006, at 2:34 PM, Ruby Q. wrote:

The three rules of Ruby Q.:
[…]
Suggestion: A [QUIZ] in the subject of emails about the problem helps
everyone on Ruby T. follow the discussion.

Can you also suggest that people reply to the original thread instead
of making new ones when they send their solutions? Right now there is:

Original ruby quiz thread
[QUIZ][SOLUTION] …
[QUIZ] … A solution
[QUIZ] … A simplistic solution
[SOLUTION] …

– Daniel