Forum: Ruby Text Munger (#76)

James G. (Guest)
on 2006-04-27 16:00
(Received via mailing list)
Obviously, this is not an overly difficult problem.  Here's a small but
pretty easy-to-follow solution by Gordon T.:

	class String

	  def munge
	    split(/\b/).munge_each.join
	  end

	end

	class Array

	  def munge_each
	    map { |word| word.split(//).munge_word }
	  end

	  def munge_word
	    first, last, middle = shift, pop, scramble
	    "#{first}#{middle}#{last}"
	  end

	  def scramble
	    sort_by{rand}
	  end

	end

	if __FILE__ == $PROGRAM_NAME

	  begin
	    puts File.read(ARGV[0]).munge
	  rescue
	    puts "Usage:  text_munge.rb file"
	  end

	end

The flow here is simple:  bust up the document into words, munge all
words, and stitch it back together.  Munging a word is just separating it
into characters and rearranging everything but the first and last
character.
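
For readers on Ruby 1.9 or later, here is a minimal sketch of the same
flow.  It is not Gordon's code: the name munge_text is made up for
illustration, shuffle stands in for sort_by{rand}, and the middle letters
are joined explicitly because Array#to_s returns the inspect form (not the
concatenated elements) in those versions.

```ruby
# Scramble the interior of every token, keeping first and last characters.
# munge_text is a hypothetical name, not part of the original solution.
def munge_text(text, rng = Random.new)
  munged = text.split(/\b/).map do |token|
    chars = token.chars
    next token if chars.size < 4        # nothing to rearrange
    first = chars.shift
    last  = chars.pop
    "#{first}#{chars.shuffle(random: rng).join}#{last}"
  end
  munged.join
end

puts munge_text("Here is a simple sentence, for testin' scripts.")
```

Passing a seeded Random (for example Random.new(42)) makes a run
repeatable, which is handy when comparing output.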

Probably the trickiest line in the whole deal is the first and only line
in munge().  It breaks the passed document on word boundaries, which will
be every place a word begins and ends.  Thus, given the sentence:

	Here is a simple sentence, for testin' scripts.

Gordon's code will break the document into this Array:

	[ "Here", " ", "is", " ", "a", " ", "simple", " ", "sentence", ", ",
	  "for", " ", "testin", "' ", "scripts", ".\n" ]

It's important to remember that this is the Regular Expression definition
of "words", which includes digit characters and the underscore.  That's
not a perfect match for the quiz task, but was a popular choice
nonetheless.
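
A quick way to see that definition in action, using a made-up input
string:

```ruby
# \b treats letters, digits, and the underscore as word characters, so the
# tokens below would be scrambled as if they were ordinary words.
p "room_42 costs 1500 dollars".split(/\b/)
# => ["room_42", " ", "costs", " ", "1500", " ", "dollars"]
```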

Now, I did say *all* words are scrambled and that is what I meant.  A run
of four or more punctuation characters is a word, and the middle
punctuation would be scrambled.  In practice, this is rare enough to be a
minor issue.
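
Some solutions avoided both caveats by matching only letters.  A minimal
sketch of that idea, assuming a character class such as [^\W\d_];
munge_letters is a hypothetical name and is not taken from any
submission:

```ruby
# [^\W\d_] means "a word character that is neither a digit nor an
# underscore", which leaves only letters (ASCII letters on Ruby 1.9+).
LETTERS = /[^\W\d_]{4,}/

def munge_letters(text, rng = Random.new)
  text.gsub(LETTERS) do |word|
    middle = word[1..-2].chars.shuffle(random: rng).join
    "#{word[0]}#{middle}#{word[-1]}"
  end
end

puts munge_letters(%q{"I'm just not sure...." he said, on 2006-04-27.})
```

Punctuation runs and digits never match the pattern, so they pass through
untouched.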

I made a bit of a fuss about multi-byte characters during the discussion,
which some people did try to satisfy.  It's only fair I add detail here.

There are many multi-byte character encodings, but I will focus on just
the UTF-8 encoding, because I am way out of my league with anything else.
If you are unfamiliar with Unicode encodings, this article is a pretty
good general introduction:

	http://www.joelonsoftware.com/printerFriendly/arti...

The Ruby specifics are harder to come by, sadly.

Basically, Ruby's Unicode support (UTF-8 encoding only) is through regular
expressions (using matches or methods like split()).  They can be made to
work on characters (instead of bytes) by properly setting $KCODE.  Here's
an example:

	$ cat byte_string.rb
	#!/usr/local/bin/ruby -w

	"résumé".split("").each { |chr| p chr }
	$ ruby byte_string.rb
	"r"
	"\303"
	"\251"
	"s"
	"u"
	"m"
	"\303"
	"\251"
	$ cat utf8_string.rb
	#!/usr/local/bin/ruby -w

	$KCODE = "UTF8"

	"résumé".split("").each { |chr| p chr }
	$ ruby utf8_string.rb
	"r"
	"é"
	"s"
	"u"
	"m"
	"é"

Notice that when I didn't set $KCODE, the two-byte letter is split into
its individual bytes.  However, when I tell Ruby to be Unicode aware, the
two bytes stay together.
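
Note that $KCODE only applies to Ruby 1.8.  From Ruby 1.9 on, each String
carries its own encoding and $KCODE no longer has any effect, so the same
distinction shows up as bytes versus characters.  A small sketch, written
with \u escapes so that it does not depend on the source file's encoding:

```ruby
word = "r\u00e9sum\u00e9"        # "résumé" as a UTF-8 string

p word.encoding.name             # => "UTF-8"
p word.bytes.size                # => 8, each é occupies two bytes
p word.chars.size                # => 6
p word.split("").size            # => 6, split works on characters here
```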

That should give you enough background to tell the solutions that can
handle it from the ones that can't, giving you more examples to look at.
Here's a multi-byte aware solution from Ross B. (-Ku is a shortcut for
$KCODE = "UTF8"):

	#!/usr/local/bin/ruby -Ku
	$stdout << ARGF.read.gsub(/\B((?![\d_])\w{2,})\B/) do |w|
	  $&.split(//).sort_by { rand }
	end

That's mainly just a more compact version of Gordon's script.  This time
though, we are interested in the results of running it.  Watch the é hop
around as I run it a few times:

	$ ruby Ross\ Bamford/scramble.rb test_document.txt
	Actheatd is my rsuémé.
	$ ruby Ross\ Bamford/scramble.rb test_document.txt
	Aaectthd is my rmséué.
	$ ruby Ross\ Bamford/scramble.rb test_document.txt
	Aatcethd is my rémsué.
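
The two \B anchors in that pattern are what keep the first and last
letters in place: a match can only start and end inside a word, so only
the interior is handed to the block.  A small check of that, on a made-up
ASCII sentence:

```ruby
INTERIOR = /\B((?![\d_])\w{2,})\B/   # the pattern from the script above

# Two-letter words have no interior of two or more characters, so "is"
# and "my" are skipped entirely.
p "Attached is my file".scan(INTERIOR)
# => [["ttache"], ["il"]]
```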

Gordon's solution is not multi-byte aware out of the box.  Watch how
things change with that:

	$ ruby Gordon\ Thiesfeld/scramble.rb test_document.txt
	Achttead is my résumé.
	$ ruby Gordon\ Thiesfeld/scramble.rb test_document.txt
	Aehttacd is my résumé.
	$ ruby Gordon\ Thiesfeld/scramble.rb test_document.txt
	Aheatctd is my résum?.?

In order to make sense of that, you need to see how the code found the
words in that line:

	["Attached", " ", "is", " ", "my", " ", "r", "\303\251", "sum",
	  "\303\251.\n"]

See how the last é is lumped in with the end punctuation?  That makes the
group of characters long enough to scramble.  Scrambling then separates
the two bytes of the é, leaving junk characters my terminal doesn't know
how to display.
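
The same effect can be reproduced on Ruby 1.9 or later by rearranging the
bytes by hand.  This is only an illustration of what happened to that last
token, using one possible shuffle:

```ruby
token = "\u00e9.\n"                       # "é.\n", the last token above
bytes = token.bytes                       # [0xC3, 0xA9, 0x2E, 0x0A]

# Keep the first and last byte, swap the two in the middle.
scrambled = [bytes[0], bytes[2], bytes[1], bytes[3]].pack("C*")
scrambled.force_encoding("UTF-8")

p token.valid_encoding?                   # => true
p scrambled.valid_encoding?               # => false, the é has been torn apart
```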

The good news is, we can magically fix Gordon's script:

	$ ruby -Ku Gordon\ Thiesfeld/scramble.rb test_document.txt
	Aatcehtd is my réumsé.
	$ ruby -Ku Gordon\ Thiesfeld/scramble.rb test_document.txt
	Athetcad is my rmuésé.
	$ ruby -Ku Gordon\ Thiesfeld/scramble.rb test_document.txt
	Atcthead is my rmséué.

We probably can't fix all the solutions like this though.  It depends on
how they separated the word into letters.

The downside of going multi-byte aware is that it gets harder to
recognize just the letters, leaving out the digits and underscores.
Filtering out punctuation is a lot harder when we expand to such a vast
definition of characters.  I'm not aware of a good Ruby solution for that
issue yet.  (Please enlighten me if you are!)
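
Later Ruby versions make this easier.  From Ruby 1.9 on, regular
expressions understand Unicode properties such as \p{L} (any letter), so a
letters-only munger no longer needs $KCODE.  A minimal sketch under that
assumption; munge_unicode is a hypothetical name, and lookarounds are used
instead of \B so the pattern does not depend on how word boundaries treat
non-ASCII letters:

```ruby
# Interior of a word: two or more letters with a letter on either side.
INTERIOR_LETTERS = /(?<=\p{L})\p{L}{2,}(?=\p{L})/

def munge_unicode(text, rng = Random.new)
  text.gsub(INTERIOR_LETTERS) { |middle| middle.chars.shuffle(random: rng).join }
end

puts munge_unicode("Attached is my r\u00e9sum\u00e9, version_2 of 3.")
```

Digits, underscores, and punctuation never match \p{L}, so they stay
where they are while accented letters are shuffled as whole characters.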

My thanks to Matthew for another great quiz and to all who gave it a shot.

Tomorrow we will build a simple tool for those of you showing off your
code in an IRC channel...
unknown (Guest)
on 2006-04-27 16:42
(Received via mailing list)
Hi --

On Thu, 27 Apr 2006, Ruby Q. wrote:

> It's important to remember that this is the Regular Expression definition of
> "words", including digit characters and the underscore.  That's not a perfect
> match for the quiz task, but was a popular choice nonetheless.
>
> Now, I did say *all* words are scrambled and that is what I meant.  A run of
> four or more punctuation characters is a word, and the middle punctuation would
> be scrambled.  In practice, this is rare enough to be a minor issue.

"Are you kiddin'?!" he exclaimed :-)

I thought some of the solutions addressed these problems, didn't they,
with [^\W\d_] and such?


David

--
David A. Black (removed_email_address@domain.invalid)
Ruby Power and Light, LLC (http://www.rubypowerandlight.com)

"Ruby for Rails" PDF now on sale!  http://www.manning.com/black
Paper version coming in early May!
James G. (Guest)
on 2006-04-27 16:53
(Received via mailing list)
On Apr 27, 2006, at 7:41 AM, removed_email_address@domain.invalid wrote:

>> Now, I did say *all* words are scrambled and that is what I
>> meant.  A run of
>> four or more punctuation characters is a word, and the middle
>> punctuation would
>> be scrambled.  In practice, this is rare enough to be a minor issue.
>
> "Are you kiddin'?!" he exclaimed :-)
>
> I thought some of the solutions addressed these problems, didn't they,
> with [^\W\d_] and such?

I meant that scrambling long runs of punctuation didn't really seem
to be a problem in actual usage.

Yes, some did correctly find the right characters to scramble.
Worse, I seem to have completely overlooked this gem I was just
informed about off-list:

On Apr 27, 2006, at 7:36 AM, Stefano T. wrote:
>> when we expand to such a vast definition of characters.   I'm not
> Stefano
>
>
> [1] http://www.ruby-talk.org/cgi-bin/scat.rb/ruby/ruby...

My apologies to those who didn't receive the proper credit.  :(

James Edward G. II
unknown (Guest)
on 2006-04-27 17:02
(Received via mailing list)
Hi --

On Thu, 27 Apr 2006, James Edward G. II wrote:

>>> match for the quiz task, but was a popular choice nonetheless.
>> with [^\W\d_] and such?
>
> I meant that scrambling long runs of punctuation didn't really seem to be a
> problem in actual usage.

I guess "Are you kiddin'?!" is a bit of a stretch -- though possible --
but consider:

   "I'm just not sure...." he said.


David

James G. (Guest)
on 2006-04-27 17:32
(Received via mailing list)
On Apr 27, 2006, at 8:01 AM, removed_email_address@domain.invalid wrote:

> I guess "Are you kiddin'?!" is a bit of a stretch -- though possible --
> but consider:
>
>   "I'm just not sure...." he said.

You bring up some good points.

James Edward G. II
Robert D. (Guest)
on 2006-04-27 18:16
(Received via mailing list)
On 4/27/06, James Edward G. II <removed_email_address@domain.invalid> wrote:
> James Edward G. II
This is a jewel, James, but just tell me: what will they become then?

Robert




--
Two things are infinite: the universe and human stupidity; and as for the
universe, I have not yet reached absolute certainty.

- Albert Einstein
James G. (Guest)
on 2006-04-27 18:43
(Received via mailing list)
On Apr 27, 2006, at 9:16 AM, Robert D. wrote:

>> You bring up some good points.
>>
>> James Edward G. II
>
>
> This is a jewel, James, but just tell me: what will they become then?

If I understood the question correctly, David is trying to show that
it is not too rare to have lengthy punctuation, which will be seen as
words to scramble:

   "Are you kiddin<<<'?!" >>>he said.
   "I'm just not sure<<<..." >>>he said.

The first and last characters of those would be anchored, but the
middle punctuation might move:

   "Are you kiddin<<<'!?" >>>he said.
   "I'm just not sure<<<.".. >>>he said.

Hope that makes sense.
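
To check which runs get treated as words, the sentences can be split the
same way the script does.  A small sketch, assuming the split(/\b/)
approach from the summary; punctuation_runs is a hypothetical helper name:

```ruby
# Return the tokens that contain no word characters and are long enough
# (four or more characters) for their middles to be rearranged.
def punctuation_runs(line)
  line.split(/\b/).select { |token| token.size >= 4 && token !~ /\w/ }
end

p punctuation_runs(%q{"Are you kiddin'?!" he said.})
p punctuation_runs(%q{"I'm just not sure...." he said.})
```

Each line should report a single run: the punctuation plus the trailing
space that sits between the quoted speech and "he".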

James Edward G. II
Robert D. (Guest)
on 2006-04-28 11:47
(Received via mailing list)
On 4/27/06, James Edward G. II <removed_email_address@domain.invalid> wrote:
> >>>  "I'm just not sure...." he said..
>    "Are you kiddin<<<'?!" >>>he said.
> James Edward G. II
>
>
>
My apologies to James and the list,

I was just referring to what seemed a funny pun to me:

because there was
>>  "I'm just not sure...." he said.
>   you brought up some good points
I was concerned about the  "...."  p o i n t s
and bringing them up would give us, well, I dunno, maybe "!!!!!" ?
Oh boy, I thought that was funny, but it seems I was the only one :(

Sorry for the noise
Robert
