DRYing a Regex

I’ve got a routine that works fine at building an array of upper-case
strings extracted from a string:

require 'strscan'

aNewList = []
s = StringScanner.new sNewList
upper = /[A-Z]+/
not_upper = Regexp.new( upper.source.sub( /\[/, '[^' ) )
while not s.eos?
  case
  when s.skip(upper); aNewList << s.matched
  else s.skip(not_upper)
  end
end

But the not_upper Regexp definition is really a kludge. It somewhat
camouflages what is really /[^A-Z]+/

I’d like to DRY it by expressing it as something like !upper. I need
something like !~ we use normally with string searches.

Any ideas?

On 11/12/09, RichardOnRails wrote:

upper = /[A-Z]+/
not_upper = Regexp.new( upper.source.sub( /\[/, '[^' ) )
[snip]
But the not_upper Regexp definition is really a kludge. It somewhat
camouflages what is really /[^A-Z]+/

I’d like to DRY it by expressing it as something like !upper. I need
something like !~ we use normally with string searches.

This is somewhat better, but still not real obvious:
not_upper=/(?:.(?!#{upper}))+/ #untested, tho

Myself, I’d just write not_upper=/[^A-Z]/… for something this
short, is it really worth trying all that hard to be DRY?
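For anyone who still wants to derive the complement from upper, here is a minimal sketch. It is not the pattern quoted above: it assumes the lookahead is moved in front of the dot, so the check happens before each character is consumed, and it adds /m so that newlines are skipped too. The extract helper and the sample text are made up for illustration.

```ruby
require 'strscan'

upper = /[A-Z]+/
# Consume one character at a time, but only while `upper` does not match
# at the current position. /m lets the dot match newlines as well.
not_upper = /(?:(?!#{upper}).)+/m

# Hypothetical helper: the loop from the original post, wrapped in a method.
def extract(text, upper, not_upper)
  found = []
  s = StringScanner.new(text)
  until s.eos?
    if s.skip(upper)
      found << s.matched
    else
      s.skip(not_upper)
    end
  end
  found
end

symbols = extract("TMxxx CSCO,\nCOL", upper, not_upper)
```

Checking before consuming matters here. With the lookahead placed after the dot, the scanner can end up one character short of an upper-case run, where neither pattern matches, and the loop would never advance.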

On Nov 12, 2009, at 5:00 PM, RichardOnRails wrote:

else s.skip(not_upper)

end
end

But the not_upper Regexp definition is really a kludge. It somewhat
camouflages what is really /[^A-Z]+/

I’d like to DRY it by expressing it as something like !upper. I need
something like !~ we use normally with string searches.

Any ideas?

Well, you don’t really need a StringScanner for this simple task. Your
code really just rebuilds String#scan():

a_new_list = s_new_list.scan(/[A-Z]+/)

Note that I’ve also switched your variable naming style to the
snake_case that we Rubyists prefer.
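A quick way to convince yourself that the two approaches agree is to run both on the same input. This is only a sketch: the sample string is made up, and the loop uses the literal /[^A-Z]+/ in place of the derived pattern.

```ruby
require 'strscan'

s_new_list = "TMxxx CSCO COL, INTC"

# The StringScanner loop from the original post.
from_scanner = []
scanner = StringScanner.new(s_new_list)
until scanner.eos?
  if scanner.skip(/[A-Z]+/)
    from_scanner << scanner.matched
  else
    scanner.skip(/[^A-Z]+/)
  end
end

# The one-liner: String#scan returns every non-overlapping match in order.
from_scan = s_new_list.scan(/[A-Z]+/)
```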

Hope that helps.

James Edward G. II

On 11/12/09, RichardOnRails wrote:

 else s.skip(not_upper)

end
end

OTOH, you can rewrite it like this, and not have to even mention the
complement of the match you’re interested in:

aNewList = []
s = StringScanner.new sNewList
upper = /[A-Z]+/
aNewList << s.matched while s.skip_until(upper)

(Not tested real thoroughly, corner cases may break.)
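Here is a small sketch that tries the skip_until loop on a few of the corner cases one might worry about. The upper_runs helper and the sample inputs are invented for this check and are not part of the original code.

```ruby
require 'strscan'

# Hypothetical helper wrapping the skip_until loop so it can be run on several inputs.
def upper_runs(text)
  found = []
  s = StringScanner.new(text)
  upper = /[A-Z]+/
  # skip_until returns nil once no further match exists, which ends the loop.
  found << s.matched while s.skip_until(upper)
  found
end

with_trailing_junk = upper_runs("TMxxx CSCO, COL zz")
from_empty         = upper_runs("")
without_capitals   = upper_runs("no capitals here")
```

For these inputs the loop stops cleanly, because skip_until simply returns nil when nothing is left to match.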

On 11/12/09, James Edward G. II wrote:

Well, you don’t really need a StringScanner for this simple task. Your code
really just rebuilds String#scan():

a_new_list = s_new_list.scan(/[A-Z]+/)

ooh! that’s even better.

On Nov 12, 7:37 pm, Caleb C. wrote:

On 11/12/09, James Edward G. II wrote:

Well, you don’t really need a StringScanner for this simple task. Your code
really just rebuilds String#scan():

a_new_list = s_new_list.scan(/[A-Z]+/)

ooh! that’s even better.

You’re right. I didn’t NEED to DRY that simple thing. I’m just
trying to improve my coding generally, especially to write things that
don’t break easily when the inevitable changes are made.

But cutting out 90% of the code, wow! That’s DRY!!

Thank you very much for your ideas. I haven’t tested it yet, but it
looks right to me.

Best wishes,
Richard

On Nov 12, 6:50 pm, James Edward G. II wrote:

not_upper = Regexp.new( upper.source.sub( /\[/, '[^' ) )
I’d like to DRY it by expressing it as something like !upper. I need
Hope that helps.

James Edward G. II

Hi James,

As I said to Caleb, cutting my 10-liner down to 1 is extreme DRYing!!
Thanks for that.

As far as underscoring vs. Camel-case goes, I know Rubyists’
preference, but I bow to Shakespeare’s notion that “a rose by any
other name is just as sweet.” I spent a couple decades
writing/maintaining Windows applications for clients using C and C++,
so I’ve a fondness for Polish notation (at least that’s what I think
it was called.) Typing extra hyphens vs pressing the shift key lets me
write code faster, and the a/s/h prefix for arrays/strings/hashes
helps me avoid a lot of interpreter complaints. And fellow programmers
of almost any stripe know what I mean. Finally, I’m a retired
curmudgeon, and you know how we old folks are :)

Seriously, your insight was very helpful and will help me avoid a
bunch of wasteful code.

Best wishes,
Richard

Hey Caleb & James,

With your insights, I was able to cut down 18 lines of somewhat
obscure code to 6 lines that I find very readable. That’s such an
improvement on the quality of the code.

Though I expect you guys are tired of this thread, I included the new
and old code below, along with results that both of them produce.

Again, thank you very much for your insights.

Best wishes,
Richard

# Accept a new list as a string; extract an array of contiguous
# upper-case letters as stock symbols, ignoring any duplicates (Test data)
# Delete any symbol in the current list that occurs here
sNewList = %{TMxxx CSCO COL INTC BRCM FDX AA CAT BUR FSLR MSFT',
PNC HPQ CSCO AMAT ORCL FCX ABX PVTB XHB CSCO TM FDX}

#===============
# New technique
#===============
require 'set'

aRawNewList = sNewList.scan(/[A-Z]+/)
aNewList = Set.new(aRawNewList).to_a.sort
nDeleted = 0
aNewList.each { |sym| hCurrentList.delete sym and nDeleted += 1 if hCurrentList[sym] }
show_array( aNewList, 10, "New List (unique:%d, dups:%d, deleted:%s)" %
  [aNewList.size, aRawNewList.size - aNewList.size, nDeleted], true)

#============================
# Old technique; No longer used
#============================
aNewList = []
s = StringScanner.new sNewList
upper = /[A-Z]+/
non_upper = Regexp.new( upper.source.sub( /\[/, '[^' ) )
nNewSyms = nCurrSymsDeleted = 0
while not s.eos?
  case
  when s.skip(upper)
    nNewSyms += 1
    aNewList << s.matched unless aNewList.include? s.matched
    ( hCurrentList.delete s.matched and nCurrSymsDeleted += 1 ) if hCurrentList[s.matched]
  else
    s.skip(non_upper)
  end
end
show_array( aNewList.sort, 10, "New List (%d unique; %d dups; %d curr. deleted)" %
  [aNewList.size, nNewSyms - aNewList.size, nCurrSymsDeleted] )

#=======
# Output
#=======
===== New List (unique:19, dups:4, deleted:3) =====
AA ABX AMAT BRCM BUR CAT COL CSCO FCX FDX
FSLR HPQ INTC MSFT ORCL PNC PVTB TM XHB
===== =====

On Nov 13, 2:34 am, RichardOnRails wrote:

Hey Caleb & James,

With your insights, I was able to cut down 18 lines of somewhat
obscure code to 6 lines that I find very readable. That’s such an
improvement on the quality of the code.

I believe you can go further. For example, these three lines:

sNewList = %{TMxxx CSCO COL INTC BRCM FDX AA CAT BUR FSLR MSFT',
PNC HPQ CSCO AMAT ORCL FCX ABX PVTB XHB CSCO TM FDX}
aRawNewList = sNewList.scan(/[A-Z]+/)
aNewList = Set.new(aRawNewList ).to_a.sort

can be replaced by one:

aNewList = %W{TMxxx CSCO COL INTC BRCM FDX AA CAT BUR FSLR MSFT',
PNC HPQ CSCO AMAT ORCL FCX ABX PVTB XHB CSCO TM FDX}

(you can add .sort to the end but I don’t think you need it)

also, consider something like this:

hCurrentList.delete_if { |key,v| aNewList.include?key }

– Mark.

2009/11/13 RichardOnRails:

As far as underscoring vs. Camel-case goes, I know Rubyists’
preference, but I bow to Shakespeare’s notion that “a rose by any
other name is just as sweet.” I spent a couple decades writing/
maintaining Window’s application for clients using C and C++, so I’ve
a fondness for Polish notation (at least that’s what I think it was
called.)

I believe you mean Hungarian Notation:
http://en.wikipedia.org/wiki/Polish_notation

Typing extra hyphens vs pressing the shift key lets me write
code faster, and the a/s/h prefix for arrays/strings/hashes helps me
avoid a lot of interpreter complaints. And fellow programmers of
almost any stripe knows what I mean.

There’s always something to be said for conventions. The issue with
your notation is that it seems to be far less used among Ruby
programmers than the snake case. Snake case for variables and methods
also has the added advantage that classes and modules stand out
immediately.

Side note: with modern IDEs I believe there is not much reason to use
Hungarian Notation any more. I personally find it more difficult to
spot certain variables when all variables of the same type start with
the same letter. For me, HN actually reduces readability.

Finally, I’m a retired curmudgeon,
and you know how we old folks are :)

LOL

Cheers

robert

2009/11/13 Mark T.:

sNewList = %{TMxxx CSCO COL INTC BRCM FDX AA CAT BUR FSLR MSFT',
PNC HPQ CSCO AMAT ORCL FCX ABX PVTB XHB CSCO TM FDX}
aRawNewList = sNewList.scan(/[A-Z]+/)
aNewList = Set.new(aRawNewList ).to_a.sort

can be replaced by one:

aNewList = %W{TMxxx CSCO COL INTC BRCM FDX AA CAT BUR FSLR MSFT',
PNC HPQ CSCO AMAT ORCL FCX ABX PVTB XHB CSCO TM FDX}

I don’t think so, because that appears to be input from the outside
which is provided as a single String.

(you can add .sort to the end but I don’t think you need it)

also, consider something like this:

hCurrentList.delete_if { |key,v| aNewList.include?key }

Basically the question is which of the two is larger. But if you do
it this way round (i.e. iterate the Hash and check for existence in
the new list), then the new list should definitely be a Set.

Here’s my suggestion

require 'set'

# dummy base
current = {"CSCO" => 1, "COL" => 2, "INTC" => 3, "BRCM" => 4, "FOO" => 99}

# user input
input = %{TMxxx CSCO COL INTC BRCM FDX AA CAT BUR FSLR MSFT
PNC HPQ CSCO AMAT ORCL FCX ABX PVTB XHB CSCO TM FDX}

# algorithm
symbols = input.scan(/[A-Z]+/)
deduped = symbols.to_set

old_size = current.size
deduped.each {|sym| current.delete sym}

p(deduped.sort,
  10,
  sprintf("New List (unique:%d, dups:%d, deleted:%s)",
          deduped.size,
          symbols.size - deduped.size,
          old_size - current.size),
  true)

p current

Cheers

robert

RichardOnRails wrote:
[…]

As far as underscoring vs. Camel-case goes, I know Rubyists’
preference, but I bow to Shakespeare’s notion that “a rose by any
other name is just as sweet.”

It doesn’t work that way in programming. Good naming practices are an
important part of readable code. This is particularly so in a language
like Ruby, in which “literate” interfaces are common.

I spent a couple decades writing/

maintaining Window’s application for clients using C and C++, so I’ve
a fondness for Polish notation (at least that’s what I think it was
called.)

Polish Notation is Łukasiewicz-style prefix notation, rather like what’s
used in Lisp. You mean Hungarian Notation.

But in any case, you’ve been had. Hungarian Notation as developed by
Charles Simonyi is extremely useful in non-OO code (I’ve used it in PHP
with great success). Hungarian Notation as the term is usually
understood is a very stupid thing indeed, which has unfortunately been
foisted by Microsoft on huge numbers of Windows programmers who really
should know better. :) It is (at best) marginally useful in statically
typed languages like C, and downright misleading in dynamically typed
languages like Ruby.

The difference is that Simonyi’s original concept encodes information
outside the scope of the variable’s type (which, after all, the
interpreter or compiler already knows about). For example, in a mapping
system, you might have kmDistance and ftCorrection. It’s entirely clear
from those names that kmDistance + ftCorrection would be adding
kilometers and feet without a conversion, and thus it’s immediately
clear that that operation is wrong.

OTOH, legions of misled Windows developers would simply call those two
variables intDistance and intCorrection, incorporating no new useful
information and making the names harder to read.

For more on the misuse of Hungarian Notation, please see
http://www.joelonsoftware.com/articles/Wrong.html (Simonyi’s original is
there called Apps Hungarian, while the popular perversion is called
Systems Hungarian). There’s also some interesting discussion at
http://c2.com/cgi/wiki?HungarianNotation , if you can wade through the
disorganization.

Systems Hungarian, BTW, is bad enough in C, where you should be able to
refer to your variable declarations. If your functions are so long that
you can’t refer easily to declarations, then you need to refactor to
shorter methods for overall readability anyway – methods should be
short. Systems Hungarian has no use at all in Ruby, since although
objects are typed, variables are not, so it’s perfectly possible to do
intValue = 1
# later
intValue = {:foo => 'bar'}

Even Apps Hungarian is not a great idea in OO code. Instead, just use
the type system, so that distance would be a Kilometer object and
correction would be a Foot object. Kilometer.+(foot) could then either
raise an exception or invoke a conversion.
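A minimal sketch of that idea follows. Everything in it is hypothetical: the class layout, the to_kilometers protocol, and the choice to convert compatible units while raising for anything else are assumptions made for illustration, not code from any library.

```ruby
# Hypothetical unit classes; names and layout are illustrative only.
class Foot
  METERS_PER_FOOT = 0.3048  # the international foot is defined as exactly 0.3048 m

  attr_reader :value

  def initialize(value)
    @value = value
  end

  def to_kilometers
    Kilometer.new(value * METERS_PER_FOOT / 1000.0)
  end
end

class Kilometer
  attr_reader :value

  def initialize(value)
    @value = value
  end

  def to_kilometers
    self
  end

  # Convert the other operand when it knows how; otherwise refuse loudly.
  def +(other)
    unless other.respond_to?(:to_kilometers)
      raise TypeError, "can't add #{other.class} to Kilometer"
    end
    Kilometer.new(value + other.to_kilometers.value)
  end
end

distance   = Kilometer.new(12.0)
correction = Foot.new(1000)
total      = distance + correction   # converted, not silently mixed
```

With this in place, distance + 5 raises a TypeError instead of quietly adding a bare number, which is the mistake the kmDistance/ftCorrection prefixes were meant to make visible.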

In summary, then, Hungarian Notation of either sort is inappropriate in
Ruby. Drop the habit.

Typing extra hyphens vs pressing the shift key lets me write
code faster, and the a/s/h prefix for arrays/strings/hashes helps me
avoid a lot of interpreter complaints.

If you care about removing characters from variable names, start with
removing the Hungarian warts. As I explained above, they serve no
useful purpose in Ruby at all. And I have to say, I don’t find
wordsRunTogether as easy to read as words_with_underscores – the
underscores look more like spaces and delineate the words better to my
eye. WouldYouRatherReadThisClauseHere, or
would_you_rather_read_this_clause_here?

In any case, “snake_case” is the prevailing style in Ruby, and virtually
every Ruby library uses it (including the standard library and Rails) –
your code will look strange if you don’t follow suit. The examples in
Programming Ruby tend to use camelCase, but that’s more of a flaw in the
book than an indicator of Ruby practice.

And fellow programmers of
almost any stripe know what I mean. Finally, I’m a retired curmudgeon,
and you know how we old folks are :)

Age is not an excuse. If you’re going to learn a language, take the
time to learn the idioms and the “spirit” of the language, not just the
bare essentials of syntax. I’ve seen far too many people try to write
C, Java, or PHP in Ruby – avoid the temptation!

Seriously, your insight was very helpful and will help me avoid a
bunch of wasteful code.

Best wishes,
Richard

Best,

Marnen Laibow-Koser
http://www.marnen.org

On Nov 13, 2009, at 10:11 PM, Marnen Laibow-Koser wrote:

RichardOnRails wrote:
[…]

As far as underscoring vs. Camel-case goes, I know Rubyists’
preference, but I bow to Shakespeare’s notion that “a rose by any
other name is just as sweet.”

It doesn’t work that way in programming. Good naming practices are an
important part of readable code.

As the saying goes, “When in Rome, do as the Romans do.” You’re
speaking our language now and you want to learn to speak it like us,
even with our slang. That allows you to communicate with us better so
we can learn from each other.

Even Apps Hungarian is not a great idea in OO code. Instead, just use
the type system, so that distance would be a Kilometer object and
correction would be a Foot object. Kilometer.+(foot) could then either
raise an exception or invoke a conversion.

I would like to see us move away from considering classes to be types at
all in Ruby. Who knows what modules an object has mixed into it and who
knows what singleton methods are defined on it. A class, which is what
people traditionally take for the type, is just one piece of an object’s
identity.
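To illustrate the point, here is a small sketch in which two objects share a class but do not respond to the same messages. The Loud module and both objects are made up for the example.

```ruby
# Hypothetical module and objects, for illustration only.
module Loud
  def speak
    "HELLO"
  end
end

plain    = Object.new
extended = Object.new
extended.extend(Loud)   # mix a module into this one object only

def extended.whisper    # singleton method on this one object only
  "hello"
end

same_class      = plain.class == extended.class
plain_speaks    = plain.respond_to?(:speak)
extended_speaks = extended.respond_to?(:speak)
```

Both objects report Object as their class, yet only one of them can speak or whisper, so the class alone does not say what an object can do.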

James Edward G. II

James Edward G. II wrote:
[…]

Even Apps Hungarian is not a great idea in OO code. Instead, just use
the type system, so that distance would be a Kilometer object and
correction would be a Foot object. Kilometer.+(foot) could then either
raise an exception or invoke a conversion.

I would like to see us move away from considering classes to be types at
all in Ruby. Who knows what modules an object has mixed into it and who
knows what singleton methods are defined on it.

Do you make much use of singleton mixins or singleton methods in your
code? I know I don’t.

A class, which is what
people traditionally take for the type, is just one piece of an object’s
identity.

You’re right. But with a proper class system, my point about not
needing Apps Hungarian in Ruby still stands, I think. Do you disagree?

James Edward G. II

Best,

Marnen Laibow-Koser
http://www.marnen.org

David T. wrote:

On 14/11/2009, at 15:21, James Edward G. II wrote:

A class, which is what people traditionally take for the type, is just
one piece of an object’s identity.

I would still look immediately to the class of the object in order to
find out what it’s supposed to do.

I would too. James is correct that it isn’t the whole story, but it’s
the best place to start.

From there, the class definition
will probably list its module inclusions prominently.

As a vim user, with very limited interactive debugging,

What? You can use ruby-debug interactively in a console session. I
often do.

my primary
exploration technique will usually consist of at most a couple of
‘obj.methods.grep’ calls followed by grepping ~/gems which seems to
emphasize the actual reading of the source for object identity info.

Python’s integrated documentation would be really welcome in this
case, I think. :)

WTF? Aren’t you familiar with RDoc? And didn’t you know that running
“gem server” will start a Web server with gem RDoc pages on port 8808?

I’m curious what you think the most correct way is to discover object
identity.

Object identity? Well, for that, you need object_id. That’s something
different than object type.
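A short sketch of the distinction, using two made-up strings: they are equal in value and share a class, but they are still two separate objects.

```ruby
# String.new guarantees two separate objects even if string literals are frozen.
first  = String.new("ruby")
second = String.new("ruby")
same   = first              # a second reference to the same object

equal_contents = first == second                        # compares values
same_object    = first.equal?(second)                   # compares identity
same_ids       = first.object_id == second.object_id    # identity again
aliased        = same.equal?(first)
same_type      = first.class == second.class
```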

On 14/11/2009, at 15:21, James Edward G. II wrote:

A class, which is what people traditionally take for the type, is just
one piece of an object’s identity.

I would still look immediately to the class of the object in order to
find out what it’s supposed to do. From there, the class definition
will probably list its module inclusions prominently.

As a vim user, with very limited interactive debugging, my primary
exploration technique will usually consist of at most a couple of
‘obj.methods.grep’ calls followed by grepping ~/gems which seems to
emphasize the actual reading of the source for object identity info.

Python’s integrated documentation would be really welcome in this
case, I think. :)

I’m curious what you think the most correct way is to discover object
identity.

From: “David T.”

will probably list its module inclusions prominently.

A human looking to documentation to find out what an object
of a particular class is supposed to do is one thing. But
then there’s the programmatic flipside where one could code
a method to select between different behaviors based on the
class-type of a given argument-object.

def foo(bar)
  if bar.is_a? Array
    do_array_thing(bar)
  elsif bar.is_a? String
    do_string_thing(bar)
  else
    … # ?
  end
end

I believe it’s (variations on) the above that are viewed
as unreasonably restrictive in ruby.

It’s challenging, too, because even :respond_to? can be
misleading.

I like Og (Object Graph), an Object Relational Mapping
library in ruby providing high-level database access.

require 'og'

class Address
  property :name, String
  property :company, String
  property :dept, String
  property :addr1, String
  property :addr2, String
  property :city, String
  property :state, String
  property :zip, String
  property :country, String
  belongs_to :order, Order
end

When Og is initialized, it searches ObjectSpace for
classes like the above, and detects that they are
intended to be Og-managed classes, and imbues them
with certain basic features. (It also generates the
SQL needed to create the database tables
corresponding to such classes.)

An example is that, given nothing more than the above
Address class declaration… I could now say:

result = Address.find_by_name_and_state("Bob Jones", "CA")

But…! The Address.find_by_name_and_state method doesn’t even
exist until the time that it is called. Part of the
magic with which an Og-managed class is imbued is
some method_missing logic which looks for particular
method names, like /find_by_(.*)/. At the moment such a
method is called, its name is tested against the
following, behind the scenes:

def method_missing(sym, *args, &block)
  if match = /find_(all_by|by)_([_a-zA-Z]\w*)/.match(sym.to_s)
    return find_by_(match, args, &block)
  elsif match = /find_or_create_by_([_a-zA-Z]\w*)/.match(sym.to_s)
    return find_or_create_by_(match, args, &block)
  else
    super
  end
end

(Note: In this case, it appears Og always handles the
request via method_missing. But I’ve seen other code
in Og (or maybe Nitro) that did define the method when
it was first called, such that on subsequent invocations
the method would now already be existing.)
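The define-on-first-call variant might look roughly like the sketch below. This is not Og's, Nitro's, or ActiveRecord's code. The Address class, its in-memory RECORDS, and the FINDER pattern are all stand-ins invented to show the mechanism.

```ruby
# Hypothetical stand-in for an ORM-managed class, backed by an in-memory array.
class Address
  RECORDS = [
    { name: "Bob Jones", state: "CA" },
    { name: "Ann Lee",   state: "NY" }
  ]

  FINDER = /\Afind_by_([_a-zA-Z]\w*)\z/

  def self.method_missing(sym, *args, &block)
    match = FINDER.match(sym.to_s)
    return super unless match

    fields = match[1].split("_and_").map(&:to_sym)
    # Define a real method so later calls never reach method_missing.
    define_singleton_method(sym) do |*values|
      RECORDS.find { |rec| fields.zip(values).all? { |field, value| rec[field] == value } }
    end
    public_send(sym, *args, &block)
  end

  # Keeps respond_to? honest about finders that have not been defined yet.
  def self.respond_to_missing?(sym, include_private = false)
    FINDER.match?(sym.to_s) || super
  end
end

defined_before = Address.singleton_methods.include?(:find_by_name_and_state)
result         = Address.find_by_name_and_state("Bob Jones", "CA")
defined_after  = Address.singleton_methods.include?(:find_by_name_and_state)
```

Without respond_to_missing?, respond_to? would report false for a finder that works fine when called, which is one way respond_to? can mislead.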

. . . Anyway, the point being, Ruby is pretty dynamic.

:)

Python’s integrated documentation would be really welcome in this
case, I think. :)

I seem to recall mention awhile back on ruby-talk of
a gem or module that integrated ri into irb, such
that one could pull up the documentation from within
irb. (I don’t have any links for that, sorry.)

Regards,

Bill

On Nov 14, 7:03 am, Ralph S. wrote:

BK> elsif bar.is_a? String

As a newbie I would surely like to know why the language decided on
“elsif” rather than “elseif”.

Because a precedent had been set in Perl. That’s one of the
unfortunate Perlisms in Ruby.

At least Matz didn’t borrow it from Bash, which uses “elif”.

On Sat, Nov 14, 2009 at 6:03 AM, Ralph S. wrote:

BK> elsif bar.is_a? String

As a newbie I would surely like to know why the language decided on
“elsif” rather than “elseif”.

I’m pretty sure it’s a Perl artifact.

On Nov 14, 2009, at 12:23 AM, Bill K. wrote:

(Note: In this case, it appears Og always handles the
request via method_missing. But I’ve seen other code
in Og (or maybe Nitro) that did define the method when
it was first called, such that on subsequent invocations
the method would now already be existing.)

ActiveRecord from Rails works this way. If you would like to see the
code it starts around line 1830 of this file:

http://github.com/rails/rails/blob/master/activerecord/lib/active_record/base.rb

I seem to recall mention awhile back on ruby-talk of
a gem or module that integrated ri into irb, such
that one could pull up the documentation from within
irb. (I don’t have any links for that, sorry.)

Here’s what I have in my .irbrc file:

def ri(*names)
  system(%{ri #{names.join(" ")}})
end

James Edward G. II