Regular expressions question

akoSS · December 15, 2005, 4:43am

James Edward G. II wrote:

I see a lot of code running the Ruby Q. and I feel quite confident
saying that the Regexp variables are far more common than MatchData. I
don’t think that says anything bad about the latter, but it does tell
me that you are in the minority.

If the majority here uses globally scoped variables to store locally
used values, then that doesn’t say anything good for this portion of
the Ruby community.

–
Neil S.

‘A republic, if you can keep it.’ – Benjamin Franklin

akoSS · December 15, 2005, 3:55am

thank you. yes, it seems to be the only way. just that it is a shame
that we have to match the same expression again! the information was
available already, it was just discarded during the first match in your
sample.

konstantin

akoSS · December 15, 2005, 4:55am

On Dec 14, 2005, at 9:42 PM, Neil S. wrote:

the Ruby community.

I really think you are blowing the issue out of proportion. A Regexp
is generally checked on one line and the variables used on the next.
It doesn’t make a lot of sense to hold on to them for twenty lines
after you make the check.

Also, I believe they are thread-local variables are they not? (I’m
honestly asking.) If so, I don’t see a lot of concern about them
being stomped on before they are used.

James Edward G. II

akoSS · December 15, 2005, 5:04am

James Edward G. II wrote:

Also, I believe they are thread-local variables are they not? (I’m
honestly asking.) If so, I don’t see a lot of concern about them being
stomped on before they are used.

Actually, I just looked it up, and according to “Programming Ruby,”
$1-$9 are “local to the current scope.”

My mistake, heh. I wonder how many who use them know that, though, and
how many just do it without checking because it’s popular in perl or
popular on here.

–
Neil S.

akoSS · December 15, 2005, 5:13am

Ross B. wrote:

Well, okay. No need to shout though, is there?

Just trying to put a bit back, you know?

Because it wasn’t just directed at you … there have been other posts
that included those same “warts”.

And I didn’t mean to shout, I just meant to exaggerate the please …
Please would have been more appropriate.

j.

akoSS · December 15, 2005, 5:25am

Neil S. wrote:

My mistake, heh. I wonder how many who use them know that, though, and
how many just do it without checking because it’s popular in perl or
popular on here.

Hold on, this wasn’t really my mistake, I think. How is one supposed to
know a dollar-sign variable isn’t always global?

This sounds to me like some special-case hackery done to keep careless
coders from shooting themselves in the foot.

–
Neil S.

akoSS · December 15, 2005, 5:22am

James Edward G. II wrote:

variables …
[…]
common feature to many Regular Expression implementation and I’m not
[…]
two-space tab, etc. I’m Ruby assimilated, but I just like the Regexp-linked
variables.
James Edward G. II

Quite simply, Ruby is supposed to be about consistency … Having the
“everything is an object, principle of least surprise” mantra, then
using these which act like a global ( $ ) but aren’t actually ( local
scope ) is just vile.

That’s why I have a problem with them. If the community uses them,
well, that’s their option, I’m just one that’s all for consistency,
always, as much as possible… It tends to make things more generic and
able to handle change better.

I can’t speak as to whether Perl was the first language to do the ${x}
variables … but, it ( so far of the languages I’ve learned ) uses it
heavily, and it contributes to all of the punctuation soup that we all
left Perl to get away from… ( again I’m speaking generally, but I
could also again be wrong … that’s one of the uglies I left because of
).

… Not trying to be language police, I just really love the MatchData,
and find it MUCH easier to deal with. Then you can keep your datasets
from multiple matches around … to me it is easier to read …
instead of … $1 … where’d that come from … I didn’t assign a
glob… oh that’s right …
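
A minimal sketch of that point, assuming two made-up strings and patterns (neither is from this thread): each MatchData is an ordinary object, so two results can be kept side by side under names of your choosing, whereas $1 only reflects the most recent match.

```ruby
# Hypothetical inputs, chosen only for illustration.
date = /(\d{4})-(\d{2})-(\d{2})/.match("released 2005-12-15")
time = /(\d{2}):(\d{2})/.match("posted at 04:43")

# Both results stay available, each under its own name.
year   = date[1]
minute = time[2]

# $1 now refers to the last successful match only (the time pattern).
last_group_one = $1

puts year, minute, last_group_one
```

Whether the named objects read better than $1 is the matter of taste being argued here; the sketch only shows the mechanical difference.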

Anyways, I’m sorry to have caused this thread to go on this long … I
just really thought more of the people on the list would step up and
say, yeah, those are some very ugly warts and we don’t use them … but
apparently, I was wrong.

I’ll shut up now.

j.

akoSS · December 15, 2005, 8:55am

Neil S. wrote:

This sounds to me like some special-case hackery done to keep careless
coders from shooting themselves in the foot.

Usually if you think about what the variable represents, it’s obvious
whether it should be thread-local or global, and ruby does it that way.
(Results of something the thread did, like call an external process, or
match a regex–those are thread local. Environment that was given when
the program started–those are global.) The local/global distinction is
not hackery, but the notation (inherited from perl) is not great.

I do kinda wish there was a consistent visual cue of some kind, like
$$foo for global and $foo for local, or $foo for global and $_foo for
local. It would also be nice to have a faster way to access user-defined
thread vars: $_foo versus Thread.current[:foo].
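
A small sketch of the distinction, assuming nothing beyond core Ruby: a match made inside a new thread leaves the main thread's $1 alone, and user-defined per-thread values go through Thread.current[...]. The strings and the :foo key are placeholders.

```ruby
"foo" =~ /(\w+)/
main_before = $1

# The match inside the new thread does not disturb the main thread's $1.
worker_seen = Thread.new do
  "bar" =~ /(\w+)/
  $1
end.value

main_after = $1

# User-defined per-thread storage. A value set in the main thread is not
# visible from another thread.
Thread.current[:foo] = "main"
worker_foo = Thread.new { Thread.current[:foo] }.value

p [main_before, worker_seen, main_after]
p worker_foo
```

In current Ruby versions Thread.current[:foo] is actually fiber-local; Thread#thread_variable_get and Thread#thread_variable_set are the strictly thread-local form.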

akoSS · December 15, 2005, 9:43am

On Thu, 15 Dec 2005 01:23:14 -0000, William J. wrote:

The first edition of the Pickaxe says:

“Having said all this, we have to 'fess up. Andy and Dave normally
use the $-variables rather than worrying about MatchData objects.
For everyday use, they just end up being more convenient.
Sometimes we just can’t help being pragmatic.”

Thanks. I don’t feel nearly so bad about being too lazy to get a
MatchData now

akoSS · December 15, 2005, 9:34am

From: “Neil S.”

Hold on, this wasn’t really my mistake, I think. How is one supposed to
know a dollar-sign variable isn’t always global?

This sounds to me like some special-case hackery done to keep careless
coders from shooting themselves in the foot.

I think it’s more like when the intrepid Ruby nuby first notices
a method not suffixed with ! that modifies the receiver–and
posts to the list: This is inconsistent! This can’t be right!
This violates POLS! etc.! “All methods that modify the receiver
should end in !, right???”

And Matz points out that the rationale is somewhat different. . . .

Similarly, it doesn’t seem reasonable to condemn method-local $1…$n
as special-case hackery designed to benefit careless coders, so much
as Ruby behaving in the most naturally useful way possible.

Huzzah! &c.

Regards,

Bill

akoSS · December 15, 2005, 9:58am

On Thu, 15 Dec 2005 03:35:15 -0000, James Edward G. II wrote:

On Dec 14, 2005, at 6:16 PM, Jeff W. wrote:

You should be able to tell who this message is meant for:

Yes, I recognize that you are probably speaking at least in part to me,
since I did that in this very thread. You can call me by name if you
like. I’m a big boy and I can take it.

(Sorry, James, I assumed that was directed at me alone)

PLEASE stop sending out code that uses any of the perl ${x} variables
…

[…]

Please teach people to use MatchData objects …

I also showed a MatchData example.

I think the $1, $2 stuff just hit Jeff’s anger button, because I had
MatchData down at the bottom of my examples too but it seemed to be
missed.

I’ve used them a time or two, but honestly, they just don’t feel right
to me. I’ve stopped using the default variable, I’m using a two-space
tab, etc. I’m Ruby assimilated, but I just like the Regexp-linked
variables.

Same here - I like to stick to the convention of whatever language I’m
using. To begin with I did use a lot more Perl-like stuff (I started with
Ruby using an old Perl book. Don’t ask…) but now I’m starting to find
the Rubyisms that work for me ($LOAD_PATH instead of $: and so on).

It seems sensible to me to save the MatchData stuff for the times when
it’s needed. Java’s regexp support is purely OOP-based (of course, being
Java ;)) and I can’t remember when I last used it for something simple -
it’s just nowhere near as convenient as matching a literal regexp and
using the numbered variables…

Cheers,
Ross

akoSS · December 15, 2005, 11:35am

ako… wrote:

thank you. yes, it seems to be the only way. just that it is a shame
that we have to match the same expression again! the information was
available already, it was just discarded during the first match in
your sample.

I still didn’t get what exactly you want. Does this help?

'a,b ,c'.split /\s*,\s*/
=> ["a", "b", "c"]

Kind regards

robert

akoSS · December 15, 2005, 3:45pm

Ross B. wrote on 12/14/2005 4:32 PM:

i need to capture all matches for a group. for example if

'ab c' =~ /^(.)*$/

You could try:

a regex tool i’m finding invaluable is “redet” (on freshmeat)

works with a number of languages including ruby…

akoSS · December 15, 2005, 5:27pm

Quoting Neil S.:

If the majority here uses globally scoped variables to store
locally used values, then that doesn’t say anything good for
this portion of the Ruby community.

Were you aware that $1 and friends aren’t actually globally scoped?

-mental

akoSS · December 15, 2005, 6:03pm

Neil S. writes:

How convenient that you quote that without quoting the drawbacks listed
first…

Which drawbacks?

akoSS · December 15, 2005, 10:07am

On Thu, 15 Dec 2005 04:10:17 -0000, Jeff W. wrote:

PLEASE stop sending out code that uses any of the perl ${x} variables
…

Well, okay. No need to shout though, is there?

Just trying to put a bit back, you know?

Because it wasn’t just directed at you … there have been other posts
that included those same “warts”.

Ok, sorry. I’m kind of paranoid

And I didn’t mean to shout, I just meant to exaggerate the please …
Please would have been more appropriate.

I wondered that after I posted. Just my old Fidonet reflexes kicking in
I suppose

Just to say, though, that I tend to feel when someone asks a question,
that I should give them all the alternative answers I have, and let them
choose what suits them. If we really don’t want people to use something
(which I don’t agree with in this case btw) then the way to do that is to
make sure they completely understand that thing - only then can they
choose for themselves whether it’s good or not.

(I’m thinking specifically about the anti-pattern wars that have consumed
a lot of good work in Java over the past few years).

So anyway, that’s what I did. It’s just natural I think to start with the
shortest way ($1), move to the longer ways ($~[1], $~.captures[0]) and
finally the ‘long’ way (/…/.match etc). I did slip in a subtle hint at
the end that MatchData was worth looking up, with the $~.inspect line…
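
For reference, a sketch of those four spellings side by side, assuming a made-up "key=value" input and pattern (not the example from the earlier post). All four read the same first capture group.

```ruby
line = "key=value"   # hypothetical input

if line =~ /(\w+)=(\w+)/
  shortest = $1                 # numbered variable
  via_last = $~[1]              # index into the last MatchData
  via_caps = $~.captures[0]     # captures array is zero-based
end

md = /(\w+)=(\w+)/.match(line)  # explicit MatchData object
long_way = md[1]

p [shortest, via_last, via_caps, long_way]
```

Note the off-by-one between $~[1] and $~.captures[0]: index 0 of the MatchData itself is the whole match, while captures starts at the first group.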

Cheers,
Ross

akoSS · December 15, 2005, 6:12pm

Quoting James Edward G. II:

Also, I believe they are thread-local variables are they not?
(I’m honestly asking.) If so, I don’t see a lot of concern
about them being stomped on before they are used.

They’re method-local.

def foo
  "abc" =~ /(a)/
  p $1
end

foo   # prints "a"
p $1  # prints nil

-mental

akoSS · December 15, 2005, 6:12pm

Neil S. writes:

popular on here.

Actually, these variables are thread-local in Perl too:

use strict;
use threads;

"foo" =~ /(\w+)/;
print($1, "\n");
threads->new(sub {
    "bar" =~ /(\w+)/;
    print($1, "\n");
})->join;
print($1, "\n");

prints:

foo
bar
foo

akoSS · December 15, 2005, 6:46pm

On 12/15/05, Robert K. wrote:

ako… wrote:

thank you. yes, it seems to be the only way. just that it is a shame
that we have to match the same expression again! the information was
available already, it was just discarded during the first match in
your sample.

I still didn’t get what exactly you want. Does this help?

'a,b ,c'.split /\s*,\s*/
=> ["a", "b", "c"]

Now that I’ve read the responses in this thread a few times, I think
I understand what he wants to do. And I don’t think it can be done
via scan.

First: He wants a single regex which will verify the syntax of an
entire line. So, first he wants a true/false value, saying “The line
is valid, or it is not valid”. Never mind any values in the line, just
“is the line completely valid?”.

Then, if the line is valid, he wants to break out individual pieces
of what was scanned, and he wants to do that without re-doing
any of the scans he did in the first regex. The trick is that some
of those pieces are a repeating group, such as /(\s\w)*/.

What is confusing us is that he describes this using a simple
example, and when we solve the simple example he then says
“you don’t get the bigger picture!”. Ugh.

Let me give an example, and see if someone can solve it. My
example might still be something other than what he’s thinking
of, but maybe it will help.

Let’s say I’m expecting command lines of the form:
first word is either ‘copy’ or ‘duplicate’
followed by one or more words
followed by the word ‘before’ or ‘after’
followed by one or more words

So I could do the first step with the regexp:

/^(copy|duplicate) \s+ (\w+\s+)+ (before|after) \s+ (\w+\s*)+ $/x

(hopefully I’ve done that right!). IF that matches, then I know
the entire line is valid. Then, after I know the line is valid, I want
the array of source-words, and the array of destination-words
which were matched. I want to do that by picking out information
in MatchData, not by doing a new scan. The thing is, I don’t think
I have a way of knowing how many times the first ‘(\w+\s+)+’ was
matched. So I can’t just do a slice of $~.captures because I don’t
know what the starting and ending indexes of that slice would be.
I could put another set of parentheses around the two repeating
groups:

/^(copy|duplicate) \s+ ((\w+\s+)+) (before|after) \s+ ((\w+\s*)+) $/x

But that doesn’t really give me two separate arrays of the
individual values that made up each group. It just matches
each group as a whole.

Given two data lines of:
copy apple pear plum peach after bill bob
duplicate tomato before joe alice alfred tommy jane

in the first case I want a way to set two arrays:
srcfood = ["apple ", "pear ", "plum ", "peach "]
destword = ["bill ", "bob"]
from the first line, and
srcfood = ["tomato "]
destword = ["joe ", "alice", "alfred ", "tommy ", "jane"]
from the second line.

I’ll agree this is a weird example, but I think it shows the issue.
If I apply the above pattern to the first line, I’ll see a MatchData
result where:

$~.captures ==
["copy", "apple pear plum peach ", "peach ", "after", "bill bob", "bob"]

Notice: There isn’t any element which contains a value of just "apple ",
or just "pear ", or just "plum ", even though the regex obviously had to
match each one of those.
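
The behaviour described here can be checked with a few lines. This sketch reuses the doubly-parenthesised pattern and the first data line from this post; only the variable names are new.

```ruby
pattern = /^(copy|duplicate) \s+ ((\w+\s+)+) (before|after) \s+ ((\w+\s*)+) $/x
md = pattern.match("copy apple pear plum peach after bill bob")

p md.captures

# A repeated group keeps only the text of its final repetition.
last_src  = md[3]
last_dest = md[6]
p last_src, last_dest
```

The number of captures is fixed by the number of parentheses in the pattern, not by how many times a group repeats, which is why the earlier repetitions are not recoverable from the MatchData.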

akoSS · December 15, 2005, 8:28pm

Garance A Drosehn wrote:

What is confusing us is that he describes this using a simple
[…]
followed by the word ‘before’ or ‘after’
[…]
in MatchData, not by doing a new scan. The thing is, I don’t think
[…]
each group as a whole.
[…]
destword = ["joe ", "alice", "alfred ", "tommy ", "jane"]
from the second line.

I’ll agree this is a weird example, but I think it shows the issue.
If I apply the above pattern to the first line, I’ll see a MatchData
result where:

$~.captures ==
["copy", "apple pear plum peach ", "peach ", "after", "bill bob", "bob"]

# Non-capturing groups (?:...) leave only the two outer groups as captures.
DATA.each { |line|
  line.chomp!
  md =
    /^(?:copy|duplicate) \s+
      ((?:\w+\s+)+)
      (?:after|before) \s+
      ((?:\w+\s*)+) $
    /x.match(line)
  p md.captures
  # Each capture is a whole run of words; split breaks it on whitespace.
  src_food = md.captures.first.split
  dest_word = md.captures.last.split
  p src_food, dest_word
}

__END__
copy apple pear plum peach after bill bob
duplicate tomato before joe alice alfred tommy jane

----- output: -----

["apple pear plum peach ", "bill bob"]
["apple", "pear", "plum", "peach"]
["bill", "bob"]
["tomato ", "joe alice alfred tommy jane"]
["tomato"]
["joe", "alice", "alfred", "tommy", "jane"]
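
If the goal is to avoid a second pass over the text entirely, one possible alternative is to drop the single big regexp and walk the line once with StringScanner from the standard library, collecting words as they are validated. The sketch below is only an illustration: parse_command is a hypothetical helper, not something proposed in this thread, and unlike the regexp it does not backtrack, so a source word that is itself "before" or "after" ends the source list early.

```ruby
require "strscan"

# Hypothetical helper: validates a command line and collects the repeated
# words in the same pass. Returns [src, dest], or nil if the line is invalid.
def parse_command(line)
  ss = StringScanner.new(line.strip)
  return nil unless ss.scan(/(?:copy|duplicate)\s+/)

  # Source words: everything up to the keyword 'before' or 'after'.
  src = []
  until ss.check(/(?:before|after)\b/)
    word = ss.scan(/\w+/) or return nil
    ss.scan(/\s+/) or return nil
    src << word
  end
  return nil if src.empty?
  ss.scan(/(?:before|after)\s+/) or return nil

  # Destination words: everything that remains.
  dest = []
  while (word = ss.scan(/\w+/))
    dest << word
    ss.scan(/\s*/)
  end
  return nil if dest.empty? || !ss.eos?

  [src, dest]
end

p parse_command("copy apple pear plum peach after bill bob")
p parse_command("duplicate tomato before joe alice alfred tommy jane")
p parse_command("move apple after bill")
```

This trades the compactness of one pattern for explicit control over each repetition; for lines this simple the match-then-split version above is shorter.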