Forum: Ruby Regex: Up for a challenge?

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- und Ruby-related community platforms.
basi (Guest)
on 2005-12-22 01:09
(Received via mailing list)
Hello,
Just having an outstandingly hard time to even come close to being able
to translate the following string sequences in one or a series of
regular expressions. These are allowable prefix combinations in a
language I'm doing some text analysis on.

i pa(ki) pag
ma i ka pa(ki) pag
ma pa ka
ma(ki) pag ka
man
m(an) ag ka sin
m(an) ag kan(da)
m(an) ag si(pag)
m(an) ag pa

The first one allows the following choices:
i, pa, pag, ipa, ipag, ipapag, papag, ipaki, ipakipag, paki, pakipag

The syllable inside parentheses can only be selected if the syllable or
letter it is attached to is selected.

Just something to take our minds away from Eckel-ysteria.

Thanks again,
basi
Ross B. (Guest)
on 2005-12-22 01:39
(Received via mailing list)
On Wed, 21 Dec 2005 23:03:00 -0000, basi <removed_email_address@domain.invalid> 
wrote:

> man
>
Maybe these are a bit naive, but they seem to (?) work for the tests I
could come up with...

1) /\Ai?(pa(ki)?)?(pag)?\Z/
2) /\A(ma)?(i)?(ka)?(pa(ki)?)?(pag)?\Z/
3) /(ma)?(pa)?(ka)?/
4) /\A(ma(ki)?)?(pag)?(ka)?\Z/
5) /man/
6) /\A(m(an)?)?(ag)?(ka)?(sin)?\Z/
7) /\A(m(an)?)?(ag)?(kan(da)?)?\Z/
8) /\A(m(an)?)?(ag)?(si(pag)?)?\Z/
9) /\A(m(an)?)?(ag)?(pa)?\Z/
Rich M. (Guest)
on 2005-12-22 01:45
(Received via mailing list)
My first cut at the problem would be to

  *  write a generator that translates string sequences

       i pa(ki) pag

     into persistent lists of prefixes

       i, pa, pag, ipa, ipag, ...

  *  write a method that uses these prefixes,
     as is: /^i/, /^pa/, ...

As Ken Thompson said, "When in doubt, use brute force".


Once this worked, I'd check to see if it was "fast enough".
If not, I'd look into reworking the methods.  The important
issues are (a) getting something working quickly and (b)
keeping the implementation details hidden.  If speed is a
real issue, you may need to go to a fairly fancy solution,
so don't tie yourself to regexes!

-r
Ross B. (Guest)
on 2005-12-22 01:45
(Received via mailing list)
On Wed, 21 Dec 2005 23:29:19 -0000, Ross B.
<removed_email_address@domain.invalid> wrote:


> Maybe these are a bit naive, but they seem to (?) work for the tests I
> could come up with...
>

Except of course, they match the empty string :P
basi (Guest)
on 2005-12-22 05:48
(Received via mailing list)
Hi,
Speed is not an issue at this point. As you say, just get it working
for now. What would be a "fancy" solution?

Right now, my solution is bruter than brute force: Take the first 9
syllables (prefixes can pile up to syllables deep, which include
reduplicated syllables and an infix somewhere there. The data in my OP
does not indicate these, but reduplications and infixation are easy to
wedge into the solution). Match it with the prefix list. If no match,
take the first 8 syllables, and so forth. Shameless.

Thanks!
basi
basi (Guest)
on 2005-12-22 05:54
(Received via mailing list)
Hi,
Thanks for your suggestion. I'm checking it right now for under/over
generation!
basi
Mark J.Reed (Guest)
on 2005-12-22 08:04
(Received via mailing list)
"basi" <removed_email_address@domain.invalid> writes:

>Hello,
>Just having an outstandingly hard time to even come close to being able
>to translate the following string sequences in one or a series of
>regular expressions. These are allowable prefix combinations in a
>language I'm doing some text analysis on.

>i pa(ki) pag
>The first one allows the following choices:
>, pag, ipa, ipag, ipapag, papag, ipaki, ipakipag, paki, pakipag

You need to use alternatives to handle the nonempty constraint.
For instance, the basic structure for the first sequence is this:

        /^(strings with i|strings with pa(ki)|strings with pag)$/

For instance, this works:

        /^(i(pa(ki)?)?(pag)?|i?pa(ki)?(pag)?|i?(pa(ki)?)?pag)$/

But it's redundant, since several strings will match more than one of
the
alternatives.  For instance, the first alt takes care of all the strings
with
i, so there's no need for i? in the other two parts.  Similarly, the
first and
second parts together handle all the strings with pa (with or without i-
and
with or without -ki), so there's no need to include them in the third
alt.
This matches all the valid strings, and I haven't found an invalid
string
that it matches - but it's almost 1 AM, so I could be missing something.
:)

        /^(i(pa(ki)?)?(pag)?|pa(ki)?(pag)?|pag)$/

If you need to worry about the difference between lines and strings, you
should
use \A and \Z instead of ^ and $.  It may be more efficient to use
non-capturing parens (?:...) instead of the plain ones, but I think it
makes it harder to read and type.
This topic is locked and can not be replied to.