Regex: Up for a challenge?


#1

Hello,
Just having an outstandingly hard time to even come close to being able
to translate the following string sequences in one or a series of
regular expressions. These are allowable prefix combinations in a
language I’m doing some text analysis on.

i pa(ki) pag
ma i ka pa(ki) pag
ma pa ka
ma(ki) pag ka
man
m(an) ag ka sin
m(an) ag kan(da)
m(an) ag si(pag)
m(an) ag pa

The first one allows the following choices:
i, pa, pag, ipa, ipag, ipapag, papag, ipaki, ipakipag, paki, pakipag

The syllable inside parentheses can only be selected if the syllable or
letter it is attached to is selected.

Just something to take our minds away from Eckel-ysteria.

Thanks again,
basi


#2

My first cut at the problem would be to

  • write a generator that translates string sequences

    i pa(ki) pag

    into persistent lists of prefixes

    i, pa, pag, ipa, ipag, …

  • write a method that uses these prefixes,
    as is: /^i/, /^pa/, …

As Ken Thompson said, “When in doubt, use brute force”.

Once this worked, I’d check to see if it was “fast enough”.
If not, I’d look into reworking the methods. The important
issues are (a) getting something working quickly and (b)
keeping the implementation details hidden. If speed is a
real issue, you may need to go to a fairly fancy solution,
so don’t tie yourself to regexes!

-r


#3

On Wed, 21 Dec 2005 23:03:00 -0000, basi removed_email_address@domain.invalid wrote:

man

Maybe these are a bit naive, but they seem to (?) work for the tests I
could come up with…

  1. /\Ai?(pa(ki)?)?(pag)?\Z/
  2. /\A(ma)?(i)?(ka)?(pa(ki)?)?(pag)?\Z/
  3. /(ma)?(pa)?(ka)?/
  4. /\A(ma(ki)?)?(pag)?(ka)?\Z/
  5. /man/
  6. /\A(m(an)?)?(ag)?(ka)?(sin)?\Z/
  7. /\A(m(an)?)?(ag)?(kan(da)?)?\Z/
  8. /\A(m(an)?)?(ag)?(si(pag)?)?\Z/
  9. /\A(m(an)?)?(ag)?(pa)?\Z/

#4

On Wed, 21 Dec 2005 23:29:19 -0000, Ross B.
removed_email_address@domain.invalid wrote:

Maybe these are a bit naive, but they seem to (?) work for the tests I
could come up with…

Except of course, they match the empty string :stuck_out_tongue:


#5

Hi,
Speed is not an issue at this point. As you say, just get it working
for now. What would be a “fancy” solution?

Right now, my solution is bruter than brute force: Take the first 9
syllables (prefixes can pile up to syllables deep, which include
reduplicated syllables and an infix somewhere there. The data in my OP
does not indicate these, but reduplications and infixation are easy to
wedge into the solution). Match it with the prefix list. If no match,
take the first 8 syllables, and so forth. Shameless.

Thanks!
basi


#6

Hi,
Thanks for your suggestion. I’m checking it right now for under/over
generation!
basi


#7

“basi” removed_email_address@domain.invalid writes:

Hello,
Just having an outstandingly hard time to even come close to being able
to translate the following string sequences in one or a series of
regular expressions. These are allowable prefix combinations in a
language I’m doing some text analysis on.

i pa(ki) pag
The first one allows the following choices:
, pag, ipa, ipag, ipapag, papag, ipaki, ipakipag, paki, pakipag

You need to use alternatives to handle the nonempty constraint.
For instance, the basic structure for the first sequence is this:

    /^(strings with i|strings with pa(ki)|strings with pag)$/

For instance, this works:

    /^(i(pa(ki)?)?(pag)?|i?pa(ki)?(pag)?|i?(pa(ki)?)?pag)$/

But it’s redundant, since several strings will match more than one of
the
alternatives. For instance, the first alt takes care of all the strings
with
i, so there’s no need for i? in the other two parts. Similarly, the
first and
second parts together handle all the strings with pa (with or without i-
and
with or without -ki), so there’s no need to include them in the third
alt.
This matches all the valid strings, and I haven’t found an invalid
string
that it matches - but it’s almost 1 AM, so I could be missing something.
:slight_smile:

    /^(i(pa(ki)?)?(pag)?|pa(ki)?(pag)?|pag)$/

If you need to worry about the difference between lines and strings, you
should
use \A and \Z instead of ^ and $. It may be more efficient to use
non-capturing parens (?:…) instead of the plain ones, but I think it
makes it harder to read and type.