Regex question: this should be easy but doesn't work as I ex

Hi all,

-----Code------

re = [
/(one).+?(three).+?(five)/,
/(one).+?(three)?.+?(five)/,
/(one).+?(three|).+?(five)/,
/(one).+(three|).+?(five)/
]

re.each_with_index do |r, idx|
  puts idx
  p "one two three four five".scan(r)
  p "one two four five".scan(r)
end

-----Result---------

0
[["one", "three", "five"]]
[]
1
[["one", nil, "five"]]
[["one", nil, "five"]]
2
[["one", "", "five"]]
[["one", "", "five"]]
3
[["one", "", "five"]]
[["one", "", "five"]]


All regexes failed my expectation.

What I want is

"one two three four five" #=> [["one", "three", "five"]]
"one two four five" #=> [["one", nil, "five"]]

In short, in the string, “three” might or might not exist.
What regex can match for both?

Thanks.

Sam

Sam K. wrote:

[]

What regex can match for both?

Thanks.

Sam

/(one) two (?:(three) )?four (five)/

Sam K. wrote:

What I want is

"one two three four five" #=> [["one", "three", "five"]]
"one two four five" #=> [["one", nil, "five"]]

In short, in the string, “three” might or might not exist.
What regex can match for both?

/one|three|five/ can. Although, its result is not exactly in the form
you want:

"one two three four five".scan(/one|three|five/)
=> ["one", "three", "five"]

"one two four five".scan(/one|three|five/)
=> ["one", "five"]

Sam K. wrote:

The image is sometimes missing.

In the example, let’s assume that “two” and “four” are arbitrary text.
So the text might be “…one…three…five” where “…” means some
arbitrary text.
If “three” is missing, it will be “…one…five…”.

Can you reconsider the problem please?

Sam

This is prolix, but it works:

a,b,c = 'one', 'three', 'five'

[
  "one two three four five",
  "one two four five"
].each{|s|
  if s =~ /#{a} (.+ )?#{c}/
    if s =~ / #{b} /
      p [a,b,c]
    else
      p [a,nil,c]
    end
  end
}

Hi William,

William J. wrote:

/(one) two (?:(three) )?four (five)/

I simplified the actual problem.
I guess the simplification did not represent my problem well.

I was parsing html source into price, image, description, etc.
The image is sometimes missing.

In the example, let’s assume that “two” and “four” are arbitrary text.
So the text might be “…one…three…five” where “…” means some
arbitrary text.
If “three” is missing, it will be “…one…five…”.

Can you reconsider the problem please?

Sam

On 12/20/06, Sam K. wrote:
[snip]

"one two three four five" #=> [["one", "three", "five"]]
"one two four five" #=> [["one", nil, "five"]]
[snip]

how about?

irb(main):001:0> r = /(one) (?: (.*?three) | ((?:.(?!>three))*) ) *? (five)/x
=> /(one) (?: (.*?three) | ((?:.(?!>three))*) ) *? (five)/x
irb(main):002:0> "one two three four five".scan(r)
=> [["one", " two three", " four ", "five"]]
irb(main):003:0> "one two four five".scan(r)
=> [["one", nil, " two four ", "five"]]

[Sam K., 2006-12-20 20.10 CET]

What I want is

"one two three four five" #=> [["one", "three", "five"]]
"one two four five" #=> [["one", nil, "five"]]

In short, in the string, “three” might or might not exist.
What regex can match for both?

Hi. The problem is that you can very easily NOT match “three” even if it’s
there. I mean, if you have

/1.*3?.*5/ =~ '12345'

the engine can succeed by matching the 1 at the beginning and the 5 at the
end, trying to match the 3 at a position where it isn’t (say, where the 4
is) and failing; but since the 3 is optional, the overall match still
succeeds.
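
To make the effect visible, here is a small sketch that adds capture groups to the pattern above. The groups and the variable names are additions for illustration only; they are not part of the one-liner as posted.

```ruby
# Illustration only: same shape as /1.*3?.*5/, with capture groups added
# so we can see what the optional "3" ends up holding.
greedy = /(1).*(3)?.*(5)/
lazy   = /(1).*?(3)?.*?(5)/

# In both cases the overall match succeeds, but nothing forces the engine
# to go looking for the "3", so the optional group's capture stays nil.
p greedy.match('12345').captures
p lazy.match('12345').captures
```

Both lines print ["1", nil, "5"], even though the string does contain a 3.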

I think you should try it in two steps: first, try to match with the 3; if
that fails, without the “3”. Something like:

/(?:(1).*(3)|(1)).*(5)/

(The ‘1’ will come in either the first or the third array position; you’ll
have to take care of that.)

Maybe there is a simpler solution, but it doesn’t come to my mind.

Good luck.
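
Applied to the word example from the original question, the two-branch idea could look roughly like this. The method name `extract` and the `/m` flag (so that `.` also matches newlines) are assumptions made for the example; treat it as a sketch, not a drop-in solution.

```ruby
# Hypothetical helper: returns [one, three_or_nil, five], or nil if no match.
def extract(text)
  # Branch 1 requires "three"; branch 2 is the fallback without it.
  m = /(?:(one).*?(three)|(one)).*?(five)/m.match(text)
  return nil unless m

  with_three, three, without_three, five = m.captures
  # Only one of the two "one" groups is set, depending on the branch taken.
  [with_three || without_three, three, five]
end

p extract("one two three four five")
p extract("one two four five")
```

This prints ["one", "three", "five"] and then ["one", nil, "five"]. One caveat: branch 1 searches as far as it needs to for a "three", so if the input holds several records it can run past the current record's "five".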

Hi Carlos,

Carlos wrote:

/(?:(1).*(3)|(1)).*(5)/

(The ‘1’ will come either on the first or third array position, you’ll have
to take care of that.)

Yes, you understand exactly what my problem is.
Actually, I had guessed the cause was what you describe, even though I
couldn’t explain it as well as you did.
The solution I found was to use 2 regexes.
First, I try to find a match assuming “three” is there.
If that fails, I try to find a match without “three”.
This solved my problem.
But I wanted to know whether there’s a one-shot solution.
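
One possible one-shot form, offered as a sketch on the simplified word example and not tested against the real HTML: put the lazy filler inside the optional group, so the engine has to look for "three" before it is allowed to give up on it. The variable names and the `/m` flag are assumptions made for the example.

```ruby
# Sketch: the lazy prefix lives inside the optional group, so the engine
# searches ahead for "three" before it is allowed to skip the group.
one_shot = /(one)(?:.*?(three))?.*?(five)/m

p "one two three four five".scan(one_shot)  # "three" present
p "one two four five".scan(one_shot)        # "three" absent, middle capture is nil

# Variant for several records in one string: the (?!five) guard stops the
# search for "three" from running past the current record's "five".
guarded = /(one)(?:(?:(?!five).)*?(three))?.*?(five)/m

records = "one two four five / one two three four five"
p records.scan(guarded)
```

The first two calls print [["one", "three", "five"]] and [["one", nil, "five"]]. The last one prints [["one", nil, "five"], ["one", "three", "five"]]; without the guard, the first record would borrow the "three" from the second.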

This is the actual problem, just in case someone wants to know.

html = <<END

2004 Used
        <a name="210819526" href="210819526.html">BMW 325Ci
        Coupe</a><br />

    </h5></td>


<td class="mileage">

    <span class="body20">38,604<br /></span><span

class="body30">Mileage

                $24,995



        <br />
    </span>


        <span class="body30">Price</span>
    </td>

<td class="distanceFromZip">
    <div class="zip">
        <span class="body20">0 mi<br /></span><span

class="body30">from ZIP

    <td class="productTileCell" rowspan="2" valign="top">

        <div class="srlProductContainer">




        </div>


    </td>
            <a href=210819526.html><img

src="http://images.autotrader.com/images/2006/10/16/210/819/1092478286.210819526.IM1.MAIN.60x45_A.60x45.jpg"
border="0" bordercolor="#000000" width="60" height="45">

            <div class="body40" style="padding-bottom:3px">
                    <img

src="http://www.autotrader.com/img/fyc/icn_camera_17x17.gif"
alt="Actual Photo Available" width="17" height="17" border="0"
/> 9 Photos


    <img src="http://www.autotrader.com/img/blank_dot.gif"

width="60" height="1" />

                <p class="color body20">Color - Mystic Blue

Metallic

            <p class="description">Dark Blue/Beige, Premium Pkg,

Xenon Light, Single Compact Disc, Dual Power Seats, Memory Seat, Still
under Free BMW Maintenance and 4yr/50k Factory…

                <p class="vin">VIN WBABV13454JT20104</p>





            <div class="body40" style="padding-top:5px;"><a

name="210819526" href="210819526.html">View Car Details

    </div></td>
<td>&nbsp;</td>
<td valign="top" class="right body30">




    <p class="dealername">



            <a name="210819526" href="210819526.html">null</a>
            <br />


    </p>








    <br />

</td>
END

def parse_row row
  # First attempt: require the image URL.
  m = row.scan(/.+?
(\d{4}) Used<\/h5>.+?
.+?<a name="\d+"
href="(\d+\.html)">(.+?)<\/a><br \/>.+?<\/h5>.+?<span
class="body20">([0-9,]+)<br \/><\/span><span
class="body30">Mileage<\/span>.+?(\$[0-9,]+).+?(http:\/\/[^"]+?\.jpg).+?Color
- (.+?)<\/p>/m)
  if m[0].nil?
    # Fallback: the image may be missing, so its group is optional.
    m = row.scan(/.+?
(\d{4}) Used<\/h5>.+?
.+?<a name="\d+"
href="(\d+\.html)">(.+?)<\/a><br \/>.+?<\/h5>.+?<span
class="body20">([0-9,]+)<br \/><\/span><span
class="body30">Mileage<\/span>.+?(\$[0-9,]+).+?(http:\/\/[^"]+?\.jpg)?.+?Color
- (.+?)<\/p>/m)
  end
  m[0]
end

p parse_row(html)

Sorry about the messy code.

Thanks.

Sam