Forum: Ruby help with regular expression

Li CN (alex-osu3)
on 2014-08-03 21:49
Hi all,

I want to extract everything between
<div class="text"> and </div>.

I use Rubular to construct the regular expression:
/<div class="text">(.*)<\/div>/

but it fails to match. Each part matches when I test it on its own:

/<div class="text">/

and

/<\/div>/

Could anyone help me to solve this problem?

Thanks.

#########################sample file to extract#####################
<div class="text">

  <h3 class="word has-audio ">
    <span class='qWord lang-en'>a-methyldopa</span>  </h3>

      <p class="definition has-audio ">
      <span class='qDef lang-en'>Sympathoplegics<br />
<br />
<b>Receptor activity</b>: a2 agonist -&gt; dec central adrenergic
outflow<br />
<b>Use</b>: HTN, esp with renal disease (no dec in blood flow to
kidney)</span>    </p>

</div>
7stud -- (7stud)
on 2014-08-04 05:10
By default, a dot won't match a newline, so as soon as the "." encounters
a newline, that is as greedy as the ".*" can get.  You can use the m
flag to make the dot match anything (which should have been the default):

    /<div class="text">(.*)<\/div>/m


...but, you should consider using Nokogiri for most of your html/xml
parsing.
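
A minimal sketch of the difference the m flag makes, using a shortened stand-in for the sample file (the HTML constant and the extract_div helper are made up for illustration). It also uses a non-greedy ".*?" so the match stops at the first </div>, which matters as soon as the page contains more than one div:

```ruby
# Shortened stand-in for the sample file (hypothetical data).
HTML = <<~EOS
  <div class="text">
    <h3 class="word">a-methyldopa</h3>
  </div>
  <div class="other">ignored</div>
EOS

# Returns the text between <div class="text"> and a closing </div>, or nil.
# (extract_div is a hypothetical helper, not part of any library.)
def extract_div(html, regex)
  m = html.match(regex)
  m && m[1]
end

# Without /m the dot cannot cross the first newline, so nothing matches.
p extract_div(HTML, /<div class="text">(.*)<\/div>/)      # => nil

# With /m and a non-greedy quantifier, only the first div body is captured.
p extract_div(HTML, /<div class="text">(.*?)<\/div>/m)
```

With a greedy ".*" and the m flag, the capture would run on to the last </div> in the input, so the non-greedy form is usually the safer choice here.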
Li CN (alex-osu3)
on 2014-08-04 14:26
Thanks, 7stud.
Li CN (alex-osu3)
on 2014-08-04 22:48
Hi 7stud,

How do I extract the following in a file like this using regular
expression:

x[0]='content 1';y[0]='ccontent2 ';
x[1]='content3;content4;';y[1]='content5;content6';
x[2]='xxx;yyy;';y[2]='xxx;xyx;';

Essentially I only need the content of x[0], x[1],..., x[n] so that I
can pick out 'content 1' ,'content3;content4;', 'xxx;yyy;'


Thanks,
Joel Pearson (virtuoso)
on 2014-08-06 23:47
Something like this should do it:

/x\[\d+\]='([\w; ]+)';/

You need to use "\" to escape special characters like square brackets.
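
A small sketch of how that pattern could be applied with String#scan to the three sample lines from the question. The constant names SAMPLE and X_VALUES are chosen here for illustration only:

```ruby
# The three sample lines from the question.
SAMPLE = <<~'EOS'
  x[0]='content 1';y[0]='ccontent2 ';
  x[1]='content3;content4;';y[1]='content5;content6';
  x[2]='xxx;yyy;';y[2]='xxx;xyx;';
EOS

# scan returns one array per match, holding the capture groups;
# flatten turns [["a"], ["b"]] into ["a", "b"].
X_VALUES = SAMPLE.scan(/x\[\d+\]='([\w; ]+)';/).flatten

p X_VALUES
# => ["content 1", "content3;content4;", "xxx;yyy;"]
```

The character class [\w; ] only covers word characters, semicolons, and spaces, so values containing other punctuation would need a broader pattern.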
Li CN (alex-osu3)
on 2014-08-07 05:05
Thank you Joel, it works for the example I provided in the sample.
But here is part of the actual file I have to deal with, from which I
need to extract x[0], x[1], ..., x[n] and y[0], y[1], ..., y[n].

When I apply the regular expression to the real file, it matches only
part of the file.



x[0]='Use of supplemental oxygen in neonates can lead to what ocular
pathology?';y[0]='Retinopathy of
prematurity';z[0]='-1';topic[0]='7278';favorite[0]='0';breadCrome[0]='Respiratory
> Pathology > Neonatal respiratory distress
syndrome';cardID[0]='68196';notes[0]='
';fatag_link[0]='../images/FAFact/2013-555-3.jpg';fatag[0]='2013-555';x[1]='In
the posterior aspect of both the right and left lungs, the ____
(horizontal/oblique) fissure divides the superior and inferior
lobes.';y[1]='Oblique';z[1]='-1';topic[1]='7254';favorite[1]='0';breadCrome[1]='Respiratory
> Anatomy > Lung relations';cardID[1]='67934';notes[1]='
';fatag_link[1]='../images/FAFact/2013-545-1.jpg';fatag[1]='2013-545';

Essentially  I want to get

x[0]='Use of supplemental oxygen in neonates can lead to what ocular
pathology?';

x[1]='In the posterior aspect of both the right and left lungs, the ____
(horizontal/oblique) fissure divides the superior and inferior lobes.';


...
x[n]

Then I want to use another regular expression to get the text enclosed
in the single quotes, i.e. everything between ' and '. Is it possible
to do that?

Once again thank you very much in advance.
Joel Pearson (virtuoso)
on 2014-08-07 20:44
The Regexp I gave you before contains a capture group. All you need to
do is look at the contents of that group to get the text between the
quotes.

All I did for this was make the match multiline and match everything
rather than just word characters, spaces, and semicolons. I also made
the match non-greedy (+?) to prevent it going too far.

Just for fun I added names to the matching groups:
http://www.rubular.com/r/wnEWL4JjWR

Here it is in Ruby:

irb(main):015:0> pp s.scan(/x\[(?<x>\d+)\]='(?<text>.+?)';/m)
[["0",
  "Use of supplemental oxygen in neonates can lead to what ocular\npathology?"],
 ["1",
  "In\nthe posterior aspect of both the right and left lungs, the ____\n(horizontal/oblique) fissure divides the superior and inferior\nlobes."]]
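
Since the question also asked for y[0], y[1], ..., y[n], here is a rough sketch of how the same idea could collect both fields per index. The data is a shortened, partly paraphrased stand-in for the real file, and DATA_TEXT, CARD_REGEX, and parse_cards are names made up for this example:

```ruby
# Shortened stand-in for the real file, with values wrapped across lines.
DATA_TEXT = <<~'EOS'
  x[0]='Use of supplemental oxygen in neonates can lead to what ocular
  pathology?';y[0]='Retinopathy of
  prematurity';z[0]='-1';topic[0]='7278';x[1]='Which fissure divides the
  superior and inferior lobes?';y[1]='Oblique';z[1]='-1';
EOS

# \b keeps the pattern from matching names that merely end in x or y.
CARD_REGEX = /\b(?<key>[xy])\[(?<index>\d+)\]='(?<text>.*?)';/m

# Builds { 0 => { "x" => "...", "y" => "..." }, ... } from the raw text.
def parse_cards(text)
  cards = Hash.new { |hash, index| hash[index] = {} }
  text.scan(CARD_REGEX) do |key, index, value|
    # Join values that were wrapped across lines into a single line.
    cards[index.to_i][key] = value.gsub(/\s*\n\s*/, " ")
  end
  cards
end

parse_cards(DATA_TEXT).each do |index, card|
  puts "#{index}: #{card['x']} => #{card['y']}"
end
```

This assumes no value contains the sequence "';" itself; if one did, the non-greedy match would stop too early.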
Li CN (alex-osu3)
on 2014-08-07 21:48
Thank you so much.

But what is the purpose of <text>? I don't see the usage of <> in the
Pickaxe. Maybe I missed it?
Regis d'Aubarede (raubarede)
on 2014-08-07 22:21
# Read the data after __END__ line by line and print the quoted value
# of every line that starts with x. This only works when each x[n]
# assignment sits on a single line.
DATA.each_line do |line|
  p $1 if line =~ /^x.*?='(.*?)';/
end

__END__
x[0]='Use of supplemental oxygen in neonates can lead to what ocular pathology?';
y[0]='Retinopathy of prematurity';z[0]='-1';topic[0]='7278';favorite[0]='0';breadCrome[0]='Respiratory > Pathology > Neonatal respiratory distress syndrome';
cardID[0]='68196';
notes[0]='';
fatag_link[0]='../images/FAFact/2013-555-3.jpg';fatag[0]='2013-555';
x[1]='In the posterior aspect of both the right and left lungs, the (horizontal/oblique) fissure divides the superior and inferior lobes.';
y[1]='Oblique';
z[1]='-1';
topic[1]='7254';
favorite[1]='0';
breadCrome[1]='Respiratory > Anatomy > Lung relations';
cardID[1]='67934';
notes[1]='';
fatag_link[1]='../images/FAFact/2013-545-1.jpg';
fatag[1]='2013-545';
Joel Pearson (virtuoso)
on 2014-08-08 00:03
Li CN wrote in post #1154588:
> Thank you so much.
>
> But what is the purpose of <text>? I don't see the usage of <> in the
> Pickaxe. Maybe I missed it?

It's just naming the capture groups so you can see which group is which.
When you use scan it doesn't show the names, but they'll appear with
match and a few other methods.

If you use the =~ operator and put a Regexp literal with named capture
groups on the left, it'll assign the named groups to local variables:

irb(main):024:0> /(?<test>.)/ =~ 'abc'
=> 0
irb(main):025:0> test
=> "a"

Regexp is a godsend when dealing with String parsing. I recommend using
rubular.com extensively to live-test your regexes, and check out the
documentation: http://www.ruby-doc.org/core-2.1.2/Regexp.html
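
As a small sketch of how the group names show up with match: the input string LINE is invented for this example, and named_captures assumes Ruby 2.4 or later, which is newer than the 2.1 documentation linked above.

```ruby
# Hypothetical one-line input in the same shape as the real data.
LINE = "x[12]='Oblique';"

# match returns a MatchData object (or nil when nothing matches).
MATCH = LINE.match(/x\[(?<index>\d+)\]='(?<text>.+?)';/m)

p MATCH[:index]          # => "12"
p MATCH[:text]           # => "Oblique"
p MATCH.names            # => ["index", "text"]

# named_captures (Ruby 2.4+) returns a Hash of group name to matched text.
p MATCH.named_captures
```

Looking groups up by name keeps the code readable if the pattern later gains or loses a group, since the numeric positions would shift but the names would not.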
Li CN (alex-osu3)
on 2014-08-08 00:13
Thank you so much.