Help with regular expression

addis_a · August 3, 2014, 9:49pm

Hi all,

I want to extract the info from <div class="text"> until </div>.

I use Rubular to construct the regular expression:

/<div class="text">(.*)</div>/

but it fails. I can successfully build the regular expressions
separately as follows:

/<div class="text">/ and /</div>/

Could anyone help me to solve this problem?

Thanks.

#########################sample file to extract#####################

a-methyldopa

  <p class="definition has-audio ">
  <span class='qDef lang-en'>Sympathoplegics<br />

Receptor activity: a2 agonist -> dec central adrenergic outflow
Use: HTN, esp with renal disease (no dec in blood flow to kidney)

alex-osu3 · August 4, 2014, 5:10am

By default, a dot won’t match newlines, so as soon as the “.” encounters
a newline, that is as far as the greedy “.*” can get. You can use the m
flag to make the dot match anything (which should have been the default):

/<div class="text">(.*)<\/div>/m

…but you should consider using Nokogiri for most of your HTML/XML
parsing.
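
As a minimal sketch of the m flag advice above: the sample HTML and the helper name extract_div_text are made up for illustration, and the non-greedy ".*?" is an addition to the pattern in this post so that the match stops at the first closing tag.

```ruby
# Hypothetical helper: returns the text between <div class="text"> and the
# first </div>, or nil when nothing matches.
def extract_div_text(html)
  # /m lets "." match newlines; ".*?" is non-greedy, so it stops at the first </div>.
  match = html.match(/<div class="text">(.*?)<\/div>/m)
  match && match[1]
end

# Made-up sample input that spans several lines.
sample = <<~HTML
  <div class="text">
  first line
  second line
  </div>
  <div class="other">ignored</div>
HTML

p extract_div_text(sample)

# Without /m the same pattern does not match, because "." stops at the newline.
p sample =~ /<div class="text">(.*?)<\/div>/
```

With the greedy ".*" and the m flag, the capture would run to the last </div> in the input, which is usually not what you want when a page contains more than one div.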

alex-osu3 · August 4, 2014, 2:26pm

Thanks, 7stud.

alex-osu3 · August 6, 2014, 11:47pm

Something like this should do it:

/x\[\d+\]='([\w; ]+)';/

You need to use "\" to escape special characters like square brackets.
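
Here is a minimal sketch of that pattern used with scan. The sample lines are the ones from the question in this thread, and the helper name x_values is hypothetical.

```ruby
# Hypothetical helper: collects the quoted value of every x[n] entry.
def x_values(text)
  # \[ and \] match literal brackets; the single group captures the quoted text.
  # scan returns one array per match, so flatten yields a flat list of values.
  text.scan(/x\[\d+\]='([\w; ]+)';/).flatten
end

sample = <<~TEXT
  x[0]='content 1';y[0]='ccontent2 ';
  x[1]='content3;content4;';y[1]='content5;content6';
  x[2]='xxx;yyy;';y[2]='xxx;xyx;';
TEXT

p x_values(sample)
```

The character class [\w; ] only allows word characters, semicolons, and spaces, so values containing punctuation or newlines will not match with this version.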

alex-osu3 · August 4, 2014, 10:48pm

Hi 7stud,

How do I extract the following from a file like this using a regular
expression:

x[0]='content 1';y[0]='ccontent2 ';
x[1]='content3;content4;';y[1]='content5;content6';
x[2]='xxx;yyy;';y[2]='xxx;xyx;';

Essentially I only need the content of x[0], x[1], …, x[n] so that I
can pick out 'content 1', 'content3;content4;', 'xxx;yyy;'.

Thanks,

alex-osu3 · August 7, 2014, 9:48pm

Thank you so much.

But what is the purpose of ?<…>? I don’t see the usage of <> in the Pickaxe.
Maybe I missed it?

alex-osu3 · August 7, 2014, 8:44pm

The Regexp I gave you before contains a matching group; all you need to
do is look at the contents of that group to get the text between the
quotes.

All I did for this was make the match multiline and match everything
rather than just word characters, spaces, and semicolons. I also made
the match non-greedy (+?) to prevent it going too far.

Just for fun I added names to the matching groups:

Here it is in Ruby:

irb(main):015:0> pp s.scan(/x\[(?<…>\d+)\]='(?<…>.+?)';/m)
[["0",
  "Use of supplemental oxygen in neonates can lead to what ocular\npathology?"],
 ["1",
  "In\nthe posterior aspect of both the right and left lungs, the ____\n(horizontal/oblique) fissure divides the superior and inferior\nlobes."]]
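
The group names in the transcript above did not survive, so the sketch below uses the hypothetical names index and text. It shows the same idea: the m flag plus a non-greedy ".+?" so that values spanning several lines are captured up to the first closing quote and semicolon. The helper name x_entries and the sample string are made up as well.

```ruby
# Hypothetical helper: maps each x index to its quoted text.
def x_entries(source)
  # The group names "index" and "text" are illustrative assumptions.
  pattern = /x\[(?<index>\d+)\]='(?<text>.+?)';/m
  # Even with named groups, scan yields plain arrays in group order.
  source.scan(pattern).map { |index, text| [index.to_i, text] }.to_h
end

source = "x[0]='first\nvalue';y[0]='skip';x[1]='second';"
p x_entries(source)
```

One caveat: because the match ends at the first "';" it sees, a value that itself contains a quote followed by a semicolon would be cut short.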

alex-osu3 · August 7, 2014, 5:05am

Thank you Joel, it works for the example I provided in the sample code.
But here is part of the actual file I have to deal with, from which I
need to extract x[0], x[1], …, x[n] and y[0], y[1], …, y[n].

When I apply the regular expression to the real file, it only matches
part of the file.

x[0]='Use of supplemental oxygen in neonates can lead to what ocular
pathology?';y[0]='Retinopathy of
prematurity';z[0]='-1';topic[0]='7278';favorite[0]='0';breadCrome[0]='Respiratory

Pathology > Neonatal respiratory distress
syndrome';cardID[0]='68196';notes[0]='
';fatag_link[0]='…/images/FAFact/2013-555-3.jpg';fatag[0]='2013-555';x[1]='In
the posterior aspect of both the right and left lungs, the ____
(horizontal/oblique) fissure divides the superior and inferior
lobes.';y[1]='Oblique';z[1]='-1';topic[1]='7254';favorite[1]='0';breadCrome[1]='Respiratory
Anatomy > Lung relations';cardID[1]='67934';notes[1]='
';fatag_link[1]='…/images/FAFact/2013-545-1.jpg';fatag[1]='2013-545';

Essentially I want to get

x[0]='Use of supplemental oxygen in neonates can lead to what ocular
pathology?';

x[1]='In the posterior aspect of both the right and left lungs, the ____
(horizontal/oblique) fissure divides the superior and inferior lobes.';

…
x[n]

Then I want to use another regular expression to get everything that is
included between the opening ' and the closing '. Is it possible to do
that?

Once again thank you very much in advance.

alex-osu3 · August 8, 2014, 12:03am

Li CN wrote in post #1154588:

Thank you so much.

But what is the purpose of ?<…>? I don’t see the usage of <> in the Pickaxe.
Maybe I missed it?

It’s just naming the capture groups so you can see which group is which.
When you use scan it doesn’t show the names, but they’ll appear with
match and a few other methods.

If you use the =~ operator and put the Regexp on the left with named
capture groups, it’ll assign the named groups to local variables:

irb(main):024:0> /(?<test>.)/ =~ 'abc'
=> 0
irb(main):025:0> test
=> "a"
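
To show how the names appear with match, here is a small sketch. The helper parse_entry and the group names name, index, value, and key are hypothetical; the input strings reuse entries from the sample file earlier in the thread.

```ruby
# Hypothetical helper: parses one "name[n]='value';" entry into a hash,
# or returns nil when the input does not match.
def parse_entry(line)
  m = line.match(/(?<name>\w+)\[(?<index>\d+)\]='(?<value>.*?)';/m)
  return nil unless m
  # MatchData allows lookup by group name as well as by position.
  { name: m[:name], index: m[:index].to_i, value: m[:value] }
end

p parse_entry("y[1]='Oblique';")

# With a regexp literal on the left of =~, named groups become local variables.
if /(?<key>\w+)\[\d+\]/ =~ "topic[0]='7278';"
  p key
end
```

The local-variable assignment only happens when the regexp is a literal on the left-hand side of =~. With the string on the left, or with a regexp stored in a variable, you need to read the names from the MatchData instead.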

Regexp is a godsend when dealing with String parsing. I recommend using
rubular.com extensively to live-test your regexes, and checking out the
documentation: Class: Regexp (Ruby 2.1.2)

alex-osu3 · August 7, 2014, 10:21pm

DATA.read.each_line do |line|
  # Print the quoted value of every line that starts with an x entry.
  p $1 if line =~ /^x.*?='(.*?)';/
end

__END__
x[0]='Use of supplemental oxygen in neonates can lead to what ocular pathology?';
y[0]='Retinopathy of prematurity';z[0]='-1';topic[0]='7278';favorite[0]='0';breadCrome[0]='Respiratory Pathology > Neonatal respiratory distress syndrome';
cardID[0]='68196';
notes[0]='';
fatag_link[0]='…/images/FAFact/2013-555-3.jpg';fatag[0]='2013-555';
x[1]='In the posterior aspect of both the right and left lungs, the ____ (horizontal/oblique) fissure divides the superior and inferior lobes.';
y[1]='Oblique';
z[1]='-1';
topic[1]='7254';
favorite[1]='0';
breadCrome[1]='Respiratory > Anatomy > Lung relations';
cardID[1]='67934';
notes[1]='';
fatag_link[1]='…/images/FAFact/2013-545-1.jpg';
fatag[1]='2013-545';

alex-osu3 · August 8, 2014, 12:13am

Thank you so much.