Regex select multiple words in the middle of a sentence

raimon · April 7, 2009, 11:50am

hello,

Given a sentence with some words, I want only some words but not all of
them …

REGISTRO DE LA PROPIEDAD DE ALBACETE X EMISOR: MANUELA ADORACION
CEBOLLA GARCIA Padre Romano, 12 2005 ALBACETE ALBACETE NIF: 44444444P

In this case, I’m interested in the full name:

MANUELA ADORACION CEBOLLA GARCIA

I know that all names are preceded by the pattern EMISOR:

And after the full name, the address is lowercase except the first char.

With this pattern I can find all the uppercase words: \w*[A-Z]{2}\b

But I’m only interested in the full name, so if I use: EMISOR:
\w*[A-Z]{2}\b

I only get the first name MANUELA

How can I get from there to the end of the name?

Any help ?

thanks …

r.

raimon · April 7, 2009, 12:27pm

On Tue, Apr 7, 2009 at 11:50 AM, Raimon Fs [email protected] wrote:

\w*[A-Z]{2}\b

I only get the first name MANUELA

How I can get from there to the end of the name ?

Any help ?

With 1.9’s Oniguruma (is it available for 1.8?) it’s quite easy

scan( /EMISOR:\s*((?:[A-Z]+\s*)+)(?=[A-Z][a-z])/ ).flatten

you might want to use Unicode strings though and POSIX Character Classes

/EMISOR:\s*((?:[:upper:]… [:upper:][:lower:])/

HTH
Robert

P.S.
If you need a 1.8 version, tell me; I will switch to 1.8 when I find
some time.

raimon · April 7, 2009, 12:39pm

On Tue, Apr 7, 2009 at 12:24 PM, Robert D. [email protected]
wrote:

scan( /EMISOR:\s*((?:[A-Z]+\s*)+)(?=[A-Z][a-z])/ ).flatten
Forgive YLHS (L for lazy)

scan( /EMISOR:\s*((?:[A-Z]+\s*?)+)(?=\s*[A-Z][a-z])/ ).flatten

I was too greedy and got you trailing spaces.
R.
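
A minimal sketch comparing the two patterns above on the sample line from the first post. The constant names are made up for illustration, and the sample is assumed to sit on a single line (the line break in the post looks like mail wrapping).

```ruby
# Sample line from the original post, joined onto one line (assumption).
SAMPLE_LINE = "REGISTRO DE LA PROPIEDAD DE ALBACETE X EMISOR: MANUELA ADORACION " \
              "CEBOLLA GARCIA Padre Romano, 12 2005 ALBACETE ALBACETE NIF: 44444444P"

# First attempt: the greedy \s* inside the group swallows the space after GARCIA.
GREEDY_RE = /EMISOR:\s*((?:[A-Z]+\s*)+)(?=[A-Z][a-z])/

# Corrected attempt: lazy \s*? inside the group, optional whitespace moved
# into the lookahead so it is checked but not captured.
LAZY_RE = /EMISOR:\s*((?:[A-Z]+\s*?)+)(?=\s*[A-Z][a-z])/

p SAMPLE_LINE.scan(GREEDY_RE).flatten  # => ["MANUELA ADORACION CEBOLLA GARCIA "]
p SAMPLE_LINE.scan(LAZY_RE).flatten    # => ["MANUELA ADORACION CEBOLLA GARCIA"]
```

The only difference in the result is the trailing space, which calling strip on the captured string would also remove.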

raimon · April 7, 2009, 1:50pm

Robert D. wrote:

With 1.9’s Oniguruma (is it available for 1.8?) it’s quite easy

Yes1.

-Matthias

raimon · April 7, 2009, 3:24pm

On Tue, Apr 7, 2009 at 1:49 PM, Matthias R.
[email protected] wrote:

Robert D. wrote:

With 1.9’s Oniguruma (is it available for 1.8?) it’s quite easy

Yes1.

-Matthias

Matthias, thanks a lot for the link!
Just one more thing for the OP: I just filed a bug report about the POSIX
character classes, so I'd discourage you from using [:lower:] and
[:upper:]. However, \p{Lower} and \p{Upper} seem to work nicely
if you need Unicode support.
Cheers
Robert
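
A rough sketch of the \p{Upper} / \p{Lower} variant, assuming Ruby 1.9 or later and UTF-8 strings. The sample line with accented capitals is hypothetical (it is not from the thread), as are the constant and method names.

```ruby
# ASCII-only pattern from earlier in the thread, for comparison.
ASCII_NAME_RE = /EMISOR:\s*((?:[A-Z]+\s*?)+)(?=\s*[A-Z][a-z])/

# Same shape, with Unicode property classes instead of [A-Z] and [a-z].
UNICODE_NAME_RE = /EMISOR:\s*((?:\p{Upper}+\s*?)+)(?=\s*\p{Upper}\p{Lower})/

# Hypothetical helper: returns the captured name, or nil if nothing matches.
def unicode_emisor_name(line, regex = UNICODE_NAME_RE)
  line[regex, 1]
end

# Hypothetical sample with accented uppercase letters.
accented = "X EMISOR: JOSÉ MARÍA NÚÑEZ PEÑA Calle Mayor, 3"

p unicode_emisor_name(accented)                 # => "JOSÉ MARÍA NÚÑEZ PEÑA"
p unicode_emisor_name(accented, ASCII_NAME_RE)  # => nil, since É is not in [A-Z]
```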

raimon · April 7, 2009, 4:02pm

With 1.9’s Oniguruma (is it available for 1.8?) it’s quite easy

This shorter one works in 1.8

scan(/EMISOR:\s*([\w\s]+?)(?=\s*[A-Z][a-z])/).flatten

I’m curious as to what Oniguruma-specific feature you used in yours.

– Mark.

raimon · April 7, 2009, 8:24pm

Mark T. wrote:

With 1.9’s Oniguruma (is it available for 1.8?) it’s quite easy

This shorter one works in 1.8

scan(/EMISOR:\s*([\w\s]+?)(?=\s*[A-Z][a-z])/).flatten

I’m curious as to what Oniguruma-specific feature you used in yours.

– Mark.

Thanks to all. For now Ruby 1.8.7 is enough for me, so I'm going with
this one, which works perfectly.

Can you explain why this works ?

/EMISOR:\s*([\w\s]+?)(?=\s*[A-Z][a-z])/

EMISOR:\s is clear to me, but why doesn't it appear later in the array?
Because it isn't inside ()?

The * is also clear

([\w\s]+?) means select all uppercase words/letters ?

(?=\s*[A-Z][a-z]) until you reach a space between uppercase and
uppercase with lowercase later?

thanks for your help …

regards,

r.

raimon · April 7, 2009, 6:49pm

On Tue, Apr 7, 2009 at 3:59 PM, Mark T. [email protected] wrote:

With 1.9’s Oniguruma (is it available for 1.8?) it’s quite easy

This shorter one works in 1.8

scan(/EMISOR:\s*([\w\s]+?)(?=\s*[A-Z][a-z])/).flatten

I’m curious as to what Oniguruma-specific feature you used in yours.
None, apparently.
I thought, however, that “(?=” was; I never realized it was already
there in 1.8.
Do [:lower:] and [:upper:] work in 1.8?
R.

raimon · April 7, 2009, 9:45pm

On Apr 7, 2:23 pm, Raimon Fs [email protected] wrote:

because it hasn’t () ?

The * is also clear

([\w\s]+?) means select all uppercase words/letters ?

[\w\s] is a character class that matches one “word character” (letter,
digit or underscore) or one whitespace character. The + makes it one or
more. The ? makes it non-greedy (match only the minimum needed for the
rest of the pattern to succeed).

(?=\s*[A-Z][a-z]) until you reach a space between uppercase and
uppercase with lowercase later?

The (?= ) is a lookahead assertion. It checks that a match follows,
without consuming or capturing it. So as soon as the text ahead is
optional whitespace followed by an uppercase and then a lowercase
letter, the preceding group stops matching.

– Mark.
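
A small sketch of the points above, using a shortened version of the sample line. The constant names are made up for illustration. It also shows why EMISOR: is missing from the scan result: when a pattern contains groups, scan returns only the groups, not the whole match.

```ruby
# Shortened version of the sample line from the first post.
SHORT_LINE = "ALBACETE X EMISOR: MANUELA ADORACION CEBOLLA GARCIA Padre Romano, 12 2005"

LAZY_NAME_RE   = /EMISOR:\s*([\w\s]+?)(?=\s*[A-Z][a-z])/  # non-greedy group
GREEDY_NAME_RE = /EMISOR:\s*([\w\s]+)(?=\s*[A-Z][a-z])/   # greedy group

m = SHORT_LINE.match(LAZY_NAME_RE)
p m[0]  # whole match: includes "EMISOR:", excludes the lookahead text
        # => "EMISOR: MANUELA ADORACION CEBOLLA GARCIA"
p m[1]  # first capture group only
        # => "MANUELA ADORACION CEBOLLA GARCIA"

# With groups in the pattern, scan yields one array of captures per match.
p SHORT_LINE.scan(LAZY_NAME_RE)  # => [["MANUELA ADORACION CEBOLLA GARCIA"]]

# Greedy version: runs to the comma, then backtracks to the LAST place
# where the lookahead holds (just before "Romano").
p SHORT_LINE[GREEDY_NAME_RE, 1]  # => "MANUELA ADORACION CEBOLLA GARCIA Padre "
```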

raimon · April 7, 2009, 11:13pm

thanks to all, I’ve learned a lot today …

regards and many, many, many thanks !!!

r.

raimon · April 7, 2009, 10:11pm

You’ve gotten a lot of good suggestions here, but I figured I’d toss in
my own.

Instead of scanning, you can use the “[]” operator on a string, with a
second argument after a comma (“,1”) to pull out the captured portion of
the regex.

"i like APPLES AND Bananas"[/[A-Z]+/]
=> "APPLES"

# now let's make it bigger
"i like APPLES AND Bananas"[/[A-Z ]+ [A-Z][a-z]/]
=> " APPLES AND Ba"

# Too much stuff, so just select what we want, using ()
# and use ,1 to get only the saved portion of the match
"i like APPLES AND Bananas"[/([A-Z ]+) [A-Z][a-z]/,1]
=> " APPLES AND"

# Now applying this to your situation...

regex = /EMISOR: ([A-Z ]+) [A-Z][a-z]/
# this assumes there are no \n, \r or \t chars in the middle; easy
# enough to fix if there are

line = "REGISTRO DE LA PROPIEDAD DE ALBACETE X EMISOR: MANUELA ADORACION " \
       "CEBOLLA GARCIA Padre Romano, 12 2005 ALBACETE ALBACETE NIF: 44444444P"

line[regex,1]
=> "MANUELA ADORACION CEBOLLA GARCIA"

–Kyle
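
One possible way to handle the \n, \r or \t case mentioned above, as a sketch rather than the only fix: allow any whitespace (\s) wherever the pattern had a literal space, then normalise the captured name. The method name extract_emisor is made up for illustration.

```ruby
# Hypothetical helper: same idea as the regex above, but tolerant of
# newlines and tabs inside the name.
def extract_emisor(text)
  raw = text[/EMISOR:\s+([A-Z\s]+)\s[A-Z][a-z]/, 1]
  # Collapse any run of whitespace into a single space; nil if no match.
  raw && raw.split.join(" ")
end

wrapped = "REGISTRO DE LA PROPIEDAD DE ALBACETE X EMISOR: MANUELA ADORACION\n" \
          "CEBOLLA GARCIA Padre Romano, 12 2005 ALBACETE ALBACETE NIF: 44444444P\n"

p extract_emisor(wrapped)            # => "MANUELA ADORACION CEBOLLA GARCIA"
p extract_emisor("no sender here")   # => nil
```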