Forum: Ruby regex select multiple words in the middle of a sentence

Mongeta 9. (Guest)
on 2009-04-07 13:50
hello,

Given a sentence with some words, I want only some words but not all of
them ...


REGISTRO DE LA PROPIEDAD DE ALBACETE X EMISOR: MANUELA ADORACION
CEBOLLA GARCIA Padre Romano, 12 2005 ALBACETE ALBACETE NIF: 44444444P

In this case, I'm interested in the full name:

MANUELA ADORACION CEBOLLA GARCIA

I know that every name is preceded by the pattern EMISOR:

After the full name, the address is lowercase except for the first character.


With this pattern I can find all the uppercase words: \w*[A-Z]{2}\b

But I'm only interested in the full name, so if I use: EMISOR: \w*[A-Z]{2}\b

I only get the first name, MANUELA.
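
To see why only the first name comes back, both patterns can be run against the sample line. This is a minimal sketch: it assumes the sample text sits on a single line, and the variable names (text, all_caps, anchored) are made up for illustration.

```ruby
# Sample line from the post, assumed here to be on one line.
text = "REGISTRO DE LA PROPIEDAD DE ALBACETE X EMISOR: MANUELA ADORACION " \
       "CEBOLLA GARCIA Padre Romano, 12 2005 ALBACETE ALBACETE NIF: 44444444P"

# Without the prefix, scan reports every word that ends in two capitals.
all_caps = text.scan(/\w*[A-Z]{2}\b/)

# With the prefix, the pattern describes "EMISOR: " plus exactly one word,
# and EMISOR: occurs only once, so there is a single match.
anchored = text.scan(/EMISOR: \w*[A-Z]{2}\b/)

p all_caps.first(3)  # => ["REGISTRO", "DE", "LA"]
p anchored           # => ["EMISOR: MANUELA"]
```

Nothing in the second pattern repeats the word part, so the match ends after the first word that follows the prefix.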

How can I get from there to the end of the name?

Any help?

thanks ...

r.
Robert D. (Guest)
on 2009-04-07 14:27
(Received via mailing list)
On Tue, Apr 7, 2009 at 11:50 AM, Raimon Fs 
<removed_email_address@domain.invalid> wrote:
>
> \w*[A-Z]{2}\b
>
> I only get the first name MANUELA
>
> How can I get from there to the end of the name?
>
> Any help ?


With 1.9's Oniguruma (is it available for 1.8?) it's quite easy

   scan( /EMISOR:\s*((?:[A-Z]+\s*)+)(?=[A-Z][a-z])/ ).flatten

You might want to use Unicode strings and POSIX character classes, though:

   /EMISOR:\s*((?:[:upper:].... [:upper:][:lower:])/

HTH
Robert

P.S.
If you need a 1.8 version, tell me; I will switch to 1.8 when I find some
time.
Robert D. (Guest)
on 2009-04-07 14:39
(Received via mailing list)
On Tue, Apr 7, 2009 at 12:24 PM, Robert D. 
<removed_email_address@domain.invalid>
wrote:
>   scan( /EMISOR:\s*((?:[A-Z]+\s*)+)(?=[A-Z][a-z])/ ).flatten
 Forgive YLHS (L for lazy)

    scan( /EMISOR:\s*((?:[A-Z]+\s*?)+)(?=\s*[A-Z][a-z])/ ).flatten

I was too greedy and got you trailing spaces.
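
The difference between the two versions can be checked side by side. This is a sketch assuming the sample text is on one line; the names greedy and lazy are labels chosen here, not anything from the posts.

```ruby
# Sample line from the thread, assumed to be on one line.
text = "REGISTRO DE LA PROPIEDAD DE ALBACETE X EMISOR: MANUELA ADORACION " \
       "CEBOLLA GARCIA Padre Romano, 12 2005 ALBACETE ALBACETE NIF: 44444444P"

# First version: \s* inside the group is greedy, so the space before
# "Padre" ends up inside the capture.
greedy = /EMISOR:\s*((?:[A-Z]+\s*)+)(?=[A-Z][a-z])/

# Corrected version: \s*? is lazy, and the lookahead absorbs the space.
lazy = /EMISOR:\s*((?:[A-Z]+\s*?)+)(?=\s*[A-Z][a-z])/

p text.scan(greedy).flatten  # => ["MANUELA ADORACION CEBOLLA GARCIA "]
p text.scan(lazy).flatten    # => ["MANUELA ADORACION CEBOLLA GARCIA"]
```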
R.
Matthias R. (Guest)
on 2009-04-07 15:50
(Received via mailing list)
Robert D. wrote:
> With 1.9's Oniguruma (is it available for 1.8?) it's quite easy

Yes[1].

-Matthias

[1]: http://oniguruma.rubyforge.org/
Robert D. (Guest)
on 2009-04-07 17:24
(Received via mailing list)
On Tue, Apr 7, 2009 at 1:49 PM, Matthias R.
<removed_email_address@domain.invalid> wrote:
> Robert D. wrote:
>> With 1.9's Oniguruma (is it available for 1.8?) it's quite easy
>
> Yes[1].
>
> -Matthias
>
> [1]: http://oniguruma.rubyforge.org/
>
>
Matthias, thanks a lot for the link!
One more thing for the OP: I just filed a bug report about the POSIX
character classes, so I'd discourage you from using [:lower:] and
[:upper:]. However, it seems that \p{Lower} and \p{Upper} work nicely
if you need Unicode support.
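
As a rough illustration of the \p{Upper} / \p{Lower} suggestion, here is a sketch on Ruby 1.9 or later. The sample text (a name like JOSÉ NÚÑEZ followed by an invented address) is hypothetical and not from the original data; it is written with \u escapes so the snippet does not depend on the source file encoding.

```ruby
# Hypothetical sample: JOSE NUNEZ with accents on the E, the U and the N,
# followed by a made-up address.
text = "EMISOR: JOS\u00C9 N\u00DA\u00D1EZ Calle Mayor, 3"

# ASCII-only ranges, as in the patterns earlier in the thread.
ascii_only = /EMISOR:\s*((?:[A-Z]+\s*?)+)(?=\s*[A-Z][a-z])/

# Same shape, with Unicode-aware property classes instead.
unicode = /EMISOR:\s*((?:\p{Upper}+\s*?)+)(?=\s*\p{Upper}\p{Lower})/

ascii_name   = text[ascii_only, 1]  # nil: [A-Z] cannot get past the accented E
unicode_name = text[unicode, 1]     # the full accented name

p ascii_name
p unicode_name
```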
Cheers
Robert
Mark T. (Guest)
on 2009-04-07 18:02
(Received via mailing list)
> With 1.9's Oniguruma (is it available for 1.8?) it's quite easy


This shorter one works in 1.8

  scan(/EMISOR:\s*([\w\s]+?)(?=\s*[A-Z][a-z])/).flatten


I'm curious as to what Oniguruma-specific feature you used in yours.

-- Mark.
Robert D. (Guest)
on 2009-04-07 20:49
(Received via mailing list)
On Tue, Apr 7, 2009 at 3:59 PM, Mark T. <removed_email_address@domain.invalid> 
wrote:
>> With 1.9's Oniguruma (is it available for 1.8?) it's quite easy
>
>
> This shorter one works in 1.8
>
>  scan(/EMISOR:\s*([\w\s]+?)(?=\s*[A-Z][a-z])/).flatten
>
>
> I'm curious as to what Oniguruma-specific feature you used in yours.
None, apparently ;)
I thought, however, that "(?=" was; I never realized it was already there in
1.8.
Do [:lower:] and [:upper:] work in 1.8?
R.
Mongeta 9. (Guest)
on 2009-04-07 22:24
Mark T. wrote:
>> With 1.9's Oniguruma (is it available for 1.8?) it's quite easy
>
>
> This shorter one works in 1.8
>
>   scan(/EMISOR:\s*([\w\s]+?)(?=\s*[A-Z][a-z])/).flatten
>
>
> I'm curious as to what Oniguruma-specific feature you used in yours.
>
> -- Mark.

Thanks to all. For the moment Ruby 1.8.7 is enough for me, so I'm going with
this one, which works perfectly.

Can you explain why it works?

:-)

/EMISOR:\s*([\w\s]+?)(?=\s*[A-Z][a-z])/

EMISOR:\s is clear to me, but why doesn't it appear later in the array?
Because it isn't inside ()?

The * is also clear.

([\w\s]+?) means select all uppercase words/letters?

(?=\s*[A-Z][a-z]) means until you reach a space followed by an uppercase
letter and then a lowercase one?

thanks for your help ...

regards,

r.
Mark T. (Guest)
on 2009-04-07 23:45
(Received via mailing list)
On Apr 7, 2:23 pm, Raimon Fs <removed_email_address@domain.invalid> wrote:
>
> because it hasn't () ?
>
> The * is also clear
>
> ([\w\s]+?) means select all uppercase words/letters ?

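[\w\s] is a character class that matches "word characters" (letters,
digits and underscore) or whitespace, so it is not limited to uppercase.
The + makes it one or more. The ? after it makes it non-greedy (it matches
as little as possible while still letting the rest of the pattern match).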
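
The effect of the trailing ? can be seen by removing it. This is a sketch assuming the sample text is on one line; lazy and greedy are names picked for this illustration.

```ruby
# Sample line from the thread, assumed to be on one line.
text = "REGISTRO DE LA PROPIEDAD DE ALBACETE X EMISOR: MANUELA ADORACION " \
       "CEBOLLA GARCIA Padre Romano, 12 2005 ALBACETE ALBACETE NIF: 44444444P"

lazy   = /EMISOR:\s*([\w\s]+?)(?=\s*[A-Z][a-z])/
greedy = /EMISOR:\s*([\w\s]+)(?=\s*[A-Z][a-z])/

# Lazy: stops at the first point where the lookahead succeeds.
p text[lazy, 1]    # => "MANUELA ADORACION CEBOLLA GARCIA"

# Greedy: runs as far as it can, then backs up to the last point where
# the lookahead succeeds, which is just before "Romano".
p text[greedy, 1]  # => "MANUELA ADORACION CEBOLLA GARCIA Padre "
```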

> (?=\s*[A-Z][a-z]) until you reach a space between uppercase and
> uppercase with lowercase later?

The (?= ) is a lookahead assertion. It checks for a match ahead
without consuming it. So as soon as optional whitespace followed by an
uppercase and then a lowercase letter comes next, the previous group
stops matching.
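
On the question of why EMISOR: does not show up in the array: when a pattern contains groups, String#scan returns only the captured groups rather than the whole match. The sketch below shows that, plus the fact that the lookahead text is left unconsumed. It assumes the sample text is on one line, and the variable names are illustrative.

```ruby
# Sample line from the thread, assumed to be on one line.
text = "REGISTRO DE LA PROPIEDAD DE ALBACETE X EMISOR: MANUELA ADORACION " \
       "CEBOLLA GARCIA Padre Romano, 12 2005 ALBACETE ALBACETE NIF: 44444444P"

with_group    = /EMISOR:\s*([\w\s]+?)(?=\s*[A-Z][a-z])/
without_group = /EMISOR:\s*[\w\s]+?(?=\s*[A-Z][a-z])/

# No groups: scan yields whole matches, prefix included.
p text.scan(without_group)  # => ["EMISOR: MANUELA ADORACION CEBOLLA GARCIA"]

# One group: scan yields an array of captures per match, hence .flatten.
p text.scan(with_group)     # => [["MANUELA ADORACION CEBOLLA GARCIA"]]

# The lookahead is checked but not consumed: the address is still
# sitting right after the match.
m = text.match(with_group)
p m[0]                # => "EMISOR: MANUELA ADORACION CEBOLLA GARCIA"
p m.post_match[0, 6]  # => " Padre"
```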

-- Mark.
Kyle S. (Guest)
on 2009-04-08 00:11
(Received via mailing list)
You've gotten a lot of good suggestions here, but I figured I'd toss in
my own.

Instead of scanning, you can use the "[]" operator on a string, passing
the regex and then, after a ",", the group number to pull out the saved
portion of the match.


"i like APPLES AND Bananas"[/[A-Z]+/]
=>"APPLES"

# Now let's make it bigger
"i like APPLES AND Bananas"[/[A-Z ]+ [A-Z][a-z]/]
=> " APPLES AND Ba"

# Too much stuff, so select just what we want using ()
# and use ,1 to get only the saved portion of the match
"i like APPLES AND Bananas"[/([A-Z ]+) [A-Z][a-z]/,1]
=> " APPLES AND"


# Now applying this to your situation...

regex = /EMISOR: ([A-Z ]+) [A-Z][a-z]/
# This assumes there are no \n, \r or \t chars in the middle; easy enough
# to fix if there are.

line = "REGISTRO DE LA PROPIEDAD DE ALBACETE X EMISOR: MANUELA ADORACION " \
       "CEBOLLA GARCIA Padre Romano, 12 2005 ALBACETE ALBACETE NIF: 44444444P"

line[regex, 1]
=> "MANUELA ADORACION CEBOLLA GARCIA"


--Kyle
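
If the text does contain line breaks in the middle of the name, one possible fix (a sketch, not necessarily what Kyle had in mind) is to allow any whitespace inside the group and normalise it afterwards. The names strict, relaxed, raw and name are made up for this example.

```ruby
# Same sample, but with a line break inside the name and one at the end.
line = "REGISTRO DE LA PROPIEDAD DE ALBACETE X EMISOR: MANUELA ADORACION\n" \
       "CEBOLLA GARCIA Padre Romano, 12 2005 ALBACETE ALBACETE NIF: 44444444P\n"

strict  = /EMISOR: ([A-Z ]+) [A-Z][a-z]/        # plain spaces only
relaxed = /EMISOR:\s+([A-Z\s]+)\s[A-Z][a-z]/    # any whitespace

p line[strict, 1]   # => nil, the "\n" breaks the run of [A-Z ]

raw  = line[relaxed, 1]   # still contains the "\n"
name = raw.split.join(" ")  # collapse every whitespace run to one space

p name  # => "MANUELA ADORACION CEBOLLA GARCIA"
```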
Mongeta 9. (Guest)
on 2009-04-08 01:13
thanks to all, I've learned a lot today ...

:-)

regards and many, many, many thanks !!!

r.