Matthias, thanks a lot for the link!
Just another thing for the OP: I just filed a bug report about the POSIX
character classes, so I'd discourage you from using [:lower:] and
[:upper:]. However, it seems that \p{Lower} and \p{Upper} work nicely
if you need Unicode support.
Cheers
Robert
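For the OP, here is a minimal sketch of the \p{Lower} / \p{Upper} suggestion, assuming Ruby 1.9 or later. The helper names first_upper and all_lower and the sample word are made up for illustration. The accented letter is written as a \u escape so the snippet does not depend on the encoding of the source file.

```ruby
# Hypothetical helpers, only for illustration.

# First uppercase letter in str, using the Unicode-aware property class.
def first_upper(str)
  str[/\p{Upper}/]
end

# All lowercase letters in str, joined back into one string.
def all_lower(str)
  str.scan(/\p{Lower}/).join
end

word = "\u00C9cole"          # "Ecole" with an acute accent on the E

p first_upper(word)          # the accented capital, "\u00C9"
p all_lower(word)            # => "cole"
p word[/[A-Z]/]              # => nil, the ASCII range misses the accented capital
```

The last line shows why a plain [A-Z] is not enough once the text contains accented capitals.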
I’m curious as to what Oniguruma-specific feature you used in yours.
None, apparently
I thought, however, that “(?=” was; I never realized it was already there in
1.8.
Do [:lower:] and [:upper:] work in 1.8?
R.
([\w\s]+?) means select all uppercase words/letters?
[\w\s] is a character class that matches “word characters” or spaces.
The + makes it one or more. The ? means make it non-greedy (only match
the minimum to make it true).
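A small sketch of the greedy versus non-greedy difference described above. The helper names and the sample text are hypothetical.

```ruby
# Hypothetical helpers, only for illustration.

# Greedy: + takes as many word/space characters as it can.
def greedy_match(str)
  str[/[\w\s]+/]
end

# Non-greedy: +? takes as few as it can, which is one character
# when nothing else in the pattern forces it to take more.
def lazy_match(str)
  str[/[\w\s]+?/]
end

sample = "one two"

p greedy_match(sample)   # => "one two"
p lazy_match(sample)     # => "o"
```

On its own the non-greedy version looks useless; it only grows when the rest of the pattern cannot match yet.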
(?=\s*[A-Z][a-z]) until you reach a space between uppercase and
uppercase with lowercase later?
The (?= ) is a lookahead assertion. It checks that a match follows,
without consuming or capturing it. So as soon as what comes next is
optional whitespace followed by an uppercase and then a lowercase
letter, the previous match will stop matching.
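To see the two pieces working together, here is a sketch that simply glues the quoted fragments ([\w\s]+?) and (?=\s*[A-Z][a-z]) into one pattern. That combination is my assumption, since the full original regex is not quoted here, and the helper name and sample text are hypothetical.

```ruby
# Assumed combination of the two fragments discussed above.
NAME_BEFORE_CAPITALIZED = /([\w\s]+?)(?=\s*[A-Z][a-z])/

# Hypothetical helper: returns the text captured before the first
# capitalized word (uppercase letter followed by a lowercase one).
def words_before_capitalized(str)
  str[NAME_BEFORE_CAPITALIZED, 1]
end

text = "MANUELA ADORACION CEBOLLA GARCIA Padre Romano"

p words_before_capitalized(text)
# => "MANUELA ADORACION CEBOLLA GARCIA"

# The lookahead only checks; it does not consume anything,
# so the text it looked at is still there after the match.
p text.match(NAME_BEFORE_CAPITALIZED).post_match
# => " Padre Romano"
```

The non-greedy group keeps growing one character at a time until the lookahead finally succeeds, which is why it stops right before the space in front of "Padre".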
You’ve gotten a lot of good suggestions here, but I figured I’d toss in
my own.
Instead of scanning, you can use the “[]” operator on a string, and
“,1” to pull out the saved portion of the regex.
"i like APPLES AND Bananas"[/[A-Z]+/]
=> "APPLES"
# Now let's make it bigger
"i like APPLES AND Bananas"[/[A-Z ]+ [A-Z][a-z]/]
=> " APPLES AND Ba"
# Too much stuff, so just select what we want, using ()
# and use ,1 to get just the saved portion of the match
"i like APPLES AND Bananas"[/([A-Z ]+) [A-Z][a-z]/,1]
=> " APPLES AND"
# Now applying this to your situation...
# This assumes there are no \n, \r or \t chars in the middle; easy enough to fix if there are
regex = /EMISOR: ([A-Z ]+) [A-Z][a-z]/
line = "REGISTRO DE LA PROPIEDAD DE ALBACETE X EMISOR: MANUELA ADORACION CEBOLLA GARCIA Padre Romano, 12 2005 ALBACETE ALBACETE NIF: 44444444P"