Problem with reg exp

Hi,

I need to match

str1 = "ABC12A2012"
str2 = "ABC13B2013KEYWORD"

I have to extract:
word1 = ABC, word2 = 12A2012 for str1
word1 = ABC, word2 = 13B2013 for str2

I have used
str.split(/([a-zA-Z]+)(.*)(KEYWORD?)/)

For str2 I get these substrings:
"", "ABC", "13B2013", "KEYWORD". Why did the empty substring pop up?

For str1 I get only one substring, which is the entire string.

If anyone can spot the problem, please let me know.

I have a couple of comments/questions:

  1. how are you actually getting those substrings? I’m pretty sure you
    want to use a match, like =~ (or equivalent) rather than String#split
  2. I suspect that (KEYWORD?) doesn’t do what you expect. It matches
    either “KEYWORD” or “KEYWOR”.
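
To make the second point concrete, here is a small sketch comparing the two quantifier scopes. The variable names are made up for illustration, and the \A and \z anchors are added only so that each pattern has to cover the whole test string.

```ruby
# KEYWORD? applies the ? to the final "D" only.
d_optional = /\AKEYWORD?\z/
# (?:KEYWORD)? applies the ? to the whole word.
word_optional = /\A(?:KEYWORD)?\z/

# For each test string, record whether each pattern matches it.
results = ["KEYWORD", "KEYWOR", ""].map do |s|
  [s, !(s =~ d_optional).nil?, !(s =~ word_optional).nil?]
end

results.each { |row| p row }
# ["KEYWORD", true, true]
# ["KEYWOR", true, false]
# ["", false, true]
```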


Matthew K., B.Sc (CompSci) (Hons)
http://matthew.kerwin.net.au/
ABN: 59-013-727-651

“You’ll never find a programming language that frees
you from the burden of clarifying your ideas.” - xkcd

Sorry, I realise I should have probably given some code to exemplify
what’s in my head:

str1 = "ABC12A2012"
if str1 =~ /^([a-zA-Z]+)(.*?)(?:KEYWORD)?$/
  p $1, $2 # => "ABC", "12A2012"
end

str2 = "ABC13B2013KEYWORD"
if str2 =~ /^([a-zA-Z]+)(.*?)(?:KEYWORD)?$/
  p $1, $2 # => "ABC", "13B2013"
end

To get it to work I:

  • made a non-capturing group, using the (?: … ) syntax, so the
    entire string ‘KEYWORD’ can be marked as optional.
  • made the middle anything matcher (.*) non-greedy (so it doesn’t
    automatically also capture a trailing KEYWORD) by appending a question
    mark
  • tacked a $ end-of-line marker after the optional KEYWORD thing, to
    ensure that the non-greedy anything matcher actually consumes the
    rest of the string; otherwise it is quite happy to match ""
  • also put a ^ start-of-line marker, since my if-statement is
    simultaneously validating the string as well as extracting the match.
    For these two inputs it makes no difference, so you can leave it out.
    (In Ruby, ^ and $ anchor at line boundaries; use \A and \z if you
    need to anchor the whole string.)
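
The effect of the non-greedy quantifier and the trailing $ can be checked one change at a time. This is only a sketch: the variable names are invented, and it uses String#[] with a regexp and a group index to pull out the second group directly.

```ruby
str2 = "ABC13B2013KEYWORD"

# Greedy (.*) swallows the trailing KEYWORD, and the optional group matches nothing.
greedy = str2[/^([a-zA-Z]+)(.*)(?:KEYWORD)?$/, 2]

# Non-greedy (.*?) with $ stops just before KEYWORD.
lazy = str2[/^([a-zA-Z]+)(.*?)(?:KEYWORD)?$/, 2]

# Non-greedy (.*?) without $ is satisfied by matching the empty string.
unanchored = str2[/^([a-zA-Z]+)(.*?)(?:KEYWORD)?/, 2]

p greedy      # => "13B2013KEYWORD"
p lazy        # => "13B2013"
p unanchored  # => ""
```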

You could also wrap the non-capturing keyword group in capturing
parens if you want to extract the keyword, thus:

str1 = "ABC12A2012"
if str1 =~ /^([a-zA-Z]+)(.*?)((?:KEYWORD)?)$/
  p $1, $2, $3 # => "ABC", "12A2012", ""
end

str2 = "ABC13B2013KEYWORD"
if str2 =~ /^([a-zA-Z]+)(.*?)((?:KEYWORD)?)$/
  p $1, $2, $3 # => "ABC", "13B2013", "KEYWORD"
end

If you don’t like using the if-statement structure, you can also use
String#match , which returns a MatchData object. You can access the
groups using array syntax:

str1 = "ABC12A2012"
md = str1.match(/^([a-zA-Z]+)(.*?)((?:KEYWORD)?)$/)
p md[1] # => "ABC"
p md[2] # => "12A2012"

etc.
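
If the numeric indexes get hard to follow, named groups are another option. The following is a sketch only: split_words, PATTERN and the group names are hypothetical, while the pattern itself is the one used above with names added to the groups.

```ruby
# Same pattern as above, with a name attached to each capturing group.
PATTERN = /^(?<word1>[a-zA-Z]+)(?<word2>.*?)(?<keyword>(?:KEYWORD)?)$/

# Hypothetical helper: returns [word1, word2, keyword], or nil when the
# string does not match at all.
def split_words(str)
  md = str.match(PATTERN)
  return nil unless md
  [md[:word1], md[:word2], md[:keyword]]
end

p split_words("ABC12A2012")        # => ["ABC", "12A2012", ""]
p split_words("ABC13B2013KEYWORD") # => ["ABC", "13B2013", "KEYWORD"]
p split_words("12345")             # => nil
```

String#match returns nil when nothing matches, so the nil check is what replaces the if-statement from the earlier examples.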

String#split , on the other hand, breaks the string up by chopping out
any parts of the string that match the regexp, and returning the
remaining chunks as an array. For example:

str2.split(/[0-9]+/) # => ["ABC", "B", "KEYWORD"]

That is, it chops out the numbers, and returns the bits in between.

I’d have to think a lot harder about how you’re getting what you’re
getting with str2, but I’m already pretty sure the actual question can
be resolved using =~ or String#match

On Wed, 2012-06-13 at 10:30 +0900, cyber c. wrote:

for str2 i get substrings as
"", "ABC", "13B2013", "KEYWORD" – Why did the nil sub string popped up?

for str1 i get only 1 substring which is the entire string

That is a pretty confusing way to use split. Normally split is used with
a delimiter to split into multiple parts and you don’t want the
delimiter in the results:

"A B C".split(/ /) # => ["A", "B", "C"]

Your regex actually matches the whole of string str2 and none of str1.
In the case of str2 the first returned string (the empty string) is
simply the part of str2 that occurred before the delimiter, just like if
you did:

" A B C".split(/ /) # => ["", "A", "B", "C"]

The only reason you are getting back any part of your original str2
string (which matches the entire delimiter) is because your delimiter
includes groups (the brackets) and those are returned. This is a
somewhat advanced way to use split - given the state of your regex, I
suggest you start with something simpler to experiment with.

(For str1, the regex doesn't match at all, and with no instances of the
delimiter the entire string is returned as a single element. KEYWORD? is
not an optional match for "KEYWORD"; it is KEYWOR followed by an
optional D.)
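
A short sketch of how capturing groups change what split returns, first with a simple delimiter and then with the regex from the question. The variable names are invented for the example.

```ruby
# Without a group the delimiter is dropped; with a group it is kept.
simple_plain   = "A-B-C".split(/-/)
simple_grouped = "A-B-C".split(/(-)/)

p simple_plain    # => ["A", "B", "C"]
p simple_grouped  # => ["A", "-", "B", "-", "C"]

# The regex from the question, used as a delimiter.
original_re = /([a-zA-Z]+)(.*)(KEYWORD?)/

# str2: the delimiter covers the whole string, so only the empty prefix
# and the three groups come back.
str2_parts = "ABC13B2013KEYWORD".split(original_re)
# str1: no "KEYWOR" anywhere, so nothing matches and nothing is split.
str1_parts = "ABC12A2012".split(original_re)

p str2_parts  # => ["", "ABC", "13B2013", "KEYWORD"]
p str1_parts  # => ["ABC12A2012"]
```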

I suggest you forget about using split and just match using a regex and
pick up the matching groups afterwards using $1 etc, e.g.

str1 =~ /^([a-zA-Z]{3})([\w]{7})(KEYWORD)?$/
word1 = $1
word2 = $2

(Even better in this case though would be to use slices such as
word1 = str1[0,3], etc).
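
A sketch of the slicing approach. It assumes the layout is always fixed width, three letters followed by seven characters, which is true of the two sample strings but is not something the original question guarantees.

```ruby
str1 = "ABC12A2012"
str2 = "ABC13B2013KEYWORD"

# str[start, length]: the first 3 characters, then the next 7.
word1 = str1[0, 3]
word2 = str1[3, 7]
p word1, word2          # => "ABC", "12A2012"

# Anything after position 10 (such as KEYWORD) is simply never read.
p str2[0, 3], str2[3, 7] # => "ABC", "13B2013"
```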

Good luck,
-Paul

str1 = "ABC12A2012"
if str1 =~ /^([a-zA-Z]+)(.*?)((?:KEYWORD)?)$/
  p $1, $2, $3 # => "ABC", "12A2012", ""
end

str2 = "ABC13B2013KEYWORD"
if str2 =~ /^([a-zA-Z]+)(.*?)((?:KEYWORD)?)$/
  p $1, $2, $3 # => "ABC", "13B2013", "KEYWORD"
end

If you don’t like using the if-statement structure, you can also use
String#match , which returns a MatchData object. You can access the
groups using array syntax:

You can also look at String#scan, if a single string contains multiple
instances of what you’re trying to match.
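
A rough sketch of the String#scan idea. It assumes the entries sit in one string separated by whitespace, which is an invented input format, so the pattern uses \S and a lookahead instead of the ^ and $ anchors.

```ruby
# Hypothetical input: two entries separated by a space.
text = "ABC12A2012 ABC13B2013KEYWORD"

# With capturing groups, scan returns one array of groups per match.
# (?=\s|\z) requires each entry to end at whitespace or end of string.
pairs = text.scan(/([a-zA-Z]+)(\S*?)(?:KEYWORD)?(?=\s|\z)/)

p pairs # => [["ABC", "12A2012"], ["ABC", "13B2013"]]
```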

Hi,

cyber c. wrote in post #1064296:

Any one can spot the problem , please let me know.

Your question probably refers to your previous thread:
http://www.ruby-forum.com/topic/4402543#new

The problem is that you’re throwing together two different solutions for
the partitioning problem. If you want to use split, you have to pass the
delimiter on which the string should be split.

However, split probably isn’t the right way to do this. I’d rather use
String#=~ or String#match like the others suggested.

Thanks for the suggestions.

On Wed, 2012-06-13 at 13:21 +0900, Matthew K. wrote:

I’d have to think a lot harder about how you’re getting what you’re
getting with str2, but I’m already pretty sure the actual question can
be resolved using =~ or String#match

In this case:

"ABC13B2013KEYWORD".split(/([a-zA-Z]+)(.*)(KEYWORD?)/)
=> ["", "ABC", "13B2013", "KEYWORD"]

what is happening is that split’s results consist of both the strings
between the matched delimiters, and (for each match of the delimiter)
any match groups from the delimiters themselves. In this case, the regex
matches the entire string. Without match groups it would give:

"ABC13B2013KEYWORD".split(/[a-zA-Z]+.*KEYWORD?/)
=> []

(Logically there is one empty string before the match and a second one
after it, since the delimiter covers the whole string. split drops
trailing empty strings unless you give it a second argument, though, and
here that removes both; split(regex, -1) returns ["", ""].)
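
How split treats empty strings when the delimiter covers the whole input can be checked directly. This sketch uses invented variable names; a negative second argument tells split to keep trailing empty strings.

```ruby
str2 = "ABC13B2013KEYWORD"
plain_re   = /[a-zA-Z]+.*KEYWORD?/
grouped_re = /([a-zA-Z]+)(.*)(KEYWORD?)/

# No limit: trailing empty strings are removed from the result.
no_limit = str2.split(plain_re)
# Negative limit: the empty strings on both sides of the match are kept.
keep_all = str2.split(plain_re, -1)
# With groups and a negative limit, the trailing empty string shows up too.
grouped_all = str2.split(grouped_re, -1)

p no_limit     # => []
p keep_all     # => ["", ""]
p grouped_all  # => ["", "ABC", "13B2013", "KEYWORD", ""]
```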

-Paul