Regexp Parsing -- What's the right way?

Greetings,

I’m trying to parse the following line:

"00608 P 135 001 LEC Tu 2-5P 210 WHEELER Information Tech and Soceity 3 LAGUERRE"

I’ve constructed the following regexp:
/(\d{5}).*(\D\s\w{2,4}).*(\d{1,4}\s\D{3}).*(\D{1,4}\s\d+).*(\d{1,4}\s\D{1,9}).*(\w+).*(\d?).*(\w{1,14}).*/

With an input file I’ve produced the following output:
control#: 00608 ---- correct
course#: P 135 ---- correct
section#: 001 LEC ---- correct
day-hour#: Tu 2 ---- missing '-5P'
room#: 3 LAGUERRE, ---- should be 210 WHEELER
course-name#: M ---- IT and Soceity
credits#: ---- should be 3
prof#: 5 ---- should be LAGUERRE

I’m a novice to Ruby and regexps. I would like to know if I’m taking the
right approach.
I’ll eventually nail it, but any hints or suggestions would be useful.

Appreciate the help.

I would go with split in this case:

t = "00608 P 135 001 LEC Tu 2-5P 210 WHEELER Information Tech and Soceity 3 LAGUERRE"
a = t.split
# strip from the beginning
control = a.shift
course = a.shift + ' ' + a.shift
section = a.shift + ' ' + a.shift
hour = a.shift + ' ' + a.shift
room = a.shift + ' ' + a.shift
# strip from behind
prof = a.pop
credits = a.pop
# the rest is the name
coursen = a.join(' ')

puts "control: #{control}"
puts "course: #{course}"
puts "section: #{section}"
puts "hour: #{hour}"
puts "room: #{room}"
puts "coursen: #{coursen}"
puts "credits: #{credits}"
puts "prof: #{prof}"

cheers

Simon
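
For reuse, the split approach above could be wrapped in a method. This is only a sketch, assuming Ruby 1.9 or later and assuming every field except the course name has a fixed word count, as in the sample line. The method name parse_by_split and the hash keys are made up for illustration and are not from the thread.

```ruby
# Hypothetical helper: splits a course line into its fields.
# Assumes control, credits and prof are one word each, and that
# course, section, hour and room are exactly two words each.
def parse_by_split(line)
  tokens = line.split
  # 1 (control) + 4 * 2 (two-word fields) + 1 (credits) + 1 (prof)
  # + at least one word of course name = 12 tokens minimum
  return nil if tokens.size < 12

  control = tokens.shift
  # take the next four two-word fields from the front, in order
  course, section, hour, room = Array.new(4) { tokens.shift(2).join(' ') }
  # take the last two single-word fields from the back
  prof    = tokens.pop
  credits = tokens.pop
  # whatever is left in the middle is the course name
  { control: control, course: course, section: section, hour: hour,
    room: room, name: tokens.join(' '), credits: credits, prof: prof }
end

rec = parse_by_split("00608 P 135 001 LEC Tu 2-5P 210 WHEELER IT and Society 3 LAGUERRE")
puts rec[:room]  # 210 WHEELER
puts rec[:name]  # IT and Society
```

The token-count guard is the only validation here; a line with a missing field but enough words still parses, just with shifted values.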

On 8/12/06, [email protected] wrote:

Greetings,

I’m trying to parse the following line:

Hi,

Although in this case I’d prefer the split solution, here’s how it can
be done in case you really need a regex.
These are incremental versions of the regex, and a test to check them.
Save to a file and enjoy!

Jano

#!/usr/bin/ruby
require 'test/unit'

DATA = "00608 P 135 001 LEC Tu 2-5P 210 WHEELER IT and Society 3 LAGUERRE"

REGEX1 = /(\d{5}).*(\D\s\w{2,4}).*(\d{1,4}\s\D{3}).*(\D{1,4}\s\d+).*(\d{1,4}\s\D{1,9}).*(\w+).*(\d?).*(\w{1,14}).*/

# add /x and # comments

REGEX2 = /
  (\d{5}).*              # control
  (\D\s\w{2,4}).*        # course
  (\d{1,4}\s\D{3}).*     # section
  (\D{1,4}\s\d+).*       # day-hour
  (\d{1,4}\s\D{1,9}).*   # room
  (\w+).*                # course-name
  (\d?).*                # credits
  (\w{1,14}).*           # professor
/x

# now we'll fix the day-hour: add -\d[AP] - that will match a hyphen,
# a digit and either 'A' or 'P'
# and fix the room: replace \D{1,9} with \w+

REGEX3 = /
  (\d{5}).*                  # control
  (\D\s\w{2,4}).*            # course
  (\d{1,4}\s\D{3}).*         # section
  (\D{1,4}\s\d+-\d[AP]).*    # day-hour
  (\d{1,4}\s\w+).*           # room
  (\w+).*                    # course-name
  (\d?).*                    # credits
  (\w{1,14}).*               # professor
/x

# To fix the course name, the previous tricks aren't enough:
# there are many words, of different lengths. So what do we do?
# We'll parse the things at the end first: credits and professor.
# To see the results, temporarily comment out the lines
# that check the course name and credits in the test
# and run it with REGEX3.
# To fix the professor, we'll say that it's the last word on the line.
# Notice the \s+ before the professor group: there has to be something
# fixed that separates the name from the rest - .* won't do it.

REGEX4 = /
  (\d{5}).*                  # control
  (\D\s\w{2,4}).*            # course
  (\d{1,4}\s\D{3}).*         # section
  (\D{1,4}\s\d+-\d[AP]).*    # day-hour
  (\d{1,4}\s\w+).*           # room
  (\w+).*                    # course-name
  (\d?)\s+                   # credits
  (\w+)\s*$                  # professor
/x

# Now we can try the remaining two pieces: uncomment credits and
# we'll see that they are already ok, so uncomment course name as well.
# Only the first word appears. So we'll move .* inside the parentheses
# and add a separating \s+
# Finally some small touches:
# replace the separating .* with \s+
# and require at least one digit for the credits (\d+)

REGEX5 = /
  (\d{5})\s+                   # control
  (\D\s\w{2,4})\s+             # course
  (\d{1,4}\s\D{3})\s+          # section
  (\D{1,4}\s\d+-\d[AP])\s+     # day-hour
  (\d{1,4}\s\w+)\s+            # room
  (.*)\s+                      # course-name
  (\d+)\s+                     # credits
  (\w+)\s*$                    # professor
/x

class TestRegex < Test::Unit::TestCase
  def test_regex
    assert DATA =~ REGEX1 # <-- change the number here
    assert_equal "00608", $1
    assert_equal "P 135", $2
    assert_equal "001 LEC", $3
    assert_equal "Tu 2-5P", $4
    assert_equal "210 WHEELER", $5
    assert_equal "IT and Society", $6
    assert_equal "3", $7
    assert_equal "LAGUERRE", $8
  end
end
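
If the numbered globals $1 to $8 get hard to follow, named groups are one alternative. The sketch below reuses the patterns of the final regex, assuming Ruby 1.9 or later for the (?<name>...) syntax. The constant name COURSE_LINE, the group names, and the \A and \z anchors are additions for illustration and not part of the script above.

```ruby
# Same field patterns as the final regex, but each group has a name.
# \A and \z (not in the original) pin the match to the whole string.
COURSE_LINE = /
  \A
  (?<control>\d{5})\s+
  (?<course>\D\s\w{2,4})\s+
  (?<section>\d{1,4}\s\D{3})\s+
  (?<day_hour>\D{1,4}\s\d+-\d[AP])\s+
  (?<room>\d{1,4}\s\w+)\s+
  (?<name>.*)\s+
  (?<credits>\d+)\s+
  (?<prof>\w+)\s*
  \z
/x

m = COURSE_LINE.match("00608 P 135 001 LEC Tu 2-5P 210 WHEELER IT and Society 3 LAGUERRE")
puts m[:day_hour]  # Tu 2-5P
puts m[:name]      # IT and Society
puts m[:prof]      # LAGUERRE
```

Looking fields up by name means the code that uses the match does not break when a group is added or reordered.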

Looks pretty OK to me, apart from that I’d use \s instead of .* to match
the whitespace separating entries.

robert

Hi,

I worked on the regexp some more before I saw everyone’s response.
I was able to extract all parts except for the day hour. I was treating
- as "\-" and the literals A as "\A" and P as "\P", so I didn’t hit any
matches.

I see you created line breaks with each component of the REGEXP. I will
follow that convention from now on.

I also now understand the difference between .* and \s+ as many of you
have pointed out.

I’m new to Ruby as well and will continue to experiment with it some
more.

Thanks for your responses.
[sukhchander]
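
The difference between .* and \s+ as a separator can be seen on a shortened version of the sample line. This is a small sketch; the constant names LINE, LOOSE and TIGHT are made up for illustration.

```ruby
LINE = "00608 P 135 001 LEC"

# Greedy .* first swallows the rest of the line, then gives characters
# back one at a time, so the following group matches at the LAST position
# where it can, not at the field you meant.
LOOSE = /(\d{5}).*(\d{1,4}\s\D{3})/

# \s+ can only consume whitespace, so every field has to be spelled out
# and each group starts right after the gap that follows the previous one.
TIGHT = /(\d{5})\s+\D\s\w{2,4}\s+(\d{1,4}\s\D{3})/

puts LINE.match(LOOSE)[2]  # 1 LEC
puts LINE.match(TIGHT)[2]  # 001 LEC
```

The loose version loses the leading "00" of the section because .* only backs off as far as it must for \d{1,4} to find a single digit.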

Hi Simon,

That’s pretty cool.
I was looking for a utility similar to Java’s StringTokenizer. You just
pointed it out.
Ruby has so many things built in. It’s very comprehensive.

For larger regexp I assume you prefer the split/tokenize method?

I went with the Regexp approach because it occurred to me first.

Thanks.
[sukhchander]

Personally I’d stick with the regexp approach as it has these
advantages:

  • probably faster because you don’t have to split and then combine
    again

  • more precise with regard to matching, i.e. you can better define
    where to match plus you get the info whether the input string is
    properly formatted
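
The second point can be sketched like this, assuming the final regex from earlier in the thread with \A and \z anchors added. The names COURSE_RX and well_formed? are hypothetical, and the malformed line is the sample with its day-hour field deleted.

```ruby
# Final regex from the thread, anchored to the whole string (the anchors
# are an addition).
COURSE_RX = /
  \A
  (\d{5})\s+                  # control
  (\D\s\w{2,4})\s+            # course
  (\d{1,4}\s\D{3})\s+         # section
  (\D{1,4}\s\d+-\d[AP])\s+    # day-hour
  (\d{1,4}\s\w+)\s+           # room
  (.*)\s+                     # course-name
  (\d+)\s+                    # credits
  (\w+)\s*                    # professor
  \z
/x

# Hypothetical helper: true only if the whole line fits the format.
def well_formed?(line)
  !COURSE_RX.match(line).nil?
end

good = "00608 P 135 001 LEC Tu 2-5P 210 WHEELER IT and Society 3 LAGUERRE"
bad  = "00608 P 135 001 LEC 210 WHEELER IT and Society 3 LAGUERRE"  # no day-hour

puts well_formed?(good)  # true
puts well_formed?(bad)   # false

# split has no notion of format: on the bad line, the two tokens where
# the hour should be are simply the room.
puts bad.split[5, 2].join(' ')  # 210 WHEELER
```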

Btw, if you want to dive into regexp I can recommend “Mastering Regular
Expressions”. It’s probably best to first get some basic knowledge of
RX, but if you want to know how to build efficient RX etc. then that book
is definitely a great help. Ah, I get carried away…

Then there are also tools that help in understanding RX visually:
RegexBuddy and Regex-Coach.

Kind regards

robert