On 8/12/06, [email protected] wrote:
Greetings,
I’m trying to parse the following line:
…
Hi,
Although in this case I'd prefer the array.split solution, here's how
it can be done in case you really need a regex:
Below are incremental versions of the regex, and a test to check them.
Save it to a file and enjoy!
Jano
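For comparison, here is a minimal sketch of what a split-based version might look like. It is not the code from earlier in the thread: the method name parse_schedule_line and the hash keys are made up for illustration, and it assumes that every field except the course name has a fixed number of whitespace-separated tokens, as in the sample line.

```ruby
# Hypothetical helper (not from the original thread): parse one schedule
# line by splitting on whitespace instead of matching a regex.
# Assumption: only the course name has a variable number of words.
def parse_schedule_line(line)
  tokens = line.split                       # split on runs of whitespace
  {
    :control   => tokens[0],
    :course    => tokens[1, 2].join(" "),
    :section   => tokens[3, 2].join(" "),
    :day_hour  => tokens[5, 2].join(" "),
    :room      => tokens[7, 2].join(" "),
    :name      => tokens[9..-3].join(" "),  # everything between room and credits
    :credits   => tokens[-2],
    :professor => tokens[-1]
  }
end

fields = parse_schedule_line(
  "00608 P 135 001 LEC Tu 2-5P 210 WHEELER IT and Society 3 LAGUERRE"
)
puts fields[:name]       # course name, however many words it has
puts fields[:professor]  # last token on the line
```

Counting from both ends of the token array is what lets the variable-length course name fall out in the middle, which is the same idea the regex below uses when it pins down credits and professor first.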
#!/usr/bin/ruby
require 'test/unit'

DATA = "00608 P 135 001 LEC Tu 2-5P 210 WHEELER IT and Society 3 LAGUERRE"

REGEX1 =
/(\d{5}).*(\D\s\w{2,4}).*(\d{1,4}\s\D{3}).*(\D{1,4}\s\d+).*(\d{1,4}\s\D{1,9}).*(\w+).*(\d?).*(\w{1,14}).*/

# add /x and # comments,
REGEX2 = /
  (\d{5}).*               # control
  (\D\s\w{2,4}).*         # course
  (\d{1,4}\s\D{3}).*      # section
  (\D{1,4}\s\d+).*        # day-hour
  (\d{1,4}\s\D{1,9}).*    # room
  (\w+).*                 # course-name
  (\d?).*                 # credits
  (\w{1,14}).*            # professor
/x

# now we'll fix the day-hour: add -\d[AP], which will match a hyphen,
# a digit and either 'A' or 'P',
# and fix the room: replace \D{1,9} with \w+
REGEX3 = /
  (\d{5}).*                   # control
  (\D\s\w{2,4}).*             # course
  (\d{1,4}\s\D{3}).*          # section
  (\D{1,4}\s\d+-\d[AP]).*     # day-hour
  (\d{1,4}\s\w+).*            # room
  (\w+).*                     # course-name
  (\d?).*                     # credits
  (\w{1,14}).*                # professor
/x

# To fix the course name, the previous tricks aren't enough:
# it has many words of different lengths. So what do we do?
# We'll parse the things at the end first: credits and professor.
# To see the results, temporarily comment out the lines
# that check the course name and credits in the test
# and run it with REGEX3.
# To fix the professor, we'll say that it's the last word on the line.
# Notice the \s+ before the professor group: there has to be something
# fixed that separates the name from the rest; .* won't do it.
REGEX4 = /
  (\d{5}).*                   # control
  (\D\s\w{2,4}).*             # course
  (\d{1,4}\s\D{3}).*          # section
  (\D{1,4}\s\d+-\d[AP]).*     # day-hour
  (\d{1,4}\s\w+).*            # room
  (\w+).*                     # course-name
  (\d?)\s+                    # credits
  (\w+)\s*$                   # professor
/x

# Now we can try the remaining two pieces: uncomment credits and
# we'll see that they are already OK, so uncomment the course name as well.
# Only the first word appears. So we'll move .* inside the parentheses
# and add a separating \s+.
# Finally, some small touches:
# replace the separating .* with \s+
REGEX5 = /
  (\d{5})\s+                  # control
  (\D\s\w{2,4})\s+            # course
  (\d{1,4}\s\D{3})\s+         # section
  (\D{1,4}\s\d+-\d[AP])\s+    # day-hour
  (\d{1,4}\s\w+)\s+           # room
  (.*)\s+                     # course-name
  (\d+)\s+                    # credits
  (\w+)\s*$                   # professor
/x

class TestRegex < Test::Unit::TestCase
  def test_regex
    assert DATA =~ REGEX1 # <-- change number here
    assert_equal "00608", $1
    assert_equal "P 135", $2
    assert_equal "001 LEC", $3
    assert_equal "Tu 2-5P", $4
    assert_equal "210 WHEELER", $5
    assert_equal "IT and Society", $6
    assert_equal "3", $7
    assert_equal "LAGUERRE", $8
  end
end
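A note on why the last touch (replacing the separating .* with \s+) is more than cosmetic: a greedy .* between two groups grabs as much as it can, so the following group only gets the shortest tail that still matches. The small sketch below shows this on the first three fields only. The names line, greedy and strict are just for this illustration, and the patterns are cut-down versions of the ones above.

```ruby
# Only the first three fields, to keep the example small.
line = "00608 P 135 001 LEC"

# Separators are greedy .* as in the early versions of the regex.
greedy = /(\d{5}).*(\D\s\w{2,4}).*(\d{1,4}\s\D{3})/
# Separators are \s+ as in the final version.
strict = /(\d{5})\s+(\D\s\w{2,4})\s+(\d{1,4}\s\D{3})/

greedy_match = line.match(greedy)
strict_match = line.match(strict)

# The .* before the section group swallows the leading "00",
# leaving only the last digit for \d{1,4}.
puts greedy_match[3]   # prints "1 LEC"

# With \s+ the separator can only be whitespace, so the group
# has to start right at the first digit of the section.
puts strict_match[3]   # prints "001 LEC"
```

So if a field comes back with its first characters missing, a greedy .* in front of it is the first thing to check.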