Elegant Solution to a Seemingly Simple Problem?

derek · April 18, 2010, 9:23am

Hello everyone. It’s me: Derek, again! Sorry for writing a novel here,
but I’d really appreciate some help.

I’m still working on the same program – a way to show valid course
combinations for my school schedule, using an HTML file that contains
all the courses for the semester.

I have a rough draft copy of it working, but I’d like to see an example
of a more elegant coding style than my own.

Here’s a (simplified) example of the data I’m working with:

Intro to Programming #title MW #days 9:00am-10:30am #time Dr. Smith # professor Intro to Knitting TR 9:00am-10:30am Dr. Mittens

Earlier, someone on the forum showed me a very elegant way to collect
this information (I use Nokogiri). It was:

doc = Nokogiri::HTML(open(url))

raw_course_list = doc.css(“tr”).collect { |row|
row.css(“td”).collect { |column|
column.text.strip
}
}

This would give me an array of arrays in the format
[[courseA,data,data], [courseB, data, data]].

E.g., in this case it would yield:
[[“Intro to Programming”, “MW”, “9:00am-10:30am”, “Dr. Smith”], [“Intro
to Knitting”, “TR”, “9:00am-10:30am”, “Dr. Mittens”]]

This works perfectly, except in 3 main cases.

*** Problem 1: The

does not contain course information. (It’s some
irrelevant part of the HTML). In this case, I did the following:
raw_course_data.reject! { |i| i.size != 4 }, would filtered out
non-courses. Note: no tables without course data had the size of one
with course data (in the non-simplified version, the size is actually
much larger).

So, already I think it’s ugly coding! It firsts loads ALL

contents
into arrays, then rejects them after creation.

*** Problem 2: In a few cases, some courses do not have specified days
and times yet. In those cases, the course days reads “TBA” (to be
announced), and there is no column for time. Thus, the array of such
courses is 1 less than the normal expected case.

E.g.:

Algebra TBA # notice there is now 1 for day/time now Dr. Calculator

Thus, I create ANOTHER time that Ruby goes back over the elements of
raw_course_list again. This time, the code is put right before problem
1’s fix:

raw_course_list.each { |i|
if i.size == 3
i.insert(2, “”)
end
}

So again, if an array has a size of 3, I figure it’s a valid course,
just with no time assigned, so I create a blank element between the day
and professor, just to satisfy the Course class, which these array
elements of the outer array will ultimately become. E.g. of call:
Course.new(title, day, time, professor)

*** Problem 3: Some rows of the HTML are actually a continued
description of the course in the row above. For example, a course that
has a lab might look like this:

Chemistry /w Lab TR 9:00am-10:30am Dr. Chemicals # Empty, since the above row provides the course name R # Day of the lab 11:00am-12:30pm # Time of the lab Dr. Chemicals # Lab professor

The good news is it’s the same length as a normal class. So for this, I
add a bit more code to problem 2’s code (the each block), changing the
each method to .each_with_index:

raw_course_list.each_with_index { |i, index|
if i.size == 3
i.insert(2, “”)
end

NEW CODE FOR LABS (still working out the kinks, but hopefully I

won’t need this)

lab will always have a size of 4 and a empty first element so:

if i[0].empty?
# add all the data from the lab to the previous course:
raw_course_list[index-1].push(i.each { |element| element })

# then remove lab from raw_course_list
raw_course_list.pop(index)

# index has to go back one to avoid skipping an element (since we

popped one)
index -= 1
end
}

==================================

So there you have it. Can anyone think of a way where I can improve the
quality and elegance of this code?

derek · April 18, 2010, 9:48am

On 18 April 2010 08:23, Derek C. [email protected] wrote:

Here’s a (simplified) example of the data I’m working with:
9:00am-10:30am column.text.strip This works perfectly, except in 3 main cases. Dr. Calculator } Dr. Chemicals # Lab professor
index has to go back one to avoid skipping an element (since we

Posted via http://www.ruby-forum.com/.

Do you have any control over the HTML at all? Some semantic HTML
classes
would go a long way to simplifying this.

derek · April 18, 2010, 10:56am

Derek C. wrote:

Here’s a (simplified) example of the data I’m working with:

Can you post a complete sample page somewhere?

There are lots of ways to identify more precisely which part of the HTML
you want, using CSS selectors. Most easily, if the rows are inside

then seomthing like 'table#courses tr' could do it.

But otherwise, you can select the table based on its location in the
page relative to other elements (nth, nth-child).

derek · April 18, 2010, 9:56am

Do you have any control over the HTML at all? Some semantic HTML
classes would go a long way to simplifying this.

The HTML comes from a page on my schools website

derek · April 18, 2010, 11:44am

For your contemplation:

doc = Nokogiri::HTML(open(url))
raw_course_list = doc.css(“tr”).collect { |row|
t_row = row.css(“td”).collect { |column| column.text.strip }
t_row.insert(2, “”) if (t_row[1] == “TBA”)
}.reject{ |i| i.size != 4 }

This isn’t guaranteed to work because we’re dealing with pieces of the
HTML you are parsing. Give us a sample with ALL the variations in it
and we can stop shooting in the dark.

derek · April 18, 2010, 12:21pm

There are lots of ways to identify more precisely which part of the HTML
you want, using CSS selectors. Most easily, if the rows are inside
then seomthing like 'table#courses tr' could do it.
Since my original post, I’ve been playing around with the code some
more. I made a new way of getting courses that automatically filters out
“non-course” rows. The code is:
table = doc.css(“tr”).collect { |row|
row.css(".dddefault").collect { |column|
column.text.strip
}
}
This way, “non-courses” appear as empty arrays. I still don’t know how
to neatly get rid of the empty arrays… I tried .compact! but that
doesn’t seem to work.

doc = Nokogiri::HTML(open(url))
raw_course_list = doc.css(“tr”).collect { |row|
t_row = row.css(“td”).collect { |column| column.text.strip }
t_row.insert(2, “”) if (t_row[1] == “TBA”)
}.reject{ |i| i.size != 4 }

Excellent example, I think this is much better than what I had earlier.
I guess I could now replace your reject with i.empty?, right?
PS - I changed raw_course_list to table to make it more readable.

derek · April 18, 2010, 12:14pm

Sure, I’ll post HTML examples. In this non-simplified version, there are
20 columns per row which are:

availability, course_reference_number, subject, course_number, section,
campus, credit_hours, title, days, time, cap, registered, remaining,
xl_cap, xl_registered, xl_remaining, professor, date, location,
attributes

Here’s three specific examples of the HTML that cover all the
possibilities (normal class, course with TBA day, and labs):

NR 80983 ACCT 2101 01 A 3.000 Intro to Financial Accounting MW 08:00 am-09:15 am 30 0 30 0 0 0 TBA 08/23-12/09 A 1880 C 81085 BUSA 4700 01 A 3.000 Selected Topics in Business TBA 0 0 0 0 0 0 TBA 08/23-12/09 TBA NR 80073 CHEM 1151K 01 A 4.000 Survey of Chemistry I w/Lab MF 11:00 am-12:15 pm 20 0 20 0 0 0 David Pursell (P) 08/23-12/09 A 1400 W 11:00 am-01:45 pm David Pursell (P) 08/23-12/09 A 1290

derek · April 18, 2010, 1:51pm

This way, “non-courses” appear as empty arrays. I still don’t know how
to neatly get rid of the empty arrays… I tried .compact! but that
doesn’t seem to work.

Try #flatten! instead. #compact! just gets rid of nil entries, and an
empty array is not the same as nil.

Ehsan

derek · April 18, 2010, 4:32pm

Hi –

On Sun, 18 Apr 2010, Derek C. wrote:

column.text.strip

t_row.insert(2, “”) if (t_row[1] == “TBA”)
}.reject{ |i| i.size != 4 }

Excellent example, I think this is much better than what I had earlier.
I guess I could now replace your reject with i.empty?, right?

PS - I changed raw_course_list to table to make it more readable.

I think having the condition and the reject be the last things in the
code are going to make it hard to follow it later. I wouldn’t rule out
doing something a tiny bit more procedural but maybe a little easier
to parse visually, like this:

doc = Nokogiri::HTML(open(url))
table = []
doc.css(“tr”).each do |row|
cells = row.css(“td”).map {|cell| cell.text.strip }
next unless cells.size == 4
next unless cells[1] == “TBA”
cells.insert(2, “”)
table << cells
end

You could also extract some methods, and end up with something like:

table = doc.css(“tr”).
select {|row| valid_row?(row) }.
map {|row| prepare_row(row) }

(The above is all untested.)

David

–
David A. Black, Senior Developer, Cyrus Innovation Inc.

THE Ruby training with Black/Brown/McAnally
COMPLEAT Coming to Chicago area, June 18-19, 2010!
RUBYIST http://www.compleatrubyist.com

derek · April 18, 2010, 7:51pm

On Apr 18, 1:23 am, Derek C. [email protected] wrote:

So, already I think it’s ugly coding! It firsts loads ALL contents
into arrays, then rejects them after creation.
[…]

Generalized, you have an array of values and you want to map a subset
of them to new array. There are (at least) four patterns you can use
to handle this sort of situation:

Map the unwanted elements to a ‘broken’ value and then reject the
broken values later. (What you are doing now.) This can be hard if you
don’t have a way of creating a broken value. For example, you might be
mapping all values directly to an object, but you don’t have enough
information for the object constructor and no way of making up clearly
spurious values. Further, it’s inefficient as you do the work and use
the memory of creating the object only to throw it out later.
Map the unwanted elements to nil and compact the array afterwards.
In your case, you’d need to look at the TDs in your row and decide if
you wanted to map the row to the mapping of them or nil. This is
convenient in terms of one-liners, but still slightly inefficient
because you’re creating an intermediary array packed with nils that
you don’t want. (You should be clear, though, that computational
inefficiency is not always more important than programmer convenience
of code clarity.)
Instead of using map (or the same effect under the longer name
‘collect’, as Robert apparently likes) to create a new array from your
original, explicitly create the new array and push values only as
valid. This is basically the same as above, but without the nil values
and the later compact. For example:
raw_course_list = []
doc.css(“tr”).each { |row|
tds = row.css(“td”)
if tds.have_the_values_I_want
raw_course_list << tds.map{ |col| … }
end
}
Use map (collect) on the array as in #1 or #2, but before that do a
pass through your source array and sanitize it. Sanitization might be
mapping values to nil and then compacting (thus very similar to #2),
or fixing values (as in your TBA or continued description case). This
feels cleaner, but note that this has you doing one (or two, in the
case of map+compact) passes on your data before you get around to
mapping it.

Here’s (very roughly) what I might do given what you wrote:

Assuming you’re using Ruby 1.9

course_info = []
trs = doc.css(‘tr’)
trs.each.with_index{ |row,i|
tds = row.css(‘td’)
title = …
prof = …
days = …
times = …
desc = …
next_row = trs[i+1]
if next_row && next_row.is_a_continuation?
# Add content from next_row to description
# If needed, invalidate next_row so it will be skipped
elsif title && prof && days # If you have all the information you
need
course_info << Course.new( title, prof, days )
end
}

Regardless of the approach you use, remember that even though you’re
annoyed that you are ‘processing’ (in one form or another) invalid
entries, you have to touch every row to find out if you like it or
not. It’s up to you for how you detect which are invalid and handle
them.

derek · April 19, 2010, 5:14am

doc = Nokogiri::HTML(open(url))
table = []
doc.css(“tr”).each do |row|
cells = row.css(“td”).map {|cell| cell.text.strip }
next unless cells.size == 4
next unless cells[1] == “TBA”
cells.insert(2, “”)
table << cells
end

This is interesting. So nil is returns for elements that activate the
unless statement?

Assuming you’re using Ruby 1.9

course_info = []
trs = doc.css(‘tr’)
trs.each.with_index{ |row,i|
tds = row.css(‘td’)
title = …
prof = …
days = …
times = …
desc = …
next_row = trs[i+1]
if next_row && next_row.is_a_continuation?
# Add content from next_row to description
# If needed, invalidate next_row so it will be skipped
elsif title && prof && days # If you have all the information you
need
course_info << Course.new( title, prof, days )
end
}

I like this code a lot, however, is it considered less efficient to look
at the next row to see if it is a lab for the previous row rather than
checking each row to see if it’s a lab, and if it is, adding it to the
last row instead? That way, every row won’t need to check with the next
row.

derek · April 20, 2010, 4:15am

On Apr 18, 9:15 pm, Derek C. [email protected] wrote:

This is interesting. So nil is returns for elements that activate the
unless statement?

No. Next inside a block is like an early return. The ruby code:
do_this unless something
is the same as:
unless something
do_this
end
is the same as:
if !something
do_this
end

So what’s happening in the above is that you’re creating an array
(table), and then looping through the list of rows returned from
Nokogiri. For each row, if certain criteria are met (there aren
exactly four cells, or the second cell is “TBA”) you immediately stop
processing that row and move on to the next. (“Next!” shouts the
government worker, dismissing you and moving on.) If you didn’t bail
out early, however, you inject an extra entry into the array of
strings and then shove that whole array as a new entry onto the end of
the table array.

Because you’re using the pattern of creating an array and
conditionally populating during a traversal, there are no embedded
nils to clean up later.

For comparison, here’s similar code that WOULD leave you with a table
with some nil entries (that you’d probably want to #compact
afterwards):

table = doc.css(“tr”).map do |row|
cells = row.css(“td”).map {|cell| cell.text.strip }
if cells.size==4 && cells[1]==“TBA”
cells.insert(2, “”)
cells
end
end

In the code immediately above, the value of an if statement that
matches is the value of the last expression, while the value of an if
statement that is not matches is nil.

Here’s a simpler example showing the difference between map and each-
with-injecting:

irb(main):001:0> digits = (1…9).to_a
=> [1, 2, 3, 4, 5, 6, 7, 8, 9]

irb(main):002:0> odds = []
=> []

irb(main):003:0> digits.each{ |n| if n%2==1 then odds<<n end }
=> [1, 2, 3, 4, 5, 6, 7, 8, 9]

irb(main):004:0> odds
=> [1, 3, 5, 7, 9]

irb(main):005:0> evens = digits.map{ |n| if n%2==0 then n end }
=> [nil, 2, nil, 4, nil, 6, nil, 8, nil]

irb(main):006:0> evens.compact
=> [2, 4, 6, 8]