ActiveRecord Classes

I’m having a little trouble working out how to structure some of my
classes with ActiveRecord when one of the files lives in my lib
directory:

Brief example:

Here’s the outline of the files in use:

app/
  controllers/
    application_controller.rb
    rushing_offenses_controller.rb
  models/
    rushing_offense.rb
lib/
  scraper.rb
  tasks/
    scraper.rake

The rushing_offense.rb file contains:

class RushingOffense < ActiveRecord::Base
end

The scraper.rb file contains:

class Scraper < ActiveRecord::Base

  # METHOD that defines which URL to parse

  # METHOD that parses the data into an instance variable called @rows

  # METHOD that should be updating my database table called "rushing_offenses"

  # Update Rushing Offense
  def update_rushing_offense
    for i in 0..@numrows-1
      update_all(:name => @rows[i][0], :games => @rows[i][1])
      puts "Updating Team Name = #{@rows[i][0]}."
    end
  end
end

The scraper.rake file contains:

desc "This task will parse data from ncaa.org and upload the data to our db"
task :scraper => :environment do
  offensive_rushing = Scraper.new('http://web1.ncaa.org/mfb/natlRank.jsp?year=2008&rpt=IA_teamrush&site=org',
                                  'table', 'statstable', '//tr')
  offensive_rushing.scrape_data
  offensive_rushing.clean_celldata
  offensive_rushing.print_values
  offensive_rushing.update_rushing_offense # the call to the method above
end

Now if I run the rake task, I get an error stating:

Table 'project_development.scrapers' doesn't exist

I believe I understand why that’s happening but I’m not sure how to fix
it from a long term perspective. Here’s why…

Because Scraper inherits from ActiveRecord::Base, ActiveRecord assumes
the class is backed by a table named after its pluralized class name,
scrapers. I then thought maybe I need to put the code in the
rushing_offenses_controller.rb file, in that class, but here’s the
issue I’m having:

The Scraper class should be a class that I can call with other classes
to do repetitive tasks on many different URLs. I’ve setup the class to
do that with the methods being able to retrieve different URLs.

So, I want my Scraper class to just act like a utility class to be used
by other classes to parse data, and upload it to the correct database
table. If I place the scraper class inside the
rushing_offenses_controller file then I’m not following DRY principles.
I don’t want to have to repeat code over and over.

Any ideas on how I can rectify this issue I’m having?

To expand upon the issue:

There are approximately 37 different categories for College Football
that house statistics. I will be parsing 37 different URLs to pull and
retrieve data that will be pushed to my database. The Scraper class is
the tool for doing that.

Each call within my rake task is going to call specific URLs using the
methods located in the Scraper class but will update to specific table
names.

Example:

rushing_offense.rb —> connects to the rushing_offenses table
passing_offense.rb —> connects to the passing_offenses table
scoring_offense.rb —> connects to the scoring_offenses table

Call to scraper.rb to parse data from a rushing offense URL
Call to scraper.rb to update data to rushing_offenses table
Call to scraper.rb to parse data from a passing offense URL
Call to scraper.rb to update data to passing_offenses table
Call to scraper.rb to parse data from a scoring offense URL
Call to scraper.rb to update data to scoring_offenses table
etc. etc.
– for 37 different categories
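
One way to avoid writing 37 near-identical calls is to drive the rake task from a small lookup table. The sketch below is only an illustration: the CATEGORIES hash and the two helper methods are hypothetical, and only the rushing report code (IA_teamrush) comes from the post. The other codes are placeholders, not real ncaa.org values.

```ruby
# Hypothetical registry: model class name => report code used in the URL.
# Only IA_teamrush is taken from the original post; the rest are placeholders.
CATEGORIES = {
  'RushingOffense' => 'IA_teamrush',
  'PassingOffense' => 'PLACEHOLDER_PASSING',
  'ScoringOffense' => 'PLACEHOLDER_SCORING'
}

# Build the report URL in the same shape as the rushing URL from the post.
def report_url(report_code)
  "http://web1.ncaa.org/mfb/natlRank.jsp?year=2008&rpt=#{report_code}&site=org"
end

# Crude snake_case plus "s"; enough for these class names, and only a
# rough imitation of the table name ActiveRecord would derive.
def table_name_for(class_name)
  class_name.gsub(/([a-z])([A-Z])/, '\1_\2').downcase + 's'
end

CATEGORIES.each do |class_name, code|
  puts "#{class_name} -> #{table_name_for(class_name)} <- #{report_url(code)}"
end
```

Adding a category then means adding one line to the hash rather than another block of calls in the rake task.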

To add another thought to the mix:

The only reason I’m defining a rake task is that eventually the rake
task will be run by a cron job that populates my database on a weekly
basis (say every Sunday night).

The main bulk of the remainder of my project will just be dealing with
controllers and views for how the site is listed…

So, the population of data from an external source is the big issue
right now.

I think I found my own answer to the last question - a single class
cannot inherit across multiple classes. :frowning:

Another thing I considered is inheritance.

If I do

class Scraper < RushingOffense, then Scraper would inherit from the
RushingOffense class located in the rushing_offense.rb model. Then I
could possibly put the following in my rake task:

offensive_rushing = RushingOffense::Scraper.new

However, I would want Scraper to be a part of every statistical class I
create. So, it would have to be a member of:

RushingOffense
PassingOffense
ScoringOffense
etc…

How would I force inheritance across multiple classes?
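
Ruby classes have a single superclass, so the usual way to share behaviour across several unrelated classes is a module. The sketch below is a minimal illustration, assuming a hypothetical Scrapable module and an import_rows method that do not appear in the post. The classes are plain Ruby here so the snippet runs without Rails; in the application they would still inherit from ActiveRecord::Base.

```ruby
# Shared behaviour lives in a module instead of a common superclass.
module Scrapable
  # With "extend", this becomes a class-level method on each model,
  # so "name" below is the name of the extending class.
  def import_rows(rows)
    rows.map { |row| "#{name}: #{row.join(', ')}" }
  end
end

class RushingOffense
  extend Scrapable
end

class PassingOffense
  extend Scrapable
end

puts RushingOffense.import_rows([[1, 'Team A']])
puts PassingOffense.import_rows([[1, 'Team B']])
```

Each statistical class gets the same method without Scraper having to sit in its inheritance chain.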

On Jun 7, 8:30 pm, “J. D.” wrote:

I think I found my own answer to the last question - a single class
cannot inherit across multiple classes. :frowning:

Does Scraper need to be an activerecord class at all ? you could pass
to it the class whose table needs to be updated ie

def do_something(some_klass)
some_klass.update_all(…)
end

or perhaps you might want to couple things a little more loosely

def do_something(some_klass)
some_klass.handle_scraper_data(…)
end

Fred
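
Fred's suggestion could be sketched roughly as follows. This is an illustration only: FakeModel is a hypothetical in-memory stand-in for a model such as RushingOffense so the snippet runs without Rails, push_to is an invented method name, and handle_scraper_data is the hook Fred proposes rather than an existing ActiveRecord method.

```ruby
# Plain Ruby utility class: it does not inherit from ActiveRecord::Base,
# so no "scrapers" table is ever looked up.
class Scraper
  attr_reader :rows

  # Rows would normally come from parsing a page; they are passed in
  # directly here (an assumption, to keep the sketch self-contained).
  def initialize(rows)
    @rows = rows
  end

  # Hand each parsed row to whichever class owns the target table.
  def push_to(some_klass)
    rows.each { |row| some_klass.handle_scraper_data(row) }
  end
end

# Hypothetical stand-in for an ActiveRecord model.
class FakeModel
  def self.records
    @records ||= []
  end

  def self.handle_scraper_data(row)
    records << { :name => row[0], :games => row[1] }
  end
end

scraper = Scraper.new([['Team A', 1], ['Team B', 2]])
scraper.push_to(FakeModel)
puts FakeModel.records.length
```

The scraper only knows that its target responds to handle_scraper_data, so the same class can feed any of the statistics tables.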

On Jun 7, 10:01 pm, “J. D.” wrote:

Any ideas of what I might be doing wrong?

You’re not using update_all correctly - check the documentation

Fred

Hi Fred:

Here’s what I managed to do on my own (believe it or not - lol ):

My Rake Task:

Basically calling the RushingOffense class from models

desc "Parse Rushing Offenses data from ncaa.org"
task :parse_rushing_offenses => :environment do
  update_rushing = RushingOffense.new
  update_rushing.scrape
end

My Model for Rushing Offense:

Here I created a “scrape” method that scrapes the data using the
Scraper class. Since this model inherits from ActiveRecord::Base, it
should be able to update the table…

class RushingOffense < ActiveRecord::Base
  def scrape
    offensive_rushing = Scraper.new('http://web1.ncaa.org/mfb/natlRank.jsp?year=2008&rpt=IA_teamrush&site=org',
                                    'table', 'statstable', '//tr')
    offensive_rushing.scrape_data
    offensive_rushing.clean_celldata
    for i in 0..offensive_rushing.numrows-1
      puts "Updating Team Name = #{offensive_rushing.rows[i][1]}."
      RushingOffense.update_all(:name => offensive_rushing.rows[i][1],
                                :games => offensive_rushing.rows[i][2])
    end
  end
end

Then finally, I have my scraper.rb file

# == Scraper Version 1.0
#
# Created By: Elricstorm
#
# _Special thanks to Soledad Penades for his initial parse idea which I
# worked with to create the Scraper program.
# His article is located at
# http://www.iterasi.net/openviewer.aspx?sqrlitid=wd5wiad-hkgk93aw8zidbw_

require 'hpricot'
require 'open-uri'

# This class is used to parse and collect data out of an html element
class Scraper #< ActiveRecord::Base
  attr_accessor :url, :element_type, :clsname, :childsearch, :doc,
                :numrows, :rows

  # Define what the url is, what element type and class name we want to
  # parse and open the url.
  def initialize(url, element_type, clsname, childsearch)
    @url = url
    @element_type = element_type
    @clsname = clsname
    @childsearch = childsearch
    @doc = Hpricot(open(url))
    @numrows = 0 # filled in by scrape_data
    @rows = []   # filled in by scrape_data
  end

  # Scrape data based on the type of element, its class name, and define
  # the child element that contains our data
  def scrape_data
    @rows = []

    (doc/"#{@element_type}.#{@clsname}#{@childsearch}").each do |row|
      cells = []
      (row/"td").each do |cell|
        if (cell/"span.s").length > 0
          values = (cell/"span.s").inner_html.split('<br />').collect { |str|
            pair = str.strip.split('=').collect { |val| val.strip }
            Hash[pair[0], pair[1]]
          }

          if values.length == 1
            cells << cell.inner_text.strip
          else
            cells << values # an Array has no #strip, so keep the parsed pairs as they are
          end
        else
          cells << cell.inner_text.strip
        end
      end
      @rows << cells
    end
    @rows.shift      # Shifting removes the row containing the <th> table header elements.
    @rows.delete([]) # Remove any empty rows in our array of arrays.
    @numrows = @rows.length
  end

def clean_celldata
@rows[@numrows-1][0] = 120
end

  # Print a joined list by row to see our results
  def print_values
    puts "Number of rows = #{numrows}."
    for i in 0..@numrows-1
      puts @rows[i].join(', ')
    end
  end
end


Now the only problem I have now is when I run the rake task, I don’t get
any errors and I see the puts for each team as it’s being updated (or
supposed to be updated). So, it’s counting each row as I expected.

I only tried to update 2 fields just for a test… but no data is being
listed in the database…

Any ideas of what I might be doing wrong?

This still has been a great day because even though I’ve seen tons of
errors, I’m learning…

Hi Fred,

Yeah I’m stuck with this one. I’ve checked the documentation but I’m
just not following it.

What I basically need it to do is to update the table with the data
that’s parsed into @rows.

In this case @rows is listed by:

offensive_rushing.rows[i][1] (:name)
offensive_rushing.rows[i][2] (:games)

I was trying to do a for loop to go through all of the rows and send the
new data to the database. I’m just not sure how to do it properly. I
catch on quick but I’ve been searching the web and reading the
documentation and I just don’t see a very detailed model for what I’m
trying to do.

So, in a readability format what I see is:

for i in 0..offensive_rushing.numrows-1
--> starting my loop and it’s going to repeat approx 120 times (120 teams)
puts "Updating Team Name = #{offensive_rushing.rows[i][1]}."
--> Print me out an update to show me that you are updating the teams
RushingOffense.update_all(:name => offensive_rushing.rows[i][1],
:games => offensive_rushing.rows[i][2])
--> Update the :name with the name of the team
--> Update the :games with the number of games that team has played
--> Update it if the team already exists (not sure how to do this part)
--> Add new data if the team doesn’t exist (don’t know how to do this part)

I hope that helps…

On Jun 8, 12:02 am, Frederick C. wrote:

On Jun 7, 10:01 pm, “J. D.” wrote:

Any ideas of what I might be doing wrong?

You’re not using update_all correctly - check the documentation

Well the documentation may not mention the usage you are using, but it
does exist, sorry about that. You do seem to be using it slightly
oddly though: you call update_all multiple times, but you don’t
specify any conditions, so each call to update_all overwrites the
changes made by the previous one.

Fred
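
The effect Fred describes can be imitated without a database. In the sketch below, FakeTable is a hypothetical in-memory stand-in whose update_all mimics what an UPDATE with no conditions does: it touches every existing row and never inserts one. That second point is also why an empty table stays empty even though the loop prints a line per team.

```ruby
# Hypothetical in-memory imitation of a table. update_all here mimics
# an UPDATE statement issued without any conditions.
class FakeTable
  attr_reader :rows

  def initialize(rows)
    @rows = rows
  end

  # Every existing row receives the new attributes; nothing is inserted.
  def update_all(attrs)
    @rows.each { |row| row.merge!(attrs) }
  end
end

scraped = [['Team X', 10], ['Team Y', 11]]

table = FakeTable.new([{ :name => 'Team A', :games => 1 },
                       { :name => 'Team B', :games => 2 }])
scraped.each { |name, games| table.update_all(:name => name, :games => games) }
p table.rows # both rows now hold the values from the last scraped row

empty = FakeTable.new([])
scraped.each { |name, games| empty.update_all(:name => name, :games => games) }
p empty.rows # still empty: an update never creates rows
```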

On Jun 8, 12:22 am, “J. D.” wrote:

Hi Fred,

puts "Updating Team Name = #{offensive_rushing.rows[i][1]}."
--> Print me out an update to show me that you are updating the teams
RushingOffense.update_all(:name => offensive_rushing.rows[i][1],
:games => offensive_rushing.rows[i][2])
--> Update the :name with the name of the team
--> Update the :games with the number of games that team has played
--> Update it if the team already exists (not sure how to do this part)
--> Add new data if the team doesn’t exist (don’t know how to do this part)

Sounds like you shouldn’t be using update_all at all here, rather you
should be using find to find an appropriate row to update and if there
is none, create a new one.

Fred
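
The find-then-update-or-create flow Fred describes might look like the sketch below. FakeStore is a hypothetical in-memory stand-in for the rushing_offenses table so the snippet runs without Rails, and upsert is an invented method name. With ActiveRecord the lookup would be a finder on the model (Rails of this era offers dynamic finders such as find_by_name), followed by either updating the record found or creating a new one.

```ruby
# Hypothetical in-memory store standing in for the rushing_offenses table.
class FakeStore
  attr_reader :records

  def initialize
    @records = []
  end

  # Returns the matching record, or nil when there is none.
  def find_by_name(name)
    @records.find { |record| record[:name] == name }
  end

  # Update the team's row if it exists, otherwise add a new row.
  def upsert(name, attrs)
    record = find_by_name(name)
    if record
      record.merge!(attrs)
    else
      @records << { :name => name }.merge(attrs)
    end
  end
end

store = FakeStore.new
store.upsert('Team A', :games => 1) # creates the row
store.upsert('Team A', :games => 2) # updates the same row
store.upsert('Team B', :games => 1) # creates a second row
p store.records
```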

Again, the problem is I don’t know how. I’m simply guessing based on
what I see with the documentation. I don’t have any working examples
and most of the tutorials I see are very basic…

How I plan to manage the data is important as well.

For instance, I want to keep weekly data snapshots. So, as an example
just using the rushing offense table:

A user will be able to check by a particular week (the cron job will run
the rake task once per week)

Therefore, my database table needs to account for “new data” every
single week.

Scenario:

Rake Task begins
Check for weekly snapshot data (for current week)
– If no snapshot data then create it
– If data already exists for current week do nothing
Next Week
Rake Task begins
Check for weekly snapshot data (for current week)
– If no snapshot data then create it
– If data already exists for current week do nothing

So, let’s look at my current table structure:

:rank
:name
:games
:carries
:net
:avg
:tds
:ydspg
:wins
:losses
:ties

So, the first issue I see is that I do not have a column that accounts
for some type of weekly snapshot event notification. Would you
recommend this be tied to a timestamp? How would I check (based on the
conditions above) to check against a particular timestamp range and
produce the results…?

Or should I create another column to check this out?
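
The weekly check in the scenario above can be sketched in plain Ruby, assuming each row carries a date saying which snapshot it belongs to. The helper names week_range and snapshot_exists? and the list of snapshot dates are hypothetical. Weeks are taken to start on Monday here, which is an assumption of this sketch.

```ruby
require 'date'

# Monday-to-Sunday range containing the given date.
def week_range(date)
  monday = date - (date.cwday - 1) # cwday: Monday is 1, Sunday is 7
  monday..(monday + 6)
end

# True when any stored snapshot date falls inside the current week.
def snapshot_exists?(snapshot_dates, today)
  range = week_range(today)
  snapshot_dates.any? { |date| range.cover?(date) }
end

today = Date.new(2008, 6, 8)
puts week_range(today).first.to_s
puts snapshot_exists?([Date.new(2008, 6, 3)], today)
puts snapshot_exists?([Date.new(2008, 6, 1)], today)
```

In the application the same test would become a query with conditions on the snapshot column, and the rake task would only create rows when that query comes back empty.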

And, lastly, is there somewhere online that code is available to view
for “advanced table manipulation”? Much of the code that I have found
is either very outdated, very basic, or not something I can use. The
documentation is a decent start but it does not contain a lot of
advanced examples…

I know I may be asking a lot of questions (and I apologize if I am).
However, I do learn quickly and I’m the type of person that likes to
dive in and get started. I’ve read one full ruby book and am midway
through my first rails book. However, even these books do not provide
me scenario based examples.

This is why I’m here. I am better at understanding code when I see
code. I don’t mind working through code that contains errors and trying
to get it to work. That just helps me gain an understanding of what
occurs. The API can only be used as a code bits reference. I always
look there first but which code are you looking for? If you know
exactly what method you are going to be working with, looking in the API
and then scouring the web for information is a little easier. In the
case of my example above, I’m not sure which methods I will be working
with exactly to accomplish my task.

Thanks.

By the way Fred,

I really do appreciate you taking the time to help me and isolate some
of my issues. I want to be proactive with my own code and later on with
helping others. My goal is to gain an understanding of best practice
methods and start utilizing those methods in my code from the start.

I want to do whatever it is I need to do to get things going. If you
say I need to go to X site (I’ll go to X site), etc.

I’m very focused at the task at hand.

Hi Fred,

I think I will use this for my find parameter:

start_date = Time.now.beginning_of_week
end_date = Time.now.end_of_week
@rushing_offenses = RushingOffense.find(:all, :conditions =>
  ['created_at > ? and created_at < ?', start_date, end_date])

That will let me find anything created within the set week. Now I just
have to figure out how to check whether or not it returns nil and create
data…

On Jun 8, 5:11 am, “J. D.” wrote:

have to figure out how to check whether or not it returns nil and create
data…

It will never return nil. It will return an array (possibly an empty
one). You might want to set your own timestamp and use that rather
than relying on created at (so that the date is one that is
significant to your data and not just when you happened to run your
scraper)

Fred
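
Because the finder hands back an array, the check to make is for emptiness rather than nil. A minimal plain-Ruby illustration follows; needs_snapshot? is a hypothetical helper name, and the empty array stands in for a query that matched nothing.

```ruby
# An empty result set is still an Array, not nil.
results = []

puts results.nil?   # false
puts results.empty? # true

# Hypothetical helper: create this week's snapshot only when nothing was found.
def needs_snapshot?(results)
  results.empty?
end

puts needs_snapshot?(results)
```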

Hi Fred,

Yep, you were correct. If the query matches nothing it returns an empty
array [], so I’ll make some checks against that. I’ll also take your
advice and create a new timestamp column called compiled_on.

Thanks.

Okay,

The end result was modifying the model for the table I was working with
to do the following:

class RushingOffense < ActiveRecord::Base
  def scrape
    offensive_rushing = Scraper.new('http://web1.ncaa.org/mfb/natlRank.jsp?year=2008&rpt=IA_teamrush&site=org',
                                    'table', 'statstable', '//tr')
    offensive_rushing.scrape_data
    offensive_rushing.clean_celldata
    start_date = Time.now.beginning_of_week
    end_date = Time.now.end_of_week
    current_date = Time.now
    @rushing_offenses = RushingOffense.find(:all, :conditions =>
      ['compiled_on > ? and compiled_on < ?', start_date, end_date])
    if @rushing_offenses == [] # means we have an empty array
      for i in 0..offensive_rushing.numrows-1
        puts "Updating Offensive Rushing Statistics for #{offensive_rushing.rows[i][1]}."
        RushingOffense.create(:rank => offensive_rushing.rows[i][0],
                              :name => offensive_rushing.rows[i][1],
                              :games => offensive_rushing.rows[i][2],
                              :carries => offensive_rushing.rows[i][3],
                              :net => offensive_rushing.rows[i][4],
                              :avg => offensive_rushing.rows[i][5],
                              :tds => offensive_rushing.rows[i][6],
                              :ydspg => offensive_rushing.rows[i][7],
                              :wins => offensive_rushing.rows[i][8],
                              :losses => offensive_rushing.rows[i][9],
                              :ties => offensive_rushing.rows[i][10],
                              :compiled_on => current_date)
      end
    end
    if @rushing_offenses != [] # means the current week's data is not empty
      puts "Current Week's Data Is Already Populated!"
    end
  end
end

This code works 100% and doesn’t overlap.

However, if you could take a look at this code and let me know if
there’s something I should change to make it “better” or follow “best
practices” to shorten or make it more efficient, I would be
appreciative.
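
One possible tightening, offered as a sketch rather than the definitive answer: the eleven index-by-index assignments can be replaced by pairing a list of column names with each scraped row. COLUMNS and attributes_for are hypothetical names, and the row values below are placeholders, not real statistics.

```ruby
# Column order matches the order of the cells in each scraped row.
COLUMNS = [:rank, :name, :games, :carries, :net, :avg,
           :tds, :ydspg, :wins, :losses, :ties]

# Pair each column with the cell in the same position, then add the
# snapshot timestamp.
def attributes_for(row, compiled_on)
  Hash[COLUMNS.zip(row)].merge(:compiled_on => compiled_on)
end

# Placeholder row standing in for one entry of offensive_rushing.rows.
row = [1, 'Team A', 0, 0, 0, 0.0, 0, 0.0, 0, 0, 0]
attrs = attributes_for(row, Time.now)
puts attrs[:name]
puts attrs.keys.length
```

Inside the model, the loop body could then shrink to a single RushingOffense.create(attributes_for(row, current_date)) call while iterating with offensive_rushing.rows.each, and the two if statements could become one if/else on @rushing_offenses.empty?.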

I feel great now having completed my first difficult action with rails.

Just to throw another spanner in the works for you, I wonder if this
wouldn’t be achieved more easily using scRUBYt!. The latest skimr
branch (GitHub - scrubber/scrubyt at skimr) lets you quite
easily store the results of a scrape directly into an ActiveRecord
model.

Drop me a line if you need me to provide a more concrete example.

Glenn