Different code for each record, how to implement?

Hi,

I want to screen scrape information from some websites (I have
permission to do it).

I am using the Mechanize plugin. The websites are different from each
other, so I need to write different Rails code to screen scrape each
website. There would be hundreds of different websites.

Ok, the problem is that I don’t know how to implement this in an
elegant and efficient way. My current quick and dirty solution is a
model that I call when I want to screen scrape a website:

I call it like: Spider.crawl(website_id)

It looks like:

require 'mechanize'

class Spider < ActiveRecord::Base

  # Class method, since it is called as Spider.crawl(website_id)
  def self.crawl(website_id)

    if website_id == 1
      # Mechanize code for screen scraping website 1
    end

    if website_id == 2
      # Mechanize code for screen scraping website 2
    end

    # .....

  end

end

How can I improve that?
Is there at least a way to put the code for each website in an
external file, so then I can call just the code I need? That way I
would avoid working with a model that has thousands of lines…

Thanks for your help!

Here are my off-the-top-of-my-head suggestions:

Different Thor scripts for each website, perhaps with a single script
to call the rest of them.

I did something similar for scraping shopping cart information. Since
I needed the same data on every page, I wrote a generic crawler which
would read the XPath string from the database for each item I wanted
to scrape. Worked well.
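
A minimal sketch of that data-driven idea, assuming every site exposes the same fields and only the XPath expressions differ. The SELECTORS hash stands in for the database table, and REXML (which ships with Ruby) stands in for Mechanize so the sketch runs without extra gems. The site names, field names, and sample markup are all hypothetical.

```ruby
require 'rexml/document'

# Stand-in for the database table: one set of XPath strings per site.
# Site names, fields, and expressions are made up for illustration.
SELECTORS = {
  'shop_a' => { title: "//h1[@id='name']",        price: "//span[@class='price']" },
  'shop_b' => { title: "//div[@class='item']/h2", price: "//p[@id='cost']" }
}

# Generic crawler: the same code runs for every site, only the data differs.
def scrape(site_name, markup)
  doc = REXML::Document.new(markup)
  SELECTORS.fetch(site_name).each_with_object({}) do |(field, xpath), result|
    node = REXML::XPath.first(doc, xpath)
    result[field] = node && node.text.strip
  end
end

page_a = "<html><body><h1 id='name'>Teapot</h1><span class='price'> 12.50 </span></body></html>"
page_b = "<html><body><div class='item'><h2>Kettle</h2></div><p id='cost'>30.00</p></body></html>"

p scrape('shop_a', page_a)
p scrape('shop_b', page_b)
```

With this layout, adding a site means adding a row of expressions rather than writing new code, which only works while the sites really do share the same fields.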

On 12 July 2011 10:02, aupayo [email protected] wrote:

How can I improve that?
Is there at least a way to put the code for each website in an
external file, so then I can call just the code I need? That way I
would avoid working with a model that has thousands of lines…

If you just want to split it up then provide a set of models (not
based on ActiveRecord), one for each site and call the scrape method
from your switch list (which would be better as a case statement). If
you derive them all from a common base then you can put any common
code in the base.
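
A rough sketch of how that could look, with plain Ruby classes and a case statement doing the dispatch. The class names, the normalize helper, and the file layout mentioned in the comments are assumptions made for illustration, and the Mechanize calls are replaced by placeholder strings.

```ruby
# Common base: shared helpers live here. Not an ActiveRecord model.
class SiteScraper
  def scrape
    raise NotImplementedError, "#{self.class} must implement #scrape"
  end

  private

  # Example of common code that every site scraper can reuse.
  def normalize(text)
    text.strip.gsub(/\s+/, ' ')
  end
end

# One class per site; each could live in its own file
# (e.g. lib/scrapers/first_site_scraper.rb, a hypothetical layout).
class FirstSiteScraper < SiteScraper
  def scrape
    # Mechanize code for website 1 would go here.
    normalize("  data   from site 1 ")
  end
end

class SecondSiteScraper < SiteScraper
  def scrape
    # Mechanize code for website 2 would go here.
    normalize("data from\nsite 2")
  end
end

class Spider
  def self.crawl(website_id)
    scraper = case website_id
              when 1 then FirstSiteScraper.new
              when 2 then SecondSiteScraper.new
              else raise ArgumentError, "no scraper for website #{website_id}"
              end
    scraper.scrape
  end
end

puts Spider.crawl(1)
puts Spider.crawl(2)
```

The caller still uses Spider.crawl(website_id), but each site's code now sits in its own small class instead of one long method.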

Colin

On Tue, Jul 12, 2011 at 2:02 AM, aupayo [email protected] wrote:

Hi, you can define a base class which contains all the common
information for all your sites. Then you can define a subclass for
each site that inherits from the base class. For example,

class Site

  attr_accessor :name

  # to_s should return a String rather than print one
  def to_s
    "using #{self.class}#to_s"
  end

  def crawl
    puts "using #{self.class}#crawl"
  end

end

class HerSite < Site
  def crawl
    puts "using #{self.class}#crawl version 1"
  end
end

class HisSite < Site
  def crawl
    puts "using #{self.class}#crawl version 2"
  end
end

Next, you can define a SiteFactory class that creates an instance of
the given class representing a site:

class SiteFactory

  def create( site )
    site.new
  end

end

We can define our Spider class that has a single class method that
takes an instance of a site and invokes its crawl instance method.

class Spider

  def self.crawl_site( site )
    site.crawl
  end

end

Putting it all together, we can crawl all of our sites by doing the
following:

site_factory = SiteFactory.new

[ HerSite, HisSite ].each do | klass |
  site = site_factory.create( klass )
  Spider.crawl_site( site )
end

Finally, anytime you want to add a new site, you just create a class
that inherits from Site and has a single instance method called crawl
that describes its strategy for navigating the site. There’s an easier
way to obtain all the classes that inherit from Site, and I leave this
as an exercise for you.

Good luck,

-Conrad

On 07/12/2011 08:42 AM, Conrad T. wrote:

The above class can be refactored to the following:

class SiteFactory
  def self.create( site )
    site.new
  end
end

I’m just curious, what exactly is the point of this class?

Now, we can rewrite our calling routine to the following:

[ HerSite, HisSite ].each do | klass |
  site = SiteFactory.create( klass )
  Spider.crawl_site( site )
end

Seems needlessly verbose, why not just get rid of the factory that isn’t
doing anything and just do…

 [ HerSite, HisSite ].each do | klass |
    Spider.crawl_site(klass.new)
 end

In fact, why not just…

 Site.subclasses.each { | klass | Spider.crawl_site(klass.new) }

Forgive me, I’m a Smalltalker, but this whole explicit factory business
and explicit arrays of classes just looks too Java’ish in an object
system with meta classes and reflection. Is there some reason you
wouldn’t just reflect the subclasses? Is there some reason for a
factory that does nothing? Even if you need a factory, why wouldn’t you
just use class methods on Site?


Ramon L.
http://onsmalltalk.com

On Tue, Jul 12, 2011 at 9:46 AM, Ramon L. [email protected] wrote:

Site.subclasses.each { | klass | Spider.crawl_site(klass.new) }

Yes, the above is possible, but I can see where just getting all the
subclasses of a class might not be what you want.

Forgive me, I’m a Smalltalker, but this whole explicit factory business and
explicit arrays of classes just looks too Java’ish in an object system with
meta classes and reflection. Is there some reason you wouldn’t just reflect
the subclasses? Is there some reason for a factory that does nothing? Even
if you need a factory, why wouldn’t you just use class methods on Site?

Next, the Ruby language 1.9.2/1.9.3dev doesn’t support a built-in
method called subclasses like Smalltalk does. Thus, one could implement
a subclasses method in the Ruby language as follows:

class Class
  # Select all the classes that are derived from self (i.e. Site).
  def subclasses
    ObjectSpace.each_object(Class).select { |klass| klass < self }
  end
end

This requires reopening the class called Class and defining a method
called subclasses. Alternatively, one can use a built-in Ruby hook
method called inherited to arrive at the same result. For example,

class Site

  @subclasses = []

  class << self
    attr_reader :subclasses
  end

  # Called by Ruby each time a class inherits from Site
  def self.inherited( klass )
    @subclasses << klass
  end

  # to_s should return a String rather than print one
  def to_s
    "using #{self.class}#to_s"
  end

  def crawl
    puts "using #{self.class}#crawl version 0"
  end

end
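
To see the hook in action, here is a small self-contained variation. It is a sketch under a few assumptions: the list is exposed as registry (a name chosen here to avoid clashing with the Class#subclasses that Ruby 3.1+ and ActiveSupport provide), inherited always appends to Site's own list so that subclasses of subclasses are recorded too, and crawl returns a string instead of printing so the result is easy to inspect.

```ruby
class Site
  @registry = []

  class << self
    # Hypothetical accessor name; see the note above about Class#subclasses.
    def registry
      @registry
    end

    def inherited(klass)
      super
      # Always append to Site's own list. @registry is a class-level
      # instance variable that exists only on Site itself.
      Site.registry << klass
    end
  end

  def crawl
    "using #{self.class}#crawl version 0"
  end
end

class HerSite < Site
  def crawl
    "using #{self.class}#crawl version 1"
  end
end

class HisSite < Site
  def crawl
    "using #{self.class}#crawl version 2"
  end
end

# Every subclass registered itself when it was defined.
Site.registry.each { |klass| puts klass.new.crawl }
```

A class only registers itself once its definition has been evaluated, so in an application that loads files lazily the per-site files need to be required before iterating over the list.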

Ramon, you’re correct in saying that the SiteFactory class could be
removed for a much more concise solution.

-Conrad

On Tue, Jul 12, 2011 at 8:33 AM, Conrad T. [email protected] wrote:

The SiteFactory class above can be refactored to the following:

class SiteFactory
  def self.create( site )
    site.new
  end
end

Now, we can rewrite our calling routine to the following:

[ HerSite, HisSite ].each do | klass |
  site = SiteFactory.create( klass )
  Spider.crawl_site( site )
end

Enjoy,

-Conrad

ps: There’s always something you missed after you click send.

Thank you all so much. I did it like you said, with a set of models
not based on ActiveRecord.

Best regards,

Cristóbal