Where to store scrape results?


#1

Given a user's search query, my app goes off and scrapes a few sites and
provides the results to the user. The user can also choose to filter
these results even further by category, age etc., and the page will be
updated via Ajax without refreshing. None of the result items are static.
Except for its title, the information for an item will change every 2
hours, so there's no point in caching the data to a database.

Given that I want to allow filtering of the results, how should I go
about storing the results after scraping? There will be at most about
1000 results, each comprising about 300 characters.

Can I just store them in a @@results variable? How do I overcome the
wiping of the data whilst in development mode?

I'm new to Rails. I've also read up on sessions, memcache etc., but I'm
not really sure whether they are what's needed for this situation.

Can anyone help?


#2

On Sun, Apr 5, 2009 at 5:33 AM, Adam A.
removed_email_address@domain.invalid wrote:

1000 results each comprising about 300chars.

Why not write the results to a file?
You could write the raw (pre-scraped) data to a file and re-scrape it,
or you could save the data structure in some format (YAML is an option
here).
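
A minimal sketch of the YAML-file option, assuming the scraped results
are plain hashes with string keys. The field names and the file path are
made up for illustration and are not from the original app:

```ruby
require 'yaml'
require 'tmpdir'

# Hypothetical scraped results: plain hashes with string keys so they
# round-trip through YAML without any custom classes.
results = [
  { 'title' => 'Red bike',  'category' => 'sports', 'age' => 2 },
  { 'title' => 'Blue sofa', 'category' => 'home',   'age' => 5 }
]

# Hypothetical location for the dump; any writable path would do.
path = File.join(Dir.tmpdir, 'scrape_results.yml')

# Save the whole structure in one go...
File.write(path, YAML.dump(results))

# ...and restore it later as ordinary arrays and hashes.
restored = YAML.safe_load(File.read(path))
```

Sticking to strings and integers keeps the file readable with
YAML.safe_load, which refuses to build arbitrary objects.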

Andrew T.
http://ramblingsonrails.com
http://www.linkedin.com/in/andrewtimberlake

“I have never let my schooling interfere with my education” - Mark Twain


#3

You can easily create a table, and stick each result in as a row.
In Rails, SQLite is easy enough; if your site is bigger you
can use DB2.
If it's like most sites, you make a “result” table
that is associated with a user table.


#4

Hi, thanks for your replies.

My main concern is performance. The data is not scraped in advance;
it's scraped on demand by my users. They submit a search query, which I
then run against several sites, scrape their results and aggregate them
for the user. My site is basically a meta search engine.

Storing results in a db:

pros: I get to use MySQL find conditions when the user wants to filter
the results even more.

cons: I'll only be storing these results temporarily. As soon as the
user does a new search they're gone forever. I don't know the
performance hit of storing 1000 results in a db, each consisting of
several fields. Is a db still a wise choice?

Using YAML:

pros: not sure, but hey, I like using it!
cons: no MySQL conditions, so I'd have to create my own methods

Does the above change anything?


#5

On Sun, Apr 5, 2009 at 10:31 AM, Adam A.
removed_email_address@domain.invalid wrote:

pros: I get to use MySQL find conditions when the user wants to filter
cons: no MySQL conditions, so I'd have to create my own methods

does the above change anything?



The benefit of YAML is that once you’ve scraped the data, you probably
already have a structure in place which can easily be saved and
restored.
You could combine the two by storing the YAML in the database.

From a performance perspective, consider caching the results of the
scraping for at least some period of time so that you don’t have to
scrape on every search (unless the source websites change VERY
frequently)
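
As a rough sketch of that combination (YAML stored alongside a
timestamp, reused until it is too old): the ScrapeCache class, its
ttl_seconds argument and the injectable clock are hypothetical names for
illustration, and a plain Hash stands in for the database table.

```ruby
require 'yaml'

# Hypothetical cache keyed by search query. Each entry holds the YAML
# dump of the results plus the time it was stored. A Hash stands in for
# the database table here.
class ScrapeCache
  def initialize(ttl_seconds, clock = -> { Time.now })
    @ttl = ttl_seconds
    @clock = clock   # injectable so the expiry logic can be tested
    @store = {}
  end

  # Returns cached results if they are younger than the TTL; otherwise
  # runs the block (the real scrape) and caches what it returns.
  def fetch(query)
    entry = @store[query]
    if entry && (@clock.call - entry[:stored_at]) < @ttl
      return YAML.safe_load(entry[:yaml])
    end

    results = yield
    @store[query] = { yaml: YAML.dump(results), stored_at: @clock.call }
    results
  end
end
```

With a two-hour TTL, usage might look like
`cache.fetch(query) { scrape_all_sites(query) }`, where scrape_all_sites
is whatever method already does the scraping.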

Andrew T.


#6

On Sun, Apr 5, 2009 at 12:17 PM, Adam A.
removed_email_address@domain.invalid wrote:

the cache in the db and grab the yaml. Then use yaml to turn the info

Sounds good to me.
I always focus on getting the job done in the simplest way possible
first. Then work on optimisation if you see a bottleneck.
Your biggest problem is likely to be fetching all the other sites for
scraping, which caching will hopefully help with.

Andrew T.


#7

Thanks Andrew for ruling out any doubts I had regarding using YAML.

I will cache the results then for around 2 hours in a db.

I'm now wondering how this will affect the performance of filtering.

My guess is that when a user selects some filters on the results screen,
these get passed as params back to the controller's index action. Logic
there will determine that it's a request to filter existing results,
access the cache in the db and grab the YAML. It will then use YAML to
turn the data back into the relevant objects and use Enumerable's
find_all method to filter the results…
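
Roughly, that filtering step could look like the sketch below. The
filter_results helper, the 'category' and 'max_age' param names and the
sample data are all assumptions for illustration; cached_yaml stands in
for the YAML string read back from the db.

```ruby
require 'yaml'

# Stand-in for the YAML string that would be read from the cache table.
cached_yaml = YAML.dump([
  { 'title' => 'Red bike',  'category' => 'sports', 'age' => 2 },
  { 'title' => 'Old skis',  'category' => 'sports', 'age' => 9 },
  { 'title' => 'Blue sofa', 'category' => 'home',   'age' => 1 }
])

# Hypothetical helper: keeps the items matching every filter that is
# present in params and ignores filters that were not supplied.
def filter_results(results, params)
  results.find_all do |item|
    (params['category'].nil? || item['category'] == params['category']) &&
      (params['max_age'].nil? || item['age'] <= params['max_age'].to_i)
  end
end

results  = YAML.safe_load(cached_yaml)
filtered = filter_results(results, { 'category' => 'sports', 'max_age' => '3' })
titles   = filtered.map { |item| item['title'] }
```

For about 1000 small items, a single find_all pass in memory is a
simple place to start before reaching for anything more elaborate.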

Do you think that approach is OK, or is there a better way of doing it?

Many thanks once again. You have been a great help.


#8

Excellent thanks once again Andrew! Appreciate your advice.


#9

Just thinking, your scrape should probably be in a worker, sticking the
results in a db. Depending on what you're using, you can configure it to
be a temp table even.
Then in your search window you can do Ajax-based updates from the
scrape, with the ability to then clear up the cache. You get more
concurrency, and with the right JavaScript you could cancel the scrape
in process.

I think this would scale and be more responsive.
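
To make the worker idea a bit more concrete, here is a plain-Ruby
sketch. A background thread stands in for a real worker process, an
in-memory array stands in for the temp table, and ScrapeJob and
scrape_site are hypothetical names. The Ajax side would poll `results`
and call `cancel` if the user gives up.

```ruby
require 'thread'

# Hypothetical stand-in for fetching and scraping one site.
def scrape_site(site)
  sleep 0.01 # pretend network latency
  [{ 'site' => site, 'title' => "Result from #{site}" }]
end

# Runs the scrape in a background thread so partial results can be read
# while later sites are still being fetched.
class ScrapeJob
  def initialize(sites)
    @sites = sites
    @results = []
    @lock = Mutex.new
    @cancelled = false
    @thread = nil
  end

  def start
    @thread = Thread.new do
      @sites.each do |site|
        break if cancelled?   # stop before starting the next site
        found = scrape_site(site)
        @lock.synchronize { @results.concat(found) }
      end
    end
    self
  end

  # Snapshot of whatever has been scraped so far.
  def results
    @lock.synchronize { @results.dup }
  end

  def cancel
    @lock.synchronize { @cancelled = true }
  end

  def cancelled?
    @lock.synchronize { @cancelled }
  end

  # Blocks until the background thread has finished.
  def wait
    @thread.join if @thread
    self
  end
end

job = ScrapeJob.new(%w[site-a site-b site-c]).start
job.wait
all_sites = job.results.map { |r| r['site'] }
```

Note that cancelling here only prevents the next site from being
started; a fetch that is already in flight runs to completion.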



#10

Thanks glennswest, I'm relatively new to Rails. Whilst I think I
understood what you said, can you (or anyone else) elaborate further on
the points below? I really appreciate your help.

Just thinking, your scrape should probably be in a worker,

When you say a worker, I take it you mean some temporary database?

Depending on what you're using, you can configure it to be a temp
table even.
Then in your search window you can do Ajax-based updates from the
scrape.

From the above, do you mean that whilst I'm scraping results from sites,
when one site's results get added to the db and I go off scraping
another site's results, I can simultaneously show the results that were
just added on the screen?

With the ability to then clear up the cache.

After I get all the results and display them on the screen, I can then
clear the table?

You get more concurrency,

I wasn't too sure what you meant by this, but that's because I'm fresh
to Rails and can't gather it from the context.

and with
the right javascript you could cancel the scrape in process.

Ahh, so if the user decides to cancel the request whilst I'm scraping
and simultaneously presenting already scraped data from the db, I can
terminate the outstanding scrape tasks via some JavaScript call and
move on?

Think this would scale and be more responsive

In general, how fast/slow is it to update a table with around 1000
results? Is it fast enough to handle this situation? I'd prefer to stick
the objects in a temporary db because then I'd get to use existing
ActiveRecord methods and MySQL statements. I'm just worried about the
performance.


#11

Here’s your problem in Rails:

Your web server is “single” threaded, so while you're scraping, it's not
doing anything else, so you will need more Mongrels to take care of the
users.

Generally you scale by having more threads and CPUs working on the
problem.
The database is probably not going to be your bottleneck for a while;
it's more the style.

Why don't I train you a bit? We can do a screen share/Skype session.



#12

Hi Glennwest, sorry for the late reply. I'd be up for chatting over
Skype if you are. Let me know either here or via a message. Thank you
for your kind offer!

adam.