Ruby Sanity Check

Hi

Just a quick check. I am starting into Ruby with little interest in
Ruby on Rails (it’s good to know it’s there, though). I want to use Ruby
in a more data-centric way. So in my first case I want to extract
data from websites and put it in a suitable format (XML, CSV, etc.) for
use in Excel, or later for use in a DB application such as MySQL or
PostgreSQL.

I am not nuts, am I?

I was looking at several comparisons and I thought that Ruby dealt with
lists and data in a logical fashion.

Ruby provides wonderful tools to do all of this. So no, you’re not
completely nuts :slight_smile:

You could use Nokogiri (http://nokogiri.org/) to extract data from
websites. Ruby (1.9) also has a very nice CSV library that’s fast and
easy to use. It’s also pretty trivial to just go straight into the DB
with Ruby DBI or some of the ORMs like DataMapper and ActiveRecord.

Cheers,
Jason

Nope, you’re quite sane.

Sorry, I should have been a little more helpful:

http://mechanize.rubyforge.org/mechanize/

http://www.rubyinside.com/nokogiri-ruby-html-parser-and-xml-parser-1288.html

http://ar.rubyonrails.org/

On Monday, October 11, 2010 08:25:33 pm flebber wrote:


I was looking at nokogiri, would it be more feature complete than
hpricot?

Possibly. It would also likely be faster and more stable, and I would
guess it’s more actively maintained.

I was looking at nokogiri, would it be more feature complete than
hpricot?

This question is a bit more complicated than you’d imagine… there’s
a lot of history there.

Either is fine. But you should probably go with nokogiri these days.

On Oct 12, 11:11am, Steve K. [email protected] wrote:

Sorry, I should have been a little more helpful:

http://mechanize.rubyforge.org/mechanize/

http://www.rubyinside.com/nokogiri-ruby-html-parser-and-xml-parser-12

http://ar.rubyonrails.org/

Thanks for the links. I was hoping I was not nuts :-; I didn’t think
so.

I was looking at nokogiri, would it be more feature complete than
hpricot?

On Mon, Oct 11, 2010 at 6:25 PM, flebber [email protected] wrote:

I was looking at several comparisons and I thought that Ruby dealt with
lists and data in a logical fashion.

By “extract data from websites” I assume you mean screen scraping. Here
are two Railscasts about it:
http://railscasts.com/episodes/173-screen-scraping-with-scrapi
http://railscasts.com/episodes/190-screen-scraping-with-nokogiri

Some Nokogiri tutorials about it
http://nokogiri.org/tutorials

Some Mechanize examples (you will only need Mechanize if you need to
interact with the site; it uses Nokogiri under the covers. Note that it
can’t handle JavaScript, and there are some alternatives if you need
that):
http://mechanize.rubyforge.org/mechanize/EXAMPLES_rdoc.html

Depending on what site you’re trying to get info from, it might have an
API, and there might even be a gem for interacting with that API, saving
you the headache and brittleness of screen scraping.

For outputting to XML, Nokogiri can do that. If you have difficulty
getting it installed, I’ve also enjoyed using Hpricot (aside from the
API, the biggest difference is that Nokogiri is built on libxml2, a very
popular open source C library, while Hpricot is built on a Ragel
parser). If you have difficulty with that as well, the standard library
provides one called REXML.
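As a rough sketch of the REXML fallback, something like the following should build a small XML document and read it back. The element names (items, item, name, price), the sample rows, and the helper methods rows_to_xml and item_names are made up for illustration; they are not from any of the libraries or sites mentioned above.

```ruby
require 'rexml/document'

# Build an XML string from an array of hashes (hypothetical scraped rows).
def rows_to_xml(rows)
  doc = REXML::Document.new
  doc << REXML::XMLDecl.new('1.0', 'UTF-8')
  root = doc.add_element('items')
  rows.each do |row|
    item = root.add_element('item')
    # One child element per hash key, with the value as its text.
    row.each { |key, value| item.add_element(key.to_s).text = value.to_s }
  end
  doc.to_s
end

# Pull the text of every <name> element back out with XPath.
def item_names(xml)
  REXML::XPath.match(REXML::Document.new(xml), '//item/name').map(&:text)
end

rows = [
  { name: 'Widget', price: '9.99' },
  { name: 'Gadget', price: '24.50' }
]

xml = rows_to_xml(rows)
puts xml
p item_names(xml)
```

REXML is slower than Nokogiri on big documents, but it needs no C extension, which is the point if installation is the problem.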

Also consider YAML, which is built into the stdlib (but has difficulty,
I found, dealing with huge data sets).
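A minimal sketch of the YAML route, assuming the data is plain arrays, hashes, strings and numbers. The method names save_records and load_records, the file name, and the sample records are invented for the example.

```ruby
require 'yaml'
require 'tmpdir'

# Write an array of plain hashes to a YAML file.
def save_records(records, path)
  File.write(path, records.to_yaml)
end

# Read the records back as ordinary Ruby arrays and hashes.
def load_records(path)
  YAML.load_file(path)
end

records = [
  { 'name' => 'Widget', 'price' => 9.99 },
  { 'name' => 'Gadget', 'price' => 24.5 }
]

Dir.mktmpdir do |dir|
  path = File.join(dir, 'records.yml')
  save_records(records, path)
  puts File.read(path)
  p load_records(path)
end
```

String keys are used on purpose: recent versions of the YAML library refuse to load symbols by default, so plain strings are the least surprising choice.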

There are a couple of gems for JSON; I can’t remember which one I’ve
used.
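For what it’s worth, Ruby 1.9 also ships a json library in the stdlib, so a sketch like this should work without installing a gem. The helper names and sample records are hypothetical.

```ruby
require 'json'

# Serialize records to indented, human-readable JSON text.
def records_to_json(records)
  JSON.pretty_generate(records)
end

# Parse JSON text back into arrays and hashes (keys come back as strings).
def json_to_records(text)
  JSON.parse(text)
end

records = [
  { 'name' => 'Widget', 'price' => 9.99 },
  { 'name' => 'Gadget', 'price' => 24.5 }
]

text = records_to_json(records)
puts text
p json_to_records(text)
```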

For CSV, the fastercsv gem.
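On Ruby 1.9 and later the stdlib csv library is FasterCSV under a new name, so the gem is only needed on 1.8. A small sketch using the stdlib version, with invented headers, rows, and helper names:

```ruby
require 'csv'

# Turn a header row plus data rows into CSV text.
# Fields containing commas or quotes are quoted automatically.
def rows_to_csv(headers, rows)
  CSV.generate do |csv|
    csv << headers
    rows.each { |row| csv << row }
  end
end

# Parse CSV text with a header row into an array of hashes.
def csv_to_hashes(text)
  CSV.parse(text, headers: true).map(&:to_h)
end

headers = %w[name price]
rows = [['Widget', '9.99'], ['Gadget, large', '24.50']]

csv_text = rows_to_csv(headers, rows)
puts csv_text
p csv_to_hashes(csv_text)
```

Excel opens a file like this directly, which may be enough if all you need is to look at the data.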

I am almost certain there are tools for interacting with Excel, but I’m
on a Mac, so not able to really help there.

Depending on what you’re doing, you may not need the intermediate form
to be human readable (maybe you just need to persist an array of strings
between runs of your script, or something like that). If that is the
case, you can just marshal the data:
http://ruby-doc.org/core/classes/Marshal.html This is probably the
easiest solution, and really fast, but it means your data can only be
read back by Ruby.
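A sketch of what marshalling between runs might look like. The method names save_state and load_state, the file name, and the sample array are assumptions for the example; Marshal.dump and Marshal.load are the real core methods.

```ruby
require 'tmpdir'

# Dump any marshallable Ruby object to a binary file.
def save_state(object, path)
  File.open(path, 'wb') { |file| Marshal.dump(object, file) }
end

# Load the object back. Only do this with files you wrote yourself:
# Marshal.load is not safe on untrusted data.
def load_state(path)
  File.open(path, 'rb') { |file| Marshal.load(file) }
end

seen_urls = ['http://example.com/a', 'http://example.com/b']

Dir.mktmpdir do |dir|
  path = File.join(dir, 'seen_urls.dump')
  save_state(seen_urls, path)
  p load_state(path)
end
```

In a real script the file would live somewhere permanent rather than in a temp directory; the temp directory is only there so the example cleans up after itself.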

For dealing with databases, ActiveRecord, DataMapper, and Sequel should
be able to help you out.

ActiveRecord is extremely mature, as it’s the de facto M in Rails’ MVC,
but it requires a little bit of infrastructure to get going outside of
Rails. If you want to use it, http://guides.rubyonrails.org/ is, IMO,
the best resource. There are also lots of Railscasts that deal with it
(note that AR3 was just released, so the interface is a little
different).

DataMapper is another nice project. I like it because you can do it all
in one file without migrations (easy to get up and going): you literally
define your schema in your code. It has some other nice features, such
as guaranteeing that there will only ever be one instance of each DB row
in memory at a time (you can find yourself in some wonky situations with
AR, where it has cached results, or you load the same data twice and the
one is unaware of the other). It also has a cool solution to the n+1
problem, where it will preload data as soon as it recognizes you’re
going to query for it in a loop. Unfortunately, it’s nowhere near as
mature as ActiveRecord. I finally ended up switching my last project off
of DataMapper and onto ActiveRecord after too many headaches dealing
with polymorphism, immature libraries for it (I needed tagging), and
dissatisfaction with the IRC channel. If you don’t need external
libraries like that, you probably won’t experience such frustrations. If
you’re interested in it, it has some good tutorials on its site:
http://datamapper.org/docs/ I also really liked the $9 Peepcode about
Sinatra, which uses DataMapper to talk to its database.
http://peepcode.com/products/sinatra

I’ve not used Sequel, but I’ve seen its creator present at Ruby Midwest.
He really knows his stuff. I’ve also only heard good things about the
project, such as that it is actively developed and easy to get support
for. But my understanding is that its main strength is in connecting to
“non opinionated” (what AR would call “legacy”) databases. If you have
the ability to design yours from the beginning, some of its strengths
might not be necessary.

On Oct 12, 4:41pm, Josh C. [email protected] wrote:


Wow, thank you!!! Totally beyond expectation. It lets me spend more time
learning than searching, very much appreciated. I will update when I
better understand the tools and what flow I am going to use.

If you need a headless browser to scrape websites which do a lot of Ajax
stuff, use Celerity: http://celerity.rubyforge.org/

flebber wrote in post #949224:

Hi

Just a quick check. I am starting into Ruby with little interest in
Ruby on Rails (it’s good to know it’s there, though). I want to use Ruby
in a more data-centric way. So in my first case I want to extract
data from websites and put it in a suitable format (XML, CSV, etc.) for
use in Excel, or later for use in a DB application such as MySQL or
PostgreSQL.

I am not nuts, am I?

I was looking at several comparisons and I thought that Ruby dealt with
lists and data in a logical fashion.

For Excel I would recommend win32ole. It is available natively on
Windows, or you can install it on Linux through Wine. I have no clue
about Mac.
http://www.perlmonks.org/?node_id=430194

Roo also looks pretty cool, but I haven’t done anything with it yet
since it can’t write to cells in Excel.
http://roo.rubyforge.org

But if you are putting the data into a database anyway, it can read the
data and may be what you are looking for.
