On Mon, Oct 11, 2010 at 6:25 PM, flebber [email protected] wrote:
I was looking at several comparisons and I thought that Ruby dealt with
lists and data in a logical fashion.
By “extract data from websites” I assume you mean screen scraping. Here
are two Railscasts about it
Some Nokogiri tutorials about it
Some Mechanize tutorials about it (you will only need to use Mechanize if
you need to interact with the site; it uses Nokogiri under the covers).
Depending on what site you’re trying to get info from, it might have an
API, and there might even be a gem for interacting with that API, saving
you the headache and brittleness of screen scraping.
For outputting to XML, Nokogiri can do that. If you have difficulty getting
it installed, I’ve also enjoyed using Hpricot (aside from the API, the
difference is that Nokogiri is built on libxml2, an open source
C library, while Hpricot is built on a Ragel parser), and if you have
difficulty with that as well, the standard library provides one called
REXML.
Also consider YAML, which is built into the stdlib (but has difficulty,
I’ve found, dealing with huge data sets).
There are a couple of gems for JSON; I can’t remember which one I’ve
used.
For CSV, the fastercsv gem.
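Roughly, the YAML and JSON routes mentioned above look like the sketch
below. The record fields are made up for illustration, and it assumes the
yaml and json libraries that ship with current Rubies rather than any
particular gem:

```ruby
require 'yaml'
require 'json'

# Hypothetical scraped records; the field names are invented for this sketch.
records = [
  { 'title' => 'First post',  'views' => 10 },
  { 'title' => 'Second post', 'views' => 25 }
]

yaml_text = records.to_yaml          # human-readable YAML document
json_text = JSON.generate(records)   # compact JSON string

# Parsing the text back gives equivalent Ruby arrays and hashes.
from_yaml = YAML.load(yaml_text)
from_json = JSON.parse(json_text)
```

Either string can be written to a file with File.write and read back on
the next run.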
I am almost certain there are tools for interacting with Excel, but I’m
on a Mac, so I’m not really able to help there.
Depending on what you’re doing, you may not need the intermediate form to
be human readable (maybe you just need to persist an array of strings
between runs of your script, or something like that). If that is the
case, you can just marshal the data.
http://ruby-doc.org/core/classes/Marshal.html Probably the easiest option,
and really fast, but it means your data is only readable from Ruby.
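A minimal sketch of the Marshal approach, assuming you just want to keep
an array of strings around (the array contents here are placeholders):

```ruby
# Hypothetical data to keep between runs of a script.
names = ['alpha', 'beta', 'gamma']

# Marshal.dump returns a binary String; in a real script you would write it
# to a file opened in binary mode and read it back on the next run.
blob = Marshal.dump(names)

# Marshal.load rebuilds an equal copy of the original object.
# Only load data you trust: Marshal can instantiate arbitrary Ruby objects.
restored = Marshal.load(blob)
```

The restored array is a new object that compares equal to the original.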
For dealing with databases, ActiveRecord, DataMapper, and Sequel should be
able to help you out.
ActiveRecord is extremely mature, as it’s the de facto Rails M in its MVC,
but it requires a little bit of infrastructure to get going outside of
Rails. If you want to use it, http://guides.rubyonrails.org/ is, IMO, the
best resource. There are also lots of Railscasts that deal with it (note
that AR3 was just released, so the interface is a little different).
DataMapper is another nice project. I like it because you can do it all in
one file without migrations (easy to get up and going); you literally define
your schema in your code. It has some other nice features, such as
guaranteeing that there will only ever be one instance of each DB row in
memory at a time (you can find yourself in some wonky situations with AR
where it has cached results, or you load the same data twice, and one is
unaware of the other). It also has a cool solution to the n+1 problem:
it will preload data as soon as it recognizes you’re going to query for it
in a loop. Unfortunately, it’s nowhere near as mature as ActiveRecord. I
finally ended up switching my last project off of DataMapper and onto
ActiveRecord after too many headaches dealing with polymorphism, external
libraries for it (I needed tagging), and dissatisfaction with the IRC
channel. If you don’t need external libraries like that, you probably won’t
experience such frustrations. If you’re interested in it, it has some
tutorials on its site: http://datamapper.org/docs/ I also really liked the
Peepcode about Sinatra, which uses DataMapper to talk to its database.
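In case the n+1 problem mentioned above is unfamiliar, here is a toy
illustration in plain Ruby. It is not DataMapper's API: FakeDB and its
methods are hypothetical stand-ins that only count how many queries a
loop would issue compared with a single batched load.

```ruby
# A toy stand-in for a database that counts the queries it receives.
class FakeDB
  attr_reader :query_count

  def initialize(comments_by_post)
    @comments_by_post = comments_by_post
    @query_count = 0
  end

  # One query per post: calling this in a loop is the n+1 pattern.
  def comments_for(post_id)
    @query_count += 1
    @comments_by_post.fetch(post_id, [])
  end

  # One query covering many posts: what a preloading ORM issues instead.
  def comments_for_all(post_ids)
    @query_count += 1
    post_ids.each_with_object({}) do |id, result|
      result[id] = @comments_by_post.fetch(id, [])
    end
  end
end

data = { 1 => ['a', 'b'], 2 => ['c'], 3 => [] }
post_ids = data.keys

# Naive loop: one query per post, plus the query that listed the posts.
naive = FakeDB.new(data)
post_ids.each { |id| naive.comments_for(id) }
naive_queries = naive.query_count + 1

# Batched: a single query for all comments, plus the listing query.
batched = FakeDB.new(data)
comments = batched.comments_for_all(post_ids)
batched_queries = batched.query_count + 1
```

With three posts the naive loop costs four queries and the batched
version two, and the gap grows with every extra post.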
I’ve not used Sequel, but I’ve seen its creator present at Ruby Midwest. He
really knows his stuff. I’ve also only heard good things about the
project, such as that it is actively developed and easy to get support
for. But my understanding is that its main strength is in connecting to
“non-opinionated” (what AR would call “legacy”) databases. If you have the
ability to design yours from the beginning, some of its strengths might