Web scraping article, episode 1

Hi all,

Once upon a time I wrote a silly little article on web scraping with


The article somehow got very popular, so I have decided to continue
it. Since a lot of people kept asking me for the next installment
that I promised at the end of the first part, I would like to announce
it here as well:


The article is quite different from my original plans (and hence, I
guess, from the expectations) because something happened in between.
Well, read the article and you will see what :)



Try Dapper (http://www.dappit.com/), it may turn your screen-scraping
problem into an XML parsing problem.


Whoops, ignore my last comment; it was posted to completely the wrong
thread.

Many thanks for both of your replies.

I have been playing around with Dapper, and while I liked the idea and
also the GUI (scRUBYt! currently does not have any kind of GUI, so
Dapper is a clear winner here), I found several problems.

First of all, I could not reliably scrape everything I wanted the way I
wanted (i.e. 100% accuracy, every record found, etc.) on nearly any page
I tried, which is not the case with scRUBYt!. Of course there are bugs,
problems and needed enhancements in scRUBYt! too, but I have total
control over these (as does anybody who is able to hack with Ruby on an
intermediate level). Besides this, I have the extractor, so I know what’s
happening all the time. And if this is still not enough, I can sprinkle
the whole thing with pure Ruby code.

Then, I don’t really like the model in which your extractor lives on the
server: what if you would like to scrape confidential data, or you are
logging in to sites with passwords, or to your bank account, or …

The idea is really neat and I am sure Dapper has a lot of use cases, but
it’s quite a different product, with a different philosophy and target
audience, compared to scRUBYt!. In short, you should use the right tool
for the right job, and for the things I would like to scrape, scRUBYt! is
much better suited. I am sure Dapper is great for other kinds of things,
so if you are into those, it’s a great tool!