Wikipedia Parser


#1

I need to parse Wikipedia articles (written in Wikipedia's wiki markup)
and redisplay them as HTML. Has anyone come across a Ruby library for
this? Any recommendations?

Thanks


#2

David wrote:

I need to parse and redisplay in html wikipedia articles (formatted
with the wikipedia style). Has anyone encountered such a library in
ruby ? Any libraries that are good at that?

Thanks

Check out
http://shanesbrain.net/articles/2006/10/02/screen-scraping-wikipedia
Makes it dead easy to roll your own.
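
As a rough illustration of what "rolling your own" might look like, here
is a minimal sketch that pulls the paragraph text out of an article's
HTML. It is not the code from the linked article: the helper name
paragraph_texts is made up for this example, and it uses a regular
expression from the standard library purely to stay dependency-free. A
real HTML parser is the more robust choice for anything beyond a quick
experiment.

```ruby
require "cgi"

# Hypothetical helper (not from the linked article): returns the plain
# text of each <p> element found in an HTML string.
# Assumption: paragraphs are not nested and the markup is reasonably tidy.
def paragraph_texts(html)
  html.scan(%r{<p\b[^>]*>(.*?)</p>}mi).map do |(inner)|
    text = inner.gsub(/<[^>]+>/, "")              # drop nested tags such as <a> or <b>
    CGI.unescapeHTML(text).gsub(/\s+/, " ").strip # decode basic entities, tidy whitespace
  end.reject(&:empty?)
end
```

The HTML string itself could come from Net::HTTP or from a saved copy of
the page; the helper only deals with extraction.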
Chris

http://www.autopendium.co.uk
Stuff about old cars


#3

Usually you shouldn’t use bots on Wikipedia, but should download the
free database instead and use that.
Read about their policy here:
http://en.wikipedia.org/wiki/Wikipedia:Bots

If you have your own MediaWiki install and want to use a bot, you can
check out pywikipediabot:
http://sourceforge.net/projects/pywikipediabot/
It’s not in Ruby, but it works great.


#4

Actually, I’m not entirely sure that you shouldn’t use bots at all on
Wikipedia. According to the link you provided:

“Robots or bots are automatic processes
(http://en.wikipedia.org/wiki/Process_(computing)) that interact with
Wikipedia as though they were human editors”

That last bit sounds like they’re talking about a very specific kind of
bot and not just a scraper.

RSL


#5

I wrote that article a while ago. It would be interesting to try
WWW::Mechanize or, better yet, scRUBYt; both use Hpricot in the
backend anyway.

Shane

http://shanesbrain.net


#6

“Robots or bots are automatic processes
(http://en.wikipedia.org/wiki/Process_(computing)) that interact with
Wikipedia as though they were human editors.” There’s nothing against
screen-scraping there. That policy is about bots which edit content.
Otherwise, Google would be breaking WP policy.
This is taking the discussion a little off topic though.
-Nathan


#7

If you just need to cache some pages for displaying later, screen
scraping Wikipedia is a good choice compared to downloading the db.
If you’re going to be parsing and redisplaying the content in real
time, though, that is against Wikipedia’s policy.

See
http://en.wikipedia.org/wiki/Wikipedia:Database_download#Why_not_just_retrieve_data_from_wikipedia.org_at_runtime.3F
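
To make the "cache some pages for displaying later" idea concrete, here
is a minimal sketch of a disk cache that fetches each URL once and
serves later requests from a local file, so pages are not pulled from
Wikipedia on every view. The class name PageCache, the one-day default
lifetime, and the injectable fetcher are all assumptions made for this
example, not something from the thread or from Wikipedia's guidelines.

```ruby
require "digest/sha1"
require "fileutils"
require "net/http"
require "uri"

# Hypothetical helper: fetches a page once and serves later requests from disk.
class PageCache
  # dir:     directory where cached copies are stored
  # max_age: seconds before a cached copy is considered stale (assumed default: one day)
  # fetcher: callable taking a URL and returning the body; injectable so the
  #          cache can be exercised without network access
  def initialize(dir, max_age: 24 * 60 * 60, fetcher: nil)
    @dir = dir
    @max_age = max_age
    # Default fetcher is a plain GET; it does not follow redirects.
    @fetcher = fetcher || ->(url) { Net::HTTP.get(URI(url)) }
    FileUtils.mkdir_p(@dir)
  end

  # Returns the cached body if it is still fresh, otherwise fetches and stores it.
  def fetch(url)
    path = File.join(@dir, Digest::SHA1.hexdigest(url) + ".html")
    return File.read(path) if fresh?(path)

    body = @fetcher.call(url)
    File.write(path, body)
    body
  end

  private

  def fresh?(path)
    File.exist?(path) && (Time.now - File.mtime(path)) < @max_age
  end
end
```

Usage would be along the lines of
PageCache.new("cache").fetch("https://en.wikipedia.org/wiki/Ruby"),
with the display code reading only from the cache.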