Forum: Ruby on Rails Wikipedia Parser

David (Guest)
on 2007-04-12 20:25
(Received via mailing list)
I need to parse Wikipedia articles (formatted in Wikipedia's markup
style) and redisplay them in HTML. Has anyone come across a library
for this in Ruby? Any libraries that are good at that?

Thanks
Chris Taggart (christ)
on 2007-04-12 20:41
(Received via mailing list)
David wrote:
> I need to parse Wikipedia articles (formatted in Wikipedia's markup
> style) and redisplay them in HTML. Has anyone come across a library
> for this in Ruby? Any libraries that are good at that?
>
> Thanks
Check out
http://shanesbrain.net/articles/2006/10/02/screen-...
Makes it dead easy to roll your own.
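The gist of it is something like this (a rough, untested sketch; it
assumes Hpricot plus open-uri, and the "#bodyContent" selector is just
my guess at the div Wikipedia wraps the article text in):

  require 'rubygems'
  require 'open-uri'
  require 'hpricot'

  # Fetch the rendered article page and pull out just the article body.
  url = "http://en.wikipedia.org/wiki/Ruby_(programming_language)"
  doc = Hpricot(open(url).read)

  # Adjust the selector if Wikipedia's page layout differs.
  article_html = doc.at("#bodyContent").inner_html
  puts article_html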
Chris
---------------------------------------
http://www.autopendium.co.uk
Stuff about old cars
Andy Triboletti (Guest)
on 2007-04-12 20:47
(Received via mailing list)
Usually you shouldn't use bots on Wikipedia; you should download the
free database dump instead and use that.
Read about their policy here:
http://en.wikipedia.org/wiki/Wikipedia:Bots

If you have your own MediaWiki install and want to use a bot, you can
check out the pywikipedia bot:
http://sourceforge.net/projects/pywikipediabot/  It's not in Ruby,
but it works great.
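If you do go the dump route, Ruby's standard-library REXML can stream
through the (huge) pages-articles XML file without loading it all into
memory. A rough sketch of pulling out each page's title and raw
wikitext (the filename and element names are my assumptions about the
dump format; rendering the wikitext to HTML still needs a separate
wiki-markup parser):

  require 'rexml/document'
  require 'rexml/streamlistener'

  # Minimal stream listener for a MediaWiki pages-articles XML dump.
  # Yields each page's title and raw wikitext to the given block.
  class DumpListener
    include REXML::StreamListener

    def initialize(&block)
      @block   = block
      @element = nil
      @title   = ''
      @text    = ''
    end

    def tag_start(name, attrs)
      @element = name
      @title = '' if name == 'title'
      @text  = '' if name == 'text'
    end

    def text(data)
      case @element
      when 'title' then @title << data
      when 'text'  then @text  << data
      end
    end

    def tag_end(name)
      @element = nil
      @block.call(@title, @text) if name == 'page'
    end
  end

  File.open('enwiki-latest-pages-articles.xml') do |f|
    REXML::Document.parse_stream(f, DumpListener.new { |title, wikitext|
      puts "#{title}: #{wikitext.length} bytes of markup"
    })
  end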
Russell Norris (Guest)
on 2007-04-12 22:53
(Received via mailing list)
Actually, I'm not entirely sure that you shouldn't use bots at all on
Wikipedia. According to the link you provided:

"*Robots* or *bots* are automatic processes
<http://en.wikipedia.org/wiki/Process_%28computing%29> that interact
with Wikipedia as though they were human editors"

That last bit sounds like they're talking about a very specific kind
of bot and not just a scraper.

RSL
unknown (Guest)
on 2007-04-12 23:13
(Received via mailing list)
"*Robots* or *bots* are automatic
processes<http://en.wikipedia.org/wiki/Process_%28computing%29>that
interact with Wikipedia as though they were human editors." There's
nothing against screen-scraping there. That policy is about bots which
edit
content. Otherwise, Google would be breaking WP policy.
This is taking the discussion a little off topic though.
-Nathan
Shane Vitarana (Guest)
on 2007-04-12 23:13
(Received via mailing list)
I wrote that article a while ago.  It would be interesting to use
WWW::Mechanize, or better yet scRUBYt, both of which use Hpricot in
the backend anyway.
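Something along these lines, perhaps (untested sketch; it assumes the
mechanize gem is installed and that the article body still lives in
div#bodyContent):

  require 'rubygems'
  require 'mechanize'

  # Same idea with WWW::Mechanize: it fetches the page and exposes
  # Hpricot-style searching on the parsed document.
  agent = WWW::Mechanize.new
  agent.user_agent_alias = 'Mac Safari'
  page  = agent.get('http://en.wikipedia.org/wiki/Ruby_on_Rails')
  body  = page.at('div#bodyContent')
  puts body.inner_html if body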

Shane

http://shanesbrain.net
Andy Triboletti (Guest)
on 2007-04-12 23:21
(Received via mailing list)
If you just need to cache some pages for displaying later, screen
scraping Wikipedia is a good choice compared to downloading the db.
But if you're going to be parsing and redisplaying the content in
real time, that is against Wikipedia's policy.

See
http://en.wikipedia.org/wiki/Wikipedia:Database_download#Why_not_just_retrieve_data_from_wikipedia.org_at_runtime.3F
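For the cache-and-redisplay-later case, even something this simple
would do (a hypothetical helper, just to illustrate fetching a page
once and reusing the saved copy instead of hitting Wikipedia on every
request):

  require 'open-uri'
  require 'fileutils'
  require 'digest/md5'

  CACHE_DIR = 'tmp/wikipedia_cache'

  # Fetch the URL the first time it is asked for, then serve the copy
  # saved on disk for all later requests.
  def cached_article(url)
    FileUtils.mkdir_p(CACHE_DIR)
    path = File.join(CACHE_DIR, Digest::MD5.hexdigest(url) + '.html')
    unless File.exist?(path)
      File.open(path, 'w') { |f| f.write(open(url).read) }
    end
    File.read(path)
  end

  html = cached_article('http://en.wikipedia.org/wiki/Ruby_(programming_language)')
  puts html.length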