I'm new to Ruby and need to parse some web pages. I googled "ruby HTML parser" and found several parsers available. They all seem good, and I'm wondering which one is best for me, since I'll have to deal with many pages in different encodings, such as UTF-8, GB2312 and GBK (for Chinese). So please help me. Thanks.
on 2007-05-31 09:28
on 2007-05-31 09:48
Hpricot is a good starting point.
on 2007-05-31 11:27
Dick Davies wrote:
> Hpricot is a good starting point.

OK, I got it. Thanks a lot.
on 2007-05-31 11:37
On 5/31/07, Dick Davies <firstname.lastname@example.org> wrote:
> Hpricot is a good starting point.

Yeah, Hpricot is good, but in general the quality of the Ruby web scraping choices is pretty impressive. There are variants that are built on top of Hpricot but provide an even simpler API.

However, your second problem, dealing with alternate encodings, is a bit trickier. To do any kind of real work with multiple code pages, you want to convert the content to Unicode (UTF-8) at fetch time. This isn't Ruby's strong point (which is not the same thing as saying it can't do it). But there are several options here; one is to run your code on JRuby (Java) just for the seamless Unicode/code page support. Hpricot has been ported to JRuby, for instance. I would take a good look at which Ruby libraries enable explicit code page conversions.
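As a rough illustration of "convert to UTF-8 at fetch time", here is a minimal sketch. It assumes a newer Ruby (1.9 or later) with the built-in `String#encode`, which is not what was available when this thread was written, and it assumes you already know the page's source encoding (for example from the HTTP headers or a meta tag). The helper name `to_utf8` and the sample bytes are made up for the example.

```ruby
# Hypothetical helper (not from any library): re-tag raw bytes with their
# known source encoding, then transcode them to UTF-8.
# Assumes Ruby 1.9+ where String#force_encoding and String#encode exist.
def to_utf8(raw_bytes, source_encoding)
  raw_bytes.dup
           .force_encoding(source_encoding)
           .encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
end

# Sample input: the two Chinese characters for "Chinese (language)" as GBK bytes.
gbk_bytes = "\xD6\xD0\xCE\xC4".b

utf8_text = to_utf8(gbk_bytes, "GBK")
puts utf8_text
```

Once everything is UTF-8, the string can be handed to whichever parser you pick. Bytes that cannot be converted are replaced with `?` here rather than raising an error; whether that is acceptable depends on how much you care about damaged pages.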
on 2007-06-01 03:11
I like Rubyful Soup. It's very simple to use, although constructing the object from HTML is a bit slower than I'd like.
on 2007-06-01 08:15
On 2007-05-31 02:36:57 -0700, "Richard Conroy" <email@example.com> said: > I've had great success with this. Just make sure you're using a recent version of Ruby (1.8.5 or later, which includes the NKF library) and you should be fine.
on 2007-06-02 07:12
Thank you all for your help.
on 2007-06-02 21:35
I've used Hpricot, and really like it.