Forum: Ruby HTML Parser: Which one is better?

ZHANG Y. (Guest)
on 2007-05-31 11:28
I'm new to Ruby and need to parse some web pages. I googled "ruby HTML
parser" and found several parsers available. They all seem good, and
I'm wondering which one is best for me, since I'll have to deal with
many pages in different encodings, such as UTF-8, GB2312 and GBK (for
Chinese). So please help me. Thanks.
Dick D. (Guest)
on 2007-05-31 11:48
(Received via mailing list)
Hpricot is a good starting point.
ZHANG Y. (Guest)
on 2007-05-31 13:27
Dick D. wrote:
> Hpricot is a good starting point.

OK, I got it. Thanks a lot.
Richard C. (Guest)
on 2007-05-31 13:37
(Received via mailing list)
On 5/31/07, Dick D. <removed_email_address@domain.invalid> wrote:
> Hpricot is a good starting point.

Yeah Hpricot is good, but in general the quality of the Ruby web
scraping choices is pretty impressive. There are variants that are just
built on top of Hpricot but provide an even simpler API.

However, your second problem, handling alternate encodings, is a bit
trickier. To do any kind of real work with multiple code pages, you
want to convert everything to Unicode (UTF-8) at fetch time.

This isn't Ruby's strong point (which is not the same thing as saying
it can't do it). But there are multiple choices here. One is running
your code on JRuby (Ruby on the JVM) just for the seamless Unicode/code
page support; Hpricot is ported to JRuby, for instance. I would also
have a good look at which Ruby libraries enable explicit code page
conversions.
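
To make "convert at fetch time" concrete, here is a minimal sketch. It
assumes a Ruby with built-in string encodings (1.9 or later), which is
newer than the 1.8 series discussed in this thread; on 1.8 the usual
route was the Iconv library. The helper name to_utf8 and its arguments
are made up for illustration, and the charset label is assumed to come
from the page's Content-Type header or a meta tag.

```ruby
# Hypothetical helper: transcode raw page bytes to UTF-8.
# raw     - the response body, exactly as fetched
# charset - the encoding label the page declares, e.g. "GBK" or "GB2312"
def to_utf8(raw, charset)
  source = begin
    Encoding.find(charset)
  rescue ArgumentError
    Encoding::UTF_8 # unknown label: assume the page is already UTF-8
  end

  # dup leaves the caller's string untouched; force_encoding only relabels
  # the bytes, encode does the actual conversion.
  text = raw.dup.force_encoding(source)
  text = text.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "?")
  text.scrub("?") # also replaces bad bytes when the input was labelled UTF-8
end

# The two characters of "Chinese language" as GBK bytes.
gbk_bytes = [0xD6, 0xD0, 0xCE, 0xC4].pack("C*")
page_text = to_utf8(gbk_bytes, "GBK")
puts page_text.encoding
```

Once every page goes through something like this right after the fetch,
the parser only ever sees UTF-8, whichever parser you pick.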
Sy Y. (Guest)
on 2007-06-01 05:11
I like Rubyful Soup. It's very simple to use, although constructing the
object from HTML is a bit slower than I'd like.
Erik H. (Guest)
on 2007-06-01 10:15
(Received via mailing list)
On 2007-05-31 02:36:57 -0700, "Richard C."
<removed_email_address@domain.invalid> said:

>
I've had great success with this. Just make sure you're using Ruby
1.8.5 or later (which includes the NKF library) and you should be fine.
ZHANG Y. (Guest)
on 2007-06-02 09:12
Thank you all for your help.
Jerry Blanco (Guest)
on 2007-06-02 23:35
(Received via mailing list)
I've used Hpricot, and really like it.