Htmltokenizer bug?

hsanson · November 28, 2005, 10:33am

I am using htmltokenizer to extract the links from some web pages. My
script worked perfectly until I started to parse pages with “<” and “>”
characters in the text.

An HTML string like this

this is a

causes htmlparser to raise an exception: Error, tag is nil…

Is there a patch or any way to make htmlparser parse this text?

regards,
Horacio

hsanson · November 28, 2005, 11:30am

On 28/11/05, Horacio S. wrote:

Is there a patch or any way to make htmlparser parse this text?

I think most browsers would choke on that.

Have you tried using entities instead?

(&lt; instead of < and &gt; instead of >)
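For reference, here is a minimal sketch of what that escaping looks like in Ruby, assuming you control the text before it is written into the page. CGI.escapeHTML is part of the standard library; the helper name escape_for_html and the sample string are made up for illustration.

```ruby
require 'cgi'

# Escape text so literal "<" and ">" survive inside an HTML page.
# (escape_for_html is a hypothetical helper name, not part of any library.)
def escape_for_html(text)
  CGI.escapeHTML(text)
end

# Hypothetical text containing literal angle brackets.
puts escape_for_html('if a < b and b > c')
```

This only helps when you generate the HTML yourself; it does nothing for pages fetched from elsewhere.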

hsanson · November 28, 2005, 1:56pm

Horacio S. wrote:

Is there a patch or any way to make htmlparser parse this text?

regards,
Horacio

Your HTML isn’t valid. Either use the proper entities (&lt; = < and
&gt; = >) or make a CDATA section, though the latter isn’t really that
well supported in most browsers.

<![CDATA[this is a ]]>

Cheers,
Daniel

hsanson · December 2, 2005, 2:18am

Sorry for the late reply.

I’m surprised no one mentioned RubyfulSoup:

If I understand your problem correctly, it’s exactly what you need: a
forgiving HTML parser.

Dan

hsanson · December 2, 2005, 3:30am

Daniel A. wrote:

Sorry for the late reply.

I’m surprised no one mentioned RubyfulSoup:

Rubyful Soup: "The brush has got entangled in it!"

If I understand your problem correctly, it’s exactly what you need: a
forgiving HTML parser.

I recently tried using RubyfulSoup to parse a Web page, and it had some
peculiar behavior, such as stripping all attributes. Either I was not
using it correctly, or it was a bit too casual in making sense of the
input.

I ended up using some crude string parsing to extract just the subset of
the page I wanted, which gave me well-formed XML suitable for REXML
manipulation. I got a phenomenal speed increase from that as well;
RubyfulSoup seems quite slow.
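A rough sketch of that kind of approach, not the actual script: cut the interesting part out of the page with plain string searches, then hand only that fragment to REXML. The marker strings, the sample page, and the helper names slice_between and hrefs_in are all assumptions for illustration, and it only works if the fragment itself is well-formed.

```ruby
require 'rexml/document'

# Cut out the text between two marker strings (markers included).
# Returns nil when either marker is missing.
def slice_between(page, start_marker, end_marker)
  from = page.index(start_marker)
  return nil unless from
  to = page.index(end_marker, from)
  return nil unless to
  page[from...(to + end_marker.length)]
end

# Parse a well-formed fragment and collect the href of every <a> element.
def hrefs_in(fragment)
  doc = REXML::Document.new(fragment)
  REXML::XPath.match(doc, '//a').map { |a| a.attributes['href'] }.compact
end

# Hypothetical page: the stray "<" and ">" sit outside the part we need.
page = <<~HTML
  <html><body>
  <p>this is a < b > c mess</p>
  <ul id="links">
    <li><a href="http://example.com/one">One</a></li>
    <li><a href="http://example.com/two">Two</a></li>
  </ul>
  </body></html>
HTML

fragment = slice_between(page, '<ul id="links">', '</ul>')
puts hrefs_in(fragment)
```

The rest of the page never reaches the XML parser, so invalid markup outside the fragment does not matter.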

James

hsanson · November 28, 2005, 2:44pm

Well, the problem is that this HTML is not mine; I am retrieving the
pages from the Internet.

I guess I will skip this page in my script.
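Something along these lines is what skipping would look like, as a sketch only. The block passed to links_per_page stands in for the real htmltokenizer-based extraction, which is not shown; naive_links is a made-up stand-in that fails on a stray "<" so the skip path can be seen, and the URLs are placeholders.

```ruby
# Run an extractor block over each page, skipping pages it cannot parse.
# `pages` maps a URL to its HTML text.
def links_per_page(pages)
  results = {}
  pages.each do |url, html|
    begin
      results[url] = yield(html)
    rescue StandardError => e
      # Report the failure and move on to the next page.
      warn "skipping #{url}: #{e.message}"
    end
  end
  results
end

# Hypothetical stand-in extractor: raises when "<" is followed by
# whitespace, otherwise returns the value of every href="..." attribute.
def naive_links(html)
  raise 'tag is nil' if html =~ /<\s/
  html.scan(/href="([^"]*)"/).flatten
end

pages = {
  'http://example.com/good' => '<a href="/one">One</a>',
  'http://example.com/bad'  => '<p>1 < 2</p><a href="/two">Two</a>'
}

found = links_per_page(pages) { |html| naive_links(html) }
p found
```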

thanks,
Horacio

On Monday 28 November 2005 21:52, Daniel S. wrote: