I am using htmltokenizer to extract the links from some web pages. My
script worked perfectly until I started parsing pages with “<” and “>”
characters in the text.
Is there a patch, or any way to make the parser handle this text?
regards,
Horacio
Your HTML isn’t valid. Either use the proper entities (&lt; for < and
&gt; for >) or make a CDATA section, though the latter isn’t really that
well-supported in most browsers.
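If you generate the pages yourself, one way to get the entities right is to escape the text before it goes into the markup. Here is a minimal sketch using CGI.escapeHTML from Ruby's standard library; the sample string is made up for illustration.

```ruby
require 'cgi'

# Hypothetical page text containing literal angle brackets.
raw_text = 'if a < b and b > c'

# escapeHTML replaces &, <, >, " and ' with their entities.
escaped = CGI.escapeHTML(raw_text)

puts escaped  # prints: if a &lt; b and b &gt; c
```

This only helps when you control the HTML being produced; it does nothing for pages someone else wrote.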
If I understand your problem correctly, it’s exactly what you need: a
forgiving HTML parser.
I recently tried using RubyfulSoup to parse a Web page, and it had some
peculiar behavior, such as stripping all attributes. Either I was not
using it correctly, or it was a bit too casual in making sense of the
input.
I ended up using some crude string parsing to extract just the subset of
the page I wanted, which gave me well-formed XML suitable for REXML
manipulation. I got a phenomenal speed increase from that as well;
RubyfulSoup seems quite slow.
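For anyone curious, the approach looks roughly like the sketch below. This is not my actual script: the page content, the `<ul id="links">` marker, and the variable names are all invented for illustration, and it assumes the part you want is itself well-formed even if the rest of the page is not.

```ruby
require 'rexml/document'

# Hypothetical page: the surrounding markup is sloppy (unclosed <p>,
# stray "<"), but the list we care about is well-formed.
page = <<~HTML
  <html><body><p>Unclosed paragraph with a stray < character
  <ul id="links">
    <li><a href="http://example.com/a">A</a></li>
    <li><a href="http://example.com/b">B</a></li>
  </ul>
  </body></html>
HTML

# Crude string slicing: keep everything from the opening marker
# through the first closing tag that follows it.
start_marker = '<ul id="links">'
end_marker   = '</ul>'
start_index  = page.index(start_marker)
raise 'start marker not found' unless start_index
end_index = page.index(end_marker, start_index)
raise 'end marker not found' unless end_index
fragment = page[start_index...(end_index + end_marker.length)]

# The fragment is well-formed XML, so REXML can parse it.
doc   = REXML::Document.new(fragment)
hrefs = REXML::XPath.match(doc, '//a').map { |a| a.attributes['href'] }

puts hrefs
```

It breaks as soon as the page layout changes or the extracted part stops being well-formed, so it is only worth doing for a page whose structure you know.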