Forum: Ruby htmltokenizer bug?

hsanson (Guest)
on 2005-11-28 10:33
(Received via mailing list)
I am using htmltokenizer to extract the links from some web pages. My script
worked perfectly until I started parsing pages with "<" and ">" characters in
the text.

An HTML string like this

<a href="an_uri" > this is a <link> </a>

causes the htmlparser to raise an exception: Error, tag is nil....


Is there a patch, or any way to make htmlparser parse this text?


regards,
Horacio
rasputnik (Guest)
on 2005-11-28 11:30
(Received via mailing list)
On 28/11/05, Horacio Sanson <hsanson@moegi.waseda.jp> wrote:
>
>
> Is there a patch or any way to make htmlparser to parse this text??

I think most *browsers* would choke on that :)

Have you tried using entities instead?

( &lt; instead of < and &gt; instead of >)
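
For anyone generating such markup themselves, here is a minimal sketch of the entity approach using Ruby's standard library. The variable names are only illustrative, and this helps only when you control the HTML being produced.

```ruby
require 'cgi'

# Escape the link text so "<" and ">" become entities before the text is
# placed inside the anchor element.
text    = 'this is a <link>'
escaped = CGI.escapeHTML(text)

html = %(<a href="an_uri">#{escaped}</a>)
puts html
```

The resulting string contains `&lt;link&gt;` inside the anchor, so a strict tokenizer sees only one opening and one closing tag.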
Daniel Schierbeck (dasch)
on 2005-11-28 13:56
(Received via mailing list)
Horacio Sanson wrote:
>
> Is there a patch or any way to make htmlparser to parse this text??
>
>
> regards,
> Horacio
>
>

Your HTML isn't valid. Either use the proper entities (< = &lt; and > =
&gt;) or make a CDATA section, though the latter isn't really that
well-supported in most browsers.

   <a href="an_uri"><![CDATA[this is a <link>]]></a>


Cheers,
Daniel
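
A small sketch of how a strict XML parser treats the two forms, using REXML as a stand-in (REXML is an assumption here; the thread's htmltokenizer may behave differently). The CDATA version parses and yields the literal text, while the raw version is rejected.

```ruby
require 'rexml/document'

# With a CDATA section, "<link>" is plain character data, not a tag.
doc = REXML::Document.new('<a href="an_uri"><![CDATA[this is a <link>]]></a>')
link_text = doc.root.text
href      = doc.root.attributes['href']

# Without it, the parser sees an unclosed <link> element and gives up.
begin
  REXML::Document.new('<a href="an_uri" > this is a <link> </a>')
  raw_parsed = true
rescue REXML::ParseException
  raw_parsed = false
end

puts link_text
puts href
puts raw_parsed
```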
hsanson (Guest)
on 2005-11-28 14:44
(Received via mailing list)
Well, the problem is that this HTML is not mine; I am retrieving the pages from
the Internet.

I guess I will skip this page in my script.

thanks,
Horacio
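
Skipping the offending pages could be sketched roughly as follows. `links_per_page` and the stand-in `strict` extractor are hypothetical names invented for this example; the stand-in only imitates a parser that raises on unexpected tags and is not htmltokenizer's API.

```ruby
# Hypothetical helper: run a link extractor over each page and skip any page
# on which the extractor raises, recording which pages were skipped.
def links_per_page(pages, extractor)
  results = {}
  skipped = []
  pages.each do |url, html|
    begin
      results[url] = extractor.call(html)
    rescue StandardError => e
      warn "skipping #{url}: #{e.class}: #{e.message}"
      skipped << url
    end
  end
  [results, skipped]
end

# Stand-in for a strict parser (an assumption, for demonstration only).
strict = lambda do |html|
  raise ArgumentError, 'tag is nil' if html.include?('<link>')
  html.scan(/<a\s[^>]*href="([^"]*)"/i).flatten
end

pages = {
  'good' => '<a href="one.html">one</a>',
  'bad'  => '<a href="an_uri" > this is a <link> </a>'
}

results, skipped = links_per_page(pages, strict)
p results
p skipped
```

Rescuing per page keeps one malformed document from aborting the whole crawl, at the cost of losing that page's links.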

On Monday 28 November 2005 21:52, Daniel Schierbeck wrote:
daniel.amelang (Guest)
on 2005-12-02 02:18
(Received via mailing list)
Sorry for the late reply.

I'm surprised no one mentioned RubyfulSoup:

http://www.crummy.com/software/RubyfulSoup/

If I understand your problem correctly, it's exactly what you need: a
forgiving html parser.
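
To illustrate what "forgiving" can mean for this specific task, here is a rough standard-library sketch. It is not RubyfulSoup's API; `ANCHOR` and `extract_links` are hypothetical names, and a regular expression like this handles only simple, non-nested anchors.

```ruby
# Hypothetical helper, not part of htmltokenizer or RubyfulSoup.
# Collects [href, text] pairs for each <a ...>...</a> element without building
# a tree, so a stray "<" or ">" inside the link text does not cause an error.
ANCHOR = %r{<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1[^>]*>(.*?)</a\s*>}im

def extract_links(html)
  html.scan(ANCHOR).map { |_quote, href, text| [href, text.strip] }
end

links = extract_links('<a href="an_uri" > this is a <link> </a>')
p links
```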

Dan
james_b (Guest)
on 2005-12-02 03:30
(Received via mailing list)
Daniel Amelang wrote:
> Sorry for the late reply.
>
> I'm surprised no one mentioned RubyfulSoup:
>
> http://www.crummy.com/software/RubyfulSoup/
>
> If I understand your problem correctly, it's exactly what you need: a
> forgiving html parser.


I recently tried using RubyfulSoup to parse a Web page, and it had some
peculiar behavior, such as stripping all attributes.  Either I was not
using it correctly, or it was a bit too casual in making sense of the
input.

I ended up using some crude string parsing to extract just the subset of
the page I wanted, which gave me well-formed XML suitable for REXML
manipulation.   I got a phenomenal speed increase from that as well;
RubyfulSoup seems quite slow.
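
A minimal sketch of that kind of approach, assuming the interesting part of the page sits between known markers and is itself well-formed. The sample page and the `<ul id="links">` marker are made up for illustration.

```ruby
require 'rexml/document'

# A page that is not well-formed as a whole (stray "<", unclosed <br>),
# but whose link list is.
page = <<~HTML
  <html><body>
  <p>stray < and > characters, unclosed <br> tags
  <ul id="links">
    <li><a href="one.html">One</a></li>
    <li><a href="two.html">Two</a></li>
  </ul>
  </body></html>
HTML

# Crude string slicing: cut out just the well-formed subset.
open_tag  = '<ul id="links">'
close_tag = '</ul>'
start  = page.index(open_tag) or raise 'start marker not found'
finish = page.index(close_tag, start) or raise 'end marker not found'
fragment = page[start...(finish + close_tag.length)]

# The fragment is valid XML, so REXML can take it from here.
doc   = REXML::Document.new(fragment)
hrefs = REXML::XPath.match(doc, '//a').map { |a| a.attributes['href'] }
p hrefs
```

Only the sliced fragment has to be well-formed, which is why the stray characters elsewhere on the page do not matter.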


James
--

http://www.ruby-doc.org       - Ruby Help & Documentation
http://www.artima.com/rubycs/ - Ruby Code & Style: Writers wanted
http://www.rubystuff.com      - The Ruby Store for Ruby Stuff
http://www.jamesbritt.com     - Playing with Better Toys
http://www.30secondrule.com   - Building Better Tools