Looking for an HTML parser

nuno · September 1, 2006, 5:27pm

Hello, I’m looking for an HTML parser that can handle badly formed input
(unclosed tags).

There’s a pretty good HTML parser in RoR ActionPack, but it doesn’t
handle badly formed documents.

Thanks

nuno · September 1, 2006, 6:06pm

Hey nuno,
Urm, okay, call me stupid Ishmael, but why not merely subclass the
current htmlparser and then, whenever you get a ‘bad tag’, do whatever you
want to do with it? I dare say that if someone passes me a badly formed
document, I -want- them to see an error; however, whatever -you- decide
to do with it is up to (well) -you-. If you want to try and ‘fix’ certain
errors in a bad document, that’s surely down to ‘you’.

You may get lucky and someone may have already trod this path, but
surely in the case of ‘bad data’ you’re not best placed to say what’s
‘valid’ and what’s not. Surely that’s something only the originating user
can do. Mean to say, you can deal with things like a missing ‘>’ fairly
simply, but what about character transposition? inptu instead of input,
or character addition instead of …

I think the -sanest- thing a parser can do is raise an error on
badly formed input. Perhaps not the answer you want, and I look forward to
being proved ‘wrong’ but, well, polite shrug, there’s my 2c ;p
Regards
Stef

nuno · September 1, 2006, 6:07pm

nuno wrote:

Hello, I’m looking for an HTML parser that can handle badly formed input
(unclosed tags).

There’s a pretty good HTML parser in RoR ActionPack, but it doesn’t
handle badly formed documents.

Thanks

Try scrapi:

http://blog.labnotes.org/2006/07/11/scraping-with-style-scrapi-toolkit-for-ruby/

Also, you can use HTMLTidy to clean it up.

Personally, I use rubyful_soup but that’s because I had already
implemented it before finding out about scrapi.

Regards,

Michael

nuno · September 1, 2006, 6:28pm

Hi,

nuno writes:

Hello, I’m looking for an HTML parser that can handle badly formed input
(unclosed tags).

Did you try this one?

http://mechanize.rubyforge.org/

–
\ / http://www.hashbang.de
/lad http://www.1-cat.de

nuno · September 1, 2006, 6:09pm

“I dare say that if someone passes me a badly formed
document, I -want- them to see an error; however, whatever -you- decide
to do with it is up to (well) -you-. If you want to try and ‘fix’ certain
errors in a bad document, that’s surely down to ‘you’.”

Usually when you are scraping, you don’t have control over the
content, so you have to take what is given to you and do the best you can
with it. I believe HTMLTidy will clean up malformed documents.

Regards,

Michael

nuno · September 1, 2006, 6:36pm

nuno wrote:

Hello, I’m looking for an HTML parser that
can handle badly formed input (unclosed tags).

HTML Tidy might be what you’re looking for.

http://www.google.com/search?hl=en&sa=X&oi=spell&resnum=0&ct=result&cd=1&q=html+tidy&spell=1

hth,
Bill

nuno · September 1, 2006, 6:38pm

Hello Michael,
Whereas I agree with you in regards to the whole ‘you can’t control
someone else’s webpage when they don’t conform to the standard’, I do
think that if you’re scraping a webpage, you don’t really want to fling it
into an HTMLParser anyway. Surely it’s much quicker to treat the HTML as
a ‘string’ and then regex out what you need?

Of course, this is probably either my Perl background, rampant
pragmatism or bad programming showing … but … whenever I have wanted
to check the ‘well formed-ness’ of a document, it’s almost always been
‘uploaded’ to the system I am using. So that’s where I base my whole
‘fling an error on error’ practice from. So, in essence, I guess it
depends what the user is using the HTMLParser ‘for’.

Regards
Stef

nuno · September 1, 2006, 6:50pm

nuno <rails-mailing-list@…> writes:

Hello, I’m looking for an HTML parser that can handle badly formed input
(unclosed tags).

There’s a pretty good HTML parser in RoR ActionPack, but it doesn’t
handle badly formed documents.

Thanks

Just a technical point: unclosed tags are not badly formed in HTML;
they are exactly the right way to do things in HTML. HTML is not supposed
to be an XML-based language, and self-closing tags are invalid.

That said, I agree with the person who said it’s better to just treat it
as one long string and regex it.
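
A minimal sketch of the "treat it as one long string and regex it" idea, in plain Ruby with no parser library. The sample markup and the `extract_hrefs` helper are made up for illustration; nothing here comes from the posters. It copes with the unclosed tags only because it never looks at document structure at all, which is also its limitation.

```ruby
# Hypothetical helper (not from the thread): collect every href value from
# anchor tags, accepting single- or double-quoted attribute values.
# This is pattern matching on text, not parsing: it knows nothing about nesting.
def extract_hrefs(html)
  html.scan(/<a\b[^>]*\bhref\s*=\s*["']([^"']*)["']/i).flatten
end

# Made-up sample input with unclosed <li>, <a> and <p> tags.
html = <<~HTML
  <ul>
    <li><a href="/one">First
    <li><a href='/two'>Second</a>
  <p>Trailing text
HTML

p extract_hrefs(html)
```

This works for pulling out flat values such as links, but it will not tell you which list or paragraph a link belongs to, and unquoted attribute values would need a different pattern.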

nuno · September 2, 2006, 11:32am

Thanks for your answers! scrapi seems to be all I need …

nuno · September 4, 2006, 2:43pm

On Sep 1, 2006, at 12:39 PM, Gareth A. wrote:

exactly the right way to do things in HTML. HTML is not supposed
to be an XML-based language, and self-closing tags are invalid.

Consider:

a
b
c
d
- e
- f
- g
- h
(BTW, the OP said ‘unclosed tags’, not ‘self-closing tags’ (by which I
think you mean empty tags).)

More importantly, this illustrates an ambiguity that makes dealing
with ill-formed HTML difficult, even with a regex. What was meant: a
nested list or two separate lists? Indentation suggests one thing,
but a peek in a browser another. But surely the author looked at the
page in the browser and saw that it was okay. Right, surely. But with
a little CSS, who knows what was seen.

Tools like Tidy will turn that example into:
- a
- b
- c
- d
  - e
  - f
  - g
  - h
which is probably how a browser would interpret it. Some of the other
tools will do something similar when parsing it.

Cheers,
Bob

Bob H. – blogs at <http://www.recursive.ca/hutch/>
Recursive Design Inc. – http://www.recursive.ca/
Raconteur – http://www.raconteur.info/
  xampl for Ruby – http://rubyforge.org/projects/xampl/
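
To make the nesting point above concrete, here is a rough Ruby sketch of how a tolerant tool might supply the missing `</li>` end tags. It is not how Tidy or any browser is actually implemented, and the input markup is an assumption, since the original example did not survive in this thread. The `close_list_items` helper is hypothetical and only understands `<ul>`, `<li>` and plain text.

```ruby
# Hypothetical helper (an illustration, not Tidy's algorithm): adds the end
# tags that were left out for <li>. Only <ul>, <li> and plain text are
# handled; attributes, comments and other elements are out of scope.
def close_list_items(html)
  out = []
  stack = [] # names of currently open elements, innermost last

  html.scan(%r{<(/?)(ul|li)\s*>|([^<]+)}i) do |slash, name, text|
    if text
      out << text.strip unless text.strip.empty?
      next
    end

    name = name.downcase
    if slash.empty?
      # A new <li> implicitly ends an <li> that is still open in this list.
      if name == "li" && stack.last == "li"
        out << "</li>"
        stack.pop
      end
      out << "<#{name}>"
      stack << name
    elsif stack.include?(name)
      # An end tag closes everything opened after its matching start tag.
      # End tags with no matching start tag are ignored.
      until stack.empty?
        opened = stack.pop
        out << "</#{opened}>"
        break if opened == name
      end
    end
  end

  # Close whatever is still open at the end of the input.
  out << "</#{stack.pop}>" until stack.empty?
  out.join
end

# Assumed input: a second <ul> opened while item "d" is still unclosed.
puts close_list_items("<ul><li>a<li>b<li>c<li>d<ul><li>e<li>f<li>g<li>h</ul></ul>")
```

With this rule the inner list ends up nested inside item d, because d was still open when the second `<ul>` started. Whether that is what the author meant is exactly the ambiguity described above.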

nuno · September 4, 2006, 5:13pm

Google for Rubyful Soup - it’s a port (by the original author) of the
excellent Python parser “Beautiful Soup”, which is explicitly designed
to deal with messy, badly-formed, awkward HTML - i.e., the real-world
examples of it.

nuno · September 4, 2006, 2:44pm

Why the Lucky Stiff has a great parser, hpricot
http://code.whytheluckystiff.net/hpricot/

If you need to follow links or fill out forms as well, the trunk
of mechanize can use hpricot as its parser. Deadly combo!

joshua