Using Scrubyt on bad markup pages


#1

I am having trouble scrubbing a page that has bad markup. After
fetching the page, the Scrubyt::Extractor exits while parsing the
document. The Apple Safari web inspector shows numerous errors from the
page:

is not allowed inside . Moving into the .
Unmatched encountered. Ignoring tag.
Unmatched encountered. Ignoring tag.
Unmatched encountered. Ignoring tag.

Is there any way to scrub a page with scrubyt that is poorly formatted? I
am using the latest version (0.4.1) of scrubyt.

Thanks,
Rolin


#2

On Apr 27, 2009, at 23:39 , Rolin Nelson wrote:

Is there any way to scrub a page with scrubyt that is poorly
formatted? I am using the latest version (0.4.1) of scrubyt.

Switch to mechanize and update your gems. scrubyt depends on hpricot
and a very old version of mechanize. Mechanize now uses nokogiri
instead of hpricot and is much more resilient with errors.


#3

Ryan D. wrote:

On Apr 27, 2009, at 23:39 , Rolin Nelson wrote:

Is there any way to scrub a page with scrubyt that is poorly
formatted? I am using the latest version (0.4.1) of scrubyt.

Switch to mechanize and update your gems. scrubyt depends on hpricot
and a very old version of mechanize. Mechanize now uses nokogiri
instead of hpricot and is much more resilient with errors.

Thank you, I will try to use Mechanize directly. However, when I
installed scrubyt 0.4.1 it did appear to have a dependency on nokogiri.
I’ve cut and pasted the standard output.

$ sudo gem install scrubyt-0.4.1.gem
Password:
Building native extensions. This could take a while…
Successfully installed scrubyt-0.4.1
Successfully installed nokogiri-1.2.3
2 gems installed
Installing ri documentation for scrubyt-0.4.1…
Installing ri documentation for nokogiri-1.2.3…
Installing RDoc documentation for scrubyt-0.4.1…
Installing RDoc documentation for nokogiri-1.2.3…