I am having trouble scrubbing a page that has bad markup. After
fetching the page, the Scrubyt::Extractor exits while parsing the
document. The Apple Safari web inspector shows numerous errors from the
page:
is not allowed inside
. Moving into the .
Unmatched encountered. Ignoring tag.
Unmatched encountered. Ignoring tag.
Unmatched encountered. Ignoring tag.
Is there any way to scrub a page with scrubyt that is poorly formatted? I
am using the latest version (0.4.1) of scrubyt.
Thanks,
Rolin
On Apr 27, 2009, at 23:39 , Rolin Nelson wrote:
Is there any way to scrub a page with scrubyt that is poorly formatted? I
am using the latest version (0.4.1) of scrubyt.
switch to mechanize and update your gems. scrubyt depends on hpricot
and a very old version of mechanize. Mechanize now uses nokogiri
instead of hpricot and is much more resilient with errors.
Ryan D. wrote:
switch to mechanize and update your gems. scrubyt depends on hpricot
and a very old version of mechanize. Mechanize now uses nokogiri
instead of hpricot and is much more resilient with errors.
Thank you, I will try using Mechanize directly. However, when I
installed scrubyt 0.4.1, it did appear to have a dependency on nokogiri.
I've pasted the install output below.
$ sudo gem install scrubyt-0.4.11.gem
Password:
Building native extensions. This could take a while...
Successfully installed scrubyt-0.4.1
Successfully installed nokogiri-1.2.3
2 gems installed
Installing ri documentation for scrubyt-0.4.1...
Installing ri documentation for nokogiri-1.2.3...
Installing RDoc documentation for scrubyt-0.4.1...
Installing RDoc documentation for nokogiri-1.2.3...
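One way to double-check what an installed gem actually declares, rather than inferring it from the install output, is to ask RubyGems directly. This is only a sketch: the helper name `runtime_dependencies` is made up for illustration, and what it prints depends entirely on which gems are installed locally.

```ruby
require "rubygems"

# Hypothetical helper: list the runtime dependencies declared by every
# installed version of the named gem. Returns an empty array when the
# gem is not installed.
def runtime_dependencies(gem_name)
  Gem::Specification.find_all_by_name(gem_name).flat_map do |spec|
    spec.runtime_dependencies.map do |dep|
      "#{spec.full_name} -> #{dep.name} (#{dep.requirement})"
    end
  end
end

# Print whatever the locally installed scrubyt versions declare, if any.
runtime_dependencies("scrubyt").each { |line| puts line }
```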