Mechanize get method returns both ::File and ::Page

Hi,

I am having some problems with WWW::Mechanize. When I use the get(url)
method, it unpredictably returns either a WWW::Mechanize::File or a
WWW::Mechanize::Page. Since it's an HTML page that I am downloading, I
need it to always return a Page, not a File. The content type for this
page is "text/plain", which I think is part of the problem.

I am looking for a way to guarantee that the method returns a Page, a
way to build a Page from a File, or a way to cast a File to a Page.

Thanks,

Carl

http://www.gaihosa.com

Carl B. wrote:

Hi,

I am having some problems with WWW::Mechanize. When I use the get(url)
method, it unpredictably returns either a WWW::Mechanize::File or a
WWW::Mechanize::Page.

Since it's an HTML page that I am downloading, I need it to always
return a Page, not a File. The content type for this page is
"text/plain", which I think is part of the problem.

Class WWW::Mechanize::Page
Synopsis
This class encapsulates an HTML page. If Mechanize finds a content type
of 'text/html', this class will be instantiated and returned.

Presumably that means if the content type is not text/html, then a
Page will not be returned. That makes sense since the synopsis says
that a Page encapsulates an HTML page.

WWW::Mechanize::File
If Mechanize cannot find an appropriate class to use for the content
type, this class will be used. For example, if you download a JPG,
Mechanize will not know how to parse it, so this class will be
instantiated.

Since Mechanize is used to parse forms and HTML, that makes sense: if
you don't have an HTML page (i.e. one with Content-Type: text/html),
then you can't parse it as HTML.
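In other words, from the caller's side the dispatch looks roughly like
this (a sketch; the agent and URL are placeholders, not from your post):

  agent = WWW::Mechanize.new
  result = agent.get('http://www.example.com/menus.htm')
  puts result.class
  # => WWW::Mechanize::Page when the server sent Content-Type: text/html
  # => WWW::Mechanize::File when no registered parser matched the type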

The content type for this page is "text/plain", which I think is part
of the problem.

A page with a content type of 'text/plain' is telling you that the page
is not HTML. Are you saying that the page is actually HTML even though
the server says that it does not contain HTML?

You have to set the pluggable parser for text/plain to the HTML parser.
Here's a hack (you'll have to call get_html instead of get):

class WWW::Mechanize
  # Fetch a URL, temporarily routing text/plain responses through the
  # HTML parser so get always returns a WWW::Mechanize::Page.
  def get_html(url)
    old_parser = @pluggable_parser['text/plain']
    @pluggable_parser['text/plain'] = @pluggable_parser['text/html']
    body = get(url)
    @pluggable_parser['text/plain'] = old_parser
    body
  end
end
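With that patch applied, agent.get_html(url) should always hand back a
WWW::Mechanize::Page, even when the server labels the page text/plain.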

And the other way around (always get a File, even for HTML):

class WWW::Mechanize
  # Fetch a URL, temporarily treating text/html responses as plain
  # files, and return the raw body string.
  def get_file(url)
    old_parser = @pluggable_parser['text/html']
    @pluggable_parser['text/html'] = ::WWW::Mechanize::File
    body = get(url).body
    @pluggable_parser['text/html'] = old_parser
    body
  end
end
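And with this one, agent.get_file(url) hands back the raw body string
even for pages served as text/html.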

The content type for this page is "text/plain", which I think is part
of the problem.

A page with a content type of 'text/plain' is telling you that the page
is not HTML. Are you saying that the page is actually HTML even though
the server says that it does not contain HTML?

The page is HTML. Below, I have included the log. It shows the page's
content type as "text/html" for the first few attempts and as
"text/plain" on the last one. All I need to know is how to get a
Page instead of a File, either by extending Mechanize, by creating an
instance of WWW::Mechanize::Page from the body of the File object, or
by some other method, as I need to get the links.
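For reference, the heart of what I'm running is just this (a sketch;
the real host is elided, so the URL below is a placeholder):

  require 'rubygems'
  require 'mechanize'

  agent = WWW::Mechanize.new
  page = agent.get('http://www.example.com/menus.htm')
  # Only works when get returns a Page; a File has no #links.
  page.links.each { |link| puts link.href }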

Any ideas?

Logfile created on Sun Feb 03 18:20:36 -0500 2008 by logger.rb/1.5.2.9

I, [2008-02-03T18:20:36.381042 #15528] INFO -- : Net::HTTP::Get: /menus.htm
D, [2008-02-03T18:20:36.478723 #15528] DEBUG -- : request-header: accept-language => en-us,en;q=0.5
D, [2008-02-03T18:20:36.478919 #15528] DEBUG -- : request-header: connection => keep-alive
D, [2008-02-03T18:20:36.479000 #15528] DEBUG -- : request-header: accept => */*
D, [2008-02-03T18:20:36.479073 #15528] DEBUG -- : request-header: accept-encoding => gzip,identity
D, [2008-02-03T18:20:36.479147 #15528] DEBUG -- : request-header: user-agent => Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/418 (KHTML, like Gecko) Safari/417.9.3
D, [2008-02-03T18:20:36.479221 #15528] DEBUG -- : request-header: accept-charset => ISO-8859-1,utf-8;q=0.7,*;q=0.7
D, [2008-02-03T18:20:36.479295 #15528] DEBUG -- : request-header: keep-alive => 300
D, [2008-02-03T18:20:36.511382 #15528] DEBUG -- : Read 605 bytes
D, [2008-02-03T18:20:36.516205 #15528] DEBUG -- : Read 1141 bytes
D, [2008-02-03T18:20:36.516409 #15528] DEBUG -- : response-header: last-modified => Sat, 17 Feb 2007 23:40:30 GMT
D, [2008-02-03T18:20:36.516486 #15528] DEBUG -- : response-header: connection => Keep-Alive
D, [2008-02-03T18:20:36.516559 #15528] DEBUG -- : response-header: content-type => text/html
D, [2008-02-03T18:20:36.516631 #15528] DEBUG -- : response-header: etag => "4688-475-9e16f780", "4688-475-9e16f780"
D, [2008-02-03T18:20:36.516702 #15528] DEBUG -- : response-header: date => Sun, 03 Feb 2008 23:24:11 GMT
D, [2008-02-03T18:20:36.516773 #15528] DEBUG -- : response-header: server => Apache-AdvancedExtranetServer
D, [2008-02-03T18:20:36.516845 #15528] DEBUG -- : response-header: content-length => 1141
D, [2008-02-03T18:20:36.516918 #15528] DEBUG -- : response-header: keep-alive => timeout=15, max=100
D, [2008-02-03T18:20:36.516990 #15528] DEBUG -- : response-header: accept-ranges => bytes, bytes
I, [2008-02-03T18:20:36.517359 #15528] INFO -- : status: 200
I, [2008-02-03T18:21:40.578768 #15591] INFO -- : Net::HTTP::Get: /menus.htm
D, [2008-02-03T18:21:40.704310 #15591] DEBUG -- : request-header: accept-language => en-us,en;q=0.5
D, [2008-02-03T18:21:40.704504 #15591] DEBUG -- : request-header: connection => keep-alive
D, [2008-02-03T18:21:40.704582 #15591] DEBUG -- : request-header: accept => */*
D, [2008-02-03T18:21:40.704657 #15591] DEBUG -- : request-header: accept-encoding => gzip,identity
D, [2008-02-03T18:21:40.704732 #15591] DEBUG -- : request-header: user-agent => Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/418 (KHTML, like Gecko) Safari/417.9.3
D, [2008-02-03T18:21:40.704806 #15591] DEBUG -- : request-header: accept-charset => ISO-8859-1,utf-8;q=0.7,*;q=0.7
D, [2008-02-03T18:21:40.704879 #15591] DEBUG -- : request-header: keep-alive => 300
D, [2008-02-03T18:21:40.740010 #15591] DEBUG -- : Read 681 bytes
D, [2008-02-03T18:21:40.740522 #15591] DEBUG -- : Read 1141 bytes
D, [2008-02-03T18:21:40.740674 #15591] DEBUG -- : response-header: last-modified => Sat, 17 Feb 2007 23:40:30 GMT
D, [2008-02-03T18:21:40.740755 #15591] DEBUG -- : response-header: connection => Keep-Alive
D, [2008-02-03T18:21:40.740829 #15591] DEBUG -- : response-header: content-type => text/html
D, [2008-02-03T18:21:40.740904 #15591] DEBUG -- : response-header: etag => "4688-475-9e16f780", "4688-475-9e16f780"
D, [2008-02-03T18:21:40.740978 #15591] DEBUG -- : response-header: date => Sun, 03 Feb 2008 23:25:15 GMT
D, [2008-02-03T18:21:40.741053 #15591] DEBUG -- : response-header: server => Apache-AdvancedExtranetServer
D, [2008-02-03T18:21:40.741127 #15591] DEBUG -- : response-header: content-length => 1141
D, [2008-02-03T18:21:40.741200 #15591] DEBUG -- : response-header: keep-alive => timeout=15, max=100
D, [2008-02-03T18:21:40.741273 #15591] DEBUG -- : response-header: accept-ranges => bytes, bytes
I, [2008-02-03T18:21:40.741640 #15591] INFO -- : status: 200
I, [2008-02-03T18:21:44.596803 #15596] INFO -- : Net::HTTP::Get: /menus.htm
D, [2008-02-03T18:21:44.664035 #15596] DEBUG -- : request-header: accept-language => en-us,en;q=0.5
D, [2008-02-03T18:21:44.664264 #15596] DEBUG -- : request-header: connection => keep-alive
D, [2008-02-03T18:21:44.664345 #15596] DEBUG -- : request-header: accept => */*
D, [2008-02-03T18:21:44.664417 #15596] DEBUG -- : request-header: accept-encoding => gzip,identity
D, [2008-02-03T18:21:44.664488 #15596] DEBUG -- : request-header: user-agent => Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/418 (KHTML, like Gecko) Safari/417.9.3
D, [2008-02-03T18:21:44.664559 #15596] DEBUG -- : request-header: accept-charset => ISO-8859-1,utf-8;q=0.7,*;q=0.7
D, [2008-02-03T18:21:44.664630 #15596] DEBUG -- : request-header: keep-alive => 300
D, [2008-02-03T18:21:44.698991 #15596] DEBUG -- : Read 605 bytes
D, [2008-02-03T18:21:44.701238 #15596] DEBUG -- : Read 1141 bytes
D, [2008-02-03T18:21:44.701421 #15596] DEBUG -- : response-header: last-modified => Sat, 17 Feb 2007 23:40:30 GMT
D, [2008-02-03T18:21:44.701496 #15596] DEBUG -- : response-header: connection => Keep-Alive
D, [2008-02-03T18:21:44.701566 #15596] DEBUG -- : response-header: content-type => text/html
D, [2008-02-03T18:21:44.701638 #15596] DEBUG -- : response-header: etag => "4688-475-9e16f780", "4688-475-9e16f780"
D, [2008-02-03T18:21:44.701708 #15596] DEBUG -- : response-header: date => Sun, 03 Feb 2008 23:25:19 GMT
D, [2008-02-03T18:21:44.701779 #15596] DEBUG -- : response-header: server => Apache-AdvancedExtranetServer
D, [2008-02-03T18:21:44.701848 #15596] DEBUG -- : response-header: content-length => 1141
D, [2008-02-03T18:21:44.701919 #15596] DEBUG -- : response-header: keep-alive => timeout=15, max=100
D, [2008-02-03T18:21:44.702133 #15596] DEBUG -- : response-header: accept-ranges => bytes, bytes
I, [2008-02-03T18:21:44.702519 #15596] INFO -- : status: 200
I, [2008-02-03T18:21:46.272708 #15602] INFO -- : Net::HTTP::Get: /menus.htm
D, [2008-02-03T18:21:46.332880 #15602] DEBUG -- : request-header: accept-language => en-us,en;q=0.5
D, [2008-02-03T18:21:46.333074 #15602] DEBUG -- : request-header: connection => keep-alive
D, [2008-02-03T18:21:46.333147 #15602] DEBUG -- : request-header: accept => */*
D, [2008-02-03T18:21:46.333218 #15602] DEBUG -- : request-header: accept-encoding => gzip,identity
D, [2008-02-03T18:21:46.333288 #15602] DEBUG -- : request-header: user-agent => Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/418 (KHTML, like Gecko) Safari/417.9.3
D, [2008-02-03T18:21:46.333360 #15602] DEBUG -- : request-header: accept-charset => ISO-8859-1,utf-8;q=0.7,*;q=0.7
D, [2008-02-03T18:21:46.333431 #15602] DEBUG -- : request-header: keep-alive => 300
D, [2008-02-03T18:21:46.361484 #15602] DEBUG -- : Read 0 bytes
D, [2008-02-03T18:21:46.362406 #15602] DEBUG -- : Read 948 bytes
D, [2008-02-03T18:21:46.365163 #15602] DEBUG -- : Read 1141 bytes
D, [2008-02-03T18:21:46.365336 #15602] DEBUG -- : response-header: last-modified => Sat, 17 Feb 2007 23:40:30 GMT
D, [2008-02-03T18:21:46.365410 #15602] DEBUG -- : response-header: connection => Keep-Alive
D, [2008-02-03T18:21:46.365481 #15602] DEBUG -- : response-header: content-type => text/plain
D, [2008-02-03T18:21:46.368645 #15602] DEBUG -- : response-header: etag => "4688-475-9e16f780"
D, [2008-02-03T18:21:46.368781 #15602] DEBUG -- : response-header: date => Sun, 03 Feb 2008 23:25:21 GMT
D, [2008-02-03T18:21:46.368855 #15602] DEBUG -- : response-header: server => Apache-AdvancedExtranetServer
D, [2008-02-03T18:21:46.368927 #15602] DEBUG -- : response-header: content-length => 1141
D, [2008-02-03T18:21:46.368998 #15602] DEBUG -- : response-header: keep-alive => timeout=15, max=100
D, [2008-02-03T18:21:46.369070 #15602] DEBUG -- : response-header: age => 1
D, [2008-02-03T18:21:46.369141 #15602] DEBUG -- : response-header: accept-ranges => bytes
I, [2008-02-03T18:21:46.369512 #15602] INFO -- : status: 200

Since Mechanize is used to parse forms and HTML, that makes sense: if
you don't have an HTML page (i.e. one with Content-Type: text/html),
then you can't parse it as HTML.

Yes you can: use pluggable parsers.

The content type for this page is "text/plain", which I think is part
of the problem.

A page with a content type of 'text/plain' is telling you that the page
is not HTML. Are you saying that the page is actually HTML even though
the server says that it does not contain HTML?
Some web servers are broken and simply don't care, and some admins
serve everything as text/plain.

Carl B. wrote:

The page is HTML. Below, I included the log. It shows the page's
content type as "text/html" for the first few attempts and as
"text/plain" on the last one.

I'm not sure how showing me the log is evidence that, even though the
server says the page is 'text/plain', it really contains HTML.

All I need to know is how to get a Page instead of a File, either by
extending Mechanize, by creating an instance of WWW::Mechanize::Page
from the body of the File object

Page#new() takes a URI as an argument. So it seems like you could save
the file to disk, then create a new Page by providing a URI with the
file:// scheme.
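A rough sketch of that idea (hedged: Page.new's exact signature varies
across Mechanize versions, and the (uri, response, body, code) form
used here is an assumption, so check the docs for your version):

  file = agent.get('http://www.example.com/menus.htm')  # came back as a ::File
  page = WWW::Mechanize::Page.new(
    file.uri,                           # or a file:// URI if you saved it to disk
    { 'content-type' => 'text/html' },  # claim it is HTML so it gets parsed
    file.body,
    '200'
  )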

Marcin R. wrote:

Since Mechanize is used to parse forms and HTML, that makes sense: if
you don't have an HTML page (i.e. one with Content-Type: text/html),
then you can't parse it as HTML.

Yes you can: use pluggable parsers.

Explain how you would parse plain text such as:

Hi,

My name is Sally.

Yours Truly,
Sally

as HTML?? What's the <title>? Which part is a <form>?

7stud -- wrote:

page instead of a File, either by extending Mechanize, by creating an
instance of WWW::Mechanize::Page from the body of the File object

Page#new() takes a URI as an argument. So it seems like you could save
the file to disk, then create a new Page by providing a URI with the
file:// scheme.

Or, instead of doing that roundabout hack, you could use pluggable
parsers, a feature already built into Mechanize, to force it to treat
text/plain like HTML with a simple one-liner.
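Concretely, the one-liner is just this (a sketch, assuming your agent
is called agent):

  # Register the HTML parser for text/plain responses as well, so get
  # returns a WWW::Mechanize::Page for them from now on.
  agent.pluggable_parser['text/plain'] = WWW::Mechanize::Page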

Or use the more complex solution that I posted, which switches the
pluggable parser only when you explicitly ask for it on that request
via get_html, and which cleans up after itself afterwards.

7stud -- wrote:

My name is Sally.

Yours Truly,
Sally

as HTML?? What's the <title>? Which part is a <form>?
Do you know what MIME types are? Servers are required by the HTTP
specification to provide a MIME type, and content should be interpreted
according to it: if it's text/plain, it should just be displayed; if
it's application/zip, it should be saved; and so on.
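In other words, a well-behaved client dispatches on the declared type,
roughly like this (an illustrative sketch, not Mechanize's actual code;
the URL is a placeholder):

  require 'net/http'
  require 'uri'

  res = Net::HTTP.get_response(URI.parse('http://www.example.com/menus.htm'))
  case res['content-type']
  when %r{text/html}  then puts 'parse it as HTML'
  when %r{text/plain} then puts 'just display it'
  else                     puts 'save it to disk'
  end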

BUT since most servers don't implement this properly, or have to be
configured to, or a PHP CGI script (or Ruby, for that matter) might
alter the type, and sometimes does, HTML can end up served with the
MIME type text/plain.

Since Mechanize follows the standard, it assumes that data with the
MIME type text/plain is in fact plain text, just like the sample you
provided. But what if it's actually a web page, which Carl clearly
explained in the initial post? Then you have to force Mechanize to
treat it as HTML. Clear enough?
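Putting it together, the whole fix for Carl's case is a sketch like
this (placeholder URL; it uses the pluggable-parser one-liner from
above):

  require 'rubygems'
  require 'mechanize'

  agent = WWW::Mechanize.new
  # Force pages mislabeled as text/plain through the HTML parser.
  agent.pluggable_parser['text/plain'] = WWW::Mechanize::Page

  page = agent.get('http://www.example.com/menus.htm')
  page.links.each { |link| puts link.href }  # now a Page, so #links works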