Problems with mechanize and fields embedded in tables

I’m working with the following versions:

ruby      1.8.2
libwww-mechanize-ruby  0.6.10

and have run across an odd problem. One site that I’m trying to scrape
has started embedding form fields inside of tables, and mechanize no
longer recognizes them as fields.

The fields are there in the HTML code, but aren’t accessible to
mechanize. I’ve tried a couple of work-arounds, but field_add! doesn’t
seem to support adding check boxes or file upload fields (is there
another way to add them explicitly?), and I can’t see any other way to
find those embedded fields.

If this is a bug in mechanize, how do I report it? If it’s a bug in the
coder, what can I do to resolve the problem?

In the course of debugging, I tried this:

require 'mechanize'
require 'pp'   # pp is not loaded by default on ruby 1.8

agent = WWW::Mechanize.new
selection = 'http://seeker.dice.com/jobsearch/servlet/JobSearch?op=302&dockey=xml/4/6/46ab274a1ab667a09cd9aac11c6bef37@endecaindex'
page = agent.get(selection)
# follow the application link, then grab the form by name
page = agent.click page.links.text('Click Here to Apply')
reply_form = page.forms.with.name('APPLICATION_FORM').first
pp reply_form

As you can see, the SEEKER_CC checkbox and RESUME_FILE filename fields
aren’t showing up, but they ARE in the HTML. I suppose it helps if you
have access to the data sources and the methodology of the (error-prone)
programmer that’s accessing them. :)

On Tue, Oct 30, 2007 at 12:00:03PM -0700, Todd A. Jacobs wrote:

mechanize. I’ve tried a couple of work-arounds, but field_add! doesn’t
seem to support adding check boxes or file upload fields (is there

I’ve managed to add the fields explicitly:

carbon = WWW::Mechanize::RadioButton.new('SEEKER_CC', nil, true, reply_form)
upload_field = WWW::Mechanize::FileUpload.new('RESUME_FILE', 'foo')

reply_form.checkboxes.push(carbon)
reply_form.file_uploads.push(upload_field)

but this seems kind of kludgy. I’m still looking for a better way.
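
If the manual push stays in the script for a while, a small guard keeps it from adding the same field twice once mechanize starts seeing the fields again. This is only a sketch: push_unless_present and FakeField are made-up names, FakeField stands in for a real mechanize field (only its name method is used), and 'AGREE' is an invented field name.

```ruby
# Push a field onto a list unless one with the same name is already there.
# `list` is meant to behave like reply_form.checkboxes or
# reply_form.file_uploads, i.e. an Array of objects that respond to `name`.
def push_unless_present(list, field)
  return false if list.any? { |existing| existing.name == field.name }
  list.push(field)
  true
end

# Hypothetical stand-in for a mechanize field object.
FakeField = Struct.new(:name)

checkboxes = [FakeField.new('AGREE')]
push_unless_present(checkboxes, FakeField.new('SEEKER_CC'))  # added
push_unless_present(checkboxes, FakeField.new('SEEKER_CC'))  # skipped
puts checkboxes.map { |f| f.name }.inspect
```

With the real form, the calls would presumably look like push_unless_present(reply_form.checkboxes, carbon) and push_unless_present(reply_form.file_uploads, upload_field).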

Todd A. Jacobs wrote:

aren’t showing up, but they ARE in the HTML. I suppose it helps if you
have access to the data sources and the methodology of the (error-prone)
programmer that’s accessing them. :)

It sounds like JavaScript may be adding the fields you want. When you
load the page in a browser, the browser’s JavaScript engine runs and
can add HTML to the page. However, when you grab a page with
mechanize, you get the pre-JavaScript page, and as far as I know,
mechanize cannot interpret the JavaScript and change the HTML based on
what the script says to do.

Well-designed websites build their pages so that users without
JavaScript enabled are served simpler pages that have all the HTML
required for forms and for navigating around the website.
The trick is getting the server to send you those pages. You have to be
good with html and js and dig around a bit to figure it out. Or, if the
site has a lot of traffic, there might be an article on how to do it.

On 10/30/07, 7stud – wrote:

Well designed websites design their pages so that users without
javascript enabled are served simpler pages that have all the required
html for forms and the necessary html to navigate around the website.
The trick is getting the server to send you those pages. You have to be
good with html and js and dig around a bit to figure it out.

Every browser will let you turn off JavaScript; that’s the easy way to
get served a simple version, which is probably what you want.

On Wed, Oct 31, 2007 at 06:14:01AM +0900, 7stud – wrote:

It sounds like javascript may be adding the fields you want. When you

Nope. You can see the fields in lynx, so it’s definitely not
client-side.
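
The lynx check can be approximated from Ruby by scanning the raw markup for input tags and comparing the names against what mechanize reports. A rough sketch follows: input_names is a made-up helper, the sample markup is invented for illustration, and the regex scan is not a real parser, so treat its result as a hint only.

```ruby
# Return the name attribute of every <input> tag found in a raw HTML string.
# Tags without a name attribute are skipped.
def input_names(html)
  html.scan(/<input\b[^>]*>/i).map { |tag|
    match = tag.match(/\bname\s*=\s*["']?([^"'\s>]+)/i)
    match && match[1]
  }.compact
end

# Hypothetical markup with the fields nested inside a table.
sample = <<HTML
<form name="APPLICATION_FORM">
  <table><tr><td>
    <input type="checkbox" name="SEEKER_CC">
    <input type="file" name="RESUME_FILE">
  </td></tr></table>
</form>
HTML

puts input_names(sample).inspect
```

Assuming page.body holds the markup mechanize fetched, input_names(page.body) could be compared with the names in reply_form.fields, reply_form.checkboxes and reply_form.file_uploads. Any name that shows up in the raw markup but not in the form was sent by the server and lost during parsing.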