Firefox HTML, my downloaded HTML, and Firebug HTML are different?

Hi, I'm a relatively new Rubyist and programmer in general. I'm currently
reading Everyday Scripting and trying out web scraping, using Amazon as a
target.

To determine suitable regular expressions I first just viewed the page
source via Firefox. Shortly after, I found Firebug. I noticed that there
were some differences between Firefox's source code and Firebug's:
Firebug seems to add, and maybe omit, code, and vice versa.
Some of my regular expressions would not work, even though they definitely
matched in the Firefox source view; I know because I copied the source into
a regex editor, applied my regexes, and they highlighted. This made me
question the HTML I was grabbing, so I saved it to a text file from my
code. When I viewed the text file, it too was different from the
Firefox code, which is why my regexes were not matching.

For the moment, assume my regexes are right; I'm more concerned
with why there are differences. Can anyone explain why this is
happening? Which version is the real source HTML?

Here is my code:

require 'open-uri'

def get_web_page_text(a_url)
  page = open(a_url)
  page.read
end

html = get_web_page_text('http://www.amazon.com/gp/product/0974514055')

File.open('html.txt', 'w') do |out|
  out << html
end

Correction: where I wrote

Some of my regular expressions would work ...

it should read

Some of my regular expressions would NOT work ...

Adam A. wrote:

def get_web_page_text(a_url)
  page = open(a_url)
  page.read
end

html = get_web_page_text('http://www.amazon.com/gp/product/0974514055')

File.open('html.txt', 'w') do |out|
  out << html
end

The “view source” function of Firefox shows you the source code of the
page as it was downloaded from the server. Firebug is more
sophisticated: it shows you the DOM tree of the document. JavaScript
can alter the DOM tree (which is essentially what AJAX does), so you
might be seeing the DOM tree after it has been modified by some
JavaScript code.
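This difference is exactly what can break a regex-based scraper. A minimal sketch (both HTML snippets are made up for illustration) showing how a pattern can match the raw downloaded source but miss the DOM-serialized version, because the parser inserts implied tags such as <tbody>:

```ruby
# Raw source, as downloaded from the server (what "view source" shows):
raw = "<table><tr><td>0974514055</td></tr></table>"

# The same markup after the browser built a DOM (what Firebug shows);
# note the <tbody> the parser inserted:
dom = "<table><tbody><tr><td>0974514055</td></tr></tbody></table>"

pattern = /<table><tr>/

puts(raw =~ pattern ? "raw: match" : "raw: no match")  # raw: match
puts(dom =~ pattern ? "dom: match" : "dom: no match")  # dom: no match
```

So a regex built from Firebug's view can fail against the raw HTML, and vice versa.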

Hello there.

I don't know why the HTML differs, but the browsers must be sending
different data to the server in their requests. Maybe you have some
cookies for the page in one of them and none in the other, or maybe the
Accept header is different. You can check what your Firefox sends using
Tamper Data:
https://addons.mozilla.org/en-US/firefox/addon/966 .

Now if you want your own requests from Ruby to be more flexible, use
Net::HTTP instead.

require 'net/http'

Net::HTTP.start("www.amazon.com") do |http|
  header = {
    "Accept"     => "*/*",
    "User-Agent" => "MyRubyProgram"
  }
  h, b = *http.get("/")
  p h
  p b

  p h.code
  p h.message
  p h.to_hash
end

Every browser cleans up invalid markup, and each one does it
differently. Firefox, for example, adds a <tbody> to every <table> when
it doesn't exist. Firebug shows you the cleaned-up source.

I once had to download a website because it was so crappy, and I
searched for the table entry by hand. It had a path like
"\html\body\table\tr\td\tr\center\font\b\font". Quite annoying, but it
sped up scraping.

You could try the hpricot gem to get data from websites if the regexes
become too complex.
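Before reaching for a parser, named captures can already make a complex scraping regex much more readable. A hedged sketch — the `priceLarge` class name and the HTML fragment are made-up stand-ins for whatever the real page uses:

```ruby
# Hypothetical price element; the class name is an assumption, not
# Amazon's actual markup.
html = '<b class="priceLarge">$10.17</b>'

# Named captures document what each group means:
if m = html.match(/class="priceLarge">\$(?<dollars>\d+)\.(?<cents>\d{2})</)
  puts "#{m[:dollars]} dollars, #{m[:cents]} cents"  # 10 dollars, 17 cents
end
```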

Thomas Bl. wrote:

  header = {
    "Accept"     => "*/*",
    "User-Agent" => "MyRubyProgram"
  }
  h, b = *http.get("/")

Sorry, it should be h, b = *http.get("/", header), of course.

Thank you, everyone, for your help so far.

I tackled the problem by not viewing the Firefox or Firebug source;
instead, I just saved and viewed the HTML downloaded via open(a_url)
with open-uri. I wasn't sure whether this did any tidying up like
Firefox or Firebug, but after various tries I could be confident that
what it downloaded was the real deal.

That meant, though, that I'd have to view the HTML in something like
Notepad, and it isn't easy to read. I really wish I could rely on the
code Firebug displays, as it's so easy to find the areas you need to
restrict your searches to.

As a side note, is there a plugin or library that sits between your
code and the target webpage and cleans the code as it's downloaded,
perhaps in an identical way to Firebug?

The upside would be that it would be easier to grab what you want, as
there would be a more regular structure; the downside, I guess, would
be longer run times. Just a thought, though.
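A very crude stand-in for such a cleanup layer — not what Firebug does, just a sketch of the idea — is normalizing whitespace before applying regexes, so patterns don't break on arbitrary line wrapping in the downloaded HTML:

```ruby
# Collapse runs of whitespace and remove whitespace between tags, so a
# regex like /<table><tr>/ matches regardless of the source formatting.
# This is a toy normalizer, not a real HTML tidy.
def normalize_html(html)
  html.gsub(/\s+/, ' ').gsub(/>\s+</, '><').strip
end

messy = "<table>\n  <tr>\n    <td>Everyday Scripting</td>\n  </tr>\n</table>"
puts normalize_html(messy)
# <table><tr><td>Everyday Scripting</td></tr></table>
```

A real cleanup layer would also close unclosed tags and fix nesting, which is what HTML Tidy-style tools do.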

I used hpricot a while ago and it wasn't so great on badly designed
webpages, so I ended up resorting to regexps. But if I find a nice
website I think I'll give it another try!

Why advertise that your HTML is sloppy?

Hi Phil,

It's not my HTML, though; it's a third party's website that I'm
scraping, so I can't fix the HTML.

On Aug 16, 2008, at 5:07 PM, Thomas W. wrote:

Every browser cleans up invalid markup. Each one has a different way
to do it. Firefox, for example, adds a <tbody> to every <table> when
it doesn't exist. Firebug shows you the cleaned up source.

Actually, if the table has no header, no footer, and only one body, the
<tbody> tag is not required but is implicitly assumed.

So it always exists in the DOM displayed by Firebug (as it is added,
and thus exists), but it does not when you manipulate the document with
a tool that does not build the DOM beforehand (the source viewer).

Regards,
Florian G.

Adam A. wrote:

As a side note, is there a plugin or library that sits between your
code and the target webpage and cleans the code as it's downloaded,
perhaps in an identical way to Firebug?

Why advertise that your HTML is sloppy?

At work, we use assert_xpath, assert_tidy, and LibXML in all our
functional tests. They scream bloody murder if we have a single
ill-formed ID. Then we clean up our html.erb and keep going.
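A minimal sketch of that kind of check, using only Ruby's standard-library REXML (this is a stand-in, not assert_tidy itself; REXML enforces XML well-formedness, which is stricter than what an HTML tidy tool checks):

```ruby
require 'rexml/document'

# Returns true if the markup parses as well-formed XML, false otherwise.
# A functional test could raise on false, "screaming" about bad markup.
def well_formed?(markup)
  REXML::Document.new(markup)
  true
rescue REXML::ParseException
  false
end

puts well_formed?("<table><tr><td>ok</td></tr></table>")  # true
puts well_formed?("<table><tr><td>broken</tr></table>")   # false (mismatched tags)
```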
