Scraping data from a webpage where the data is loaded dynamically

musicdenotation · February 6, 2014, 1:06pm

Hi

I am parsing a web page where the data is loaded dynamically when a
search criterion is given, but the browser URL does not change; it
stays “https://www.kleyntrucks.com/trucks/tractorunit/”. So the code
below does not fetch the data matching the search criterion. Suppose I
set the search field “Matriculation year” to 2003 to 2005: the URL is
still “https://www.kleyntrucks.com/trucks/tractorunit/”, so the results
I get are not the filtered ones I expect.

How can I handle this situation?

require 'nokogiri'
require 'open-uri'

# Fetches only the unfiltered listing page
doc = Nokogiri::HTML(URI.open('https://www.kleyntrucks.com/trucks/tractorunit/'))

On the other hand, the
website (Samsung Mobile: Buy Samsung Mobile Phones Online with Exciting Price and Offers @ Flipkart)
behaves well. Suppose I want to scrape the page when Price has
been set to 5001-10000: the browser then shows an
equivalent URL
(Samsung Mobile: Buy Samsung Mobile Phones Online with Exciting Price and Offers @ Flipkart).

I can then use that URL as below, and my code gets the
correct data:

require 'open-uri'
doc =
Nokogiri::HTML(open(Samsung Mobile: Buy Samsung Mobile Phones Online with Exciting Price and Offers @ Flipkart))

How do I then proceed with the first
case (https://www.kleyntrucks.com/trucks/tractorunit/)? Is there any
way?

my-ruby · February 6, 2014, 1:19pm

You could use mechanize, which lets you click buttons, fill in
forms, etc. before parsing:
http://mechanize.rubyforge.org/GUIDE_rdoc.html

Or, if JavaScript support is required, you could use watir to load your
page before parsing with nokogiri:
http://watirwebdriver.com/

my-ruby · February 6, 2014, 1:48pm

How can I handle this situation ?

My first thought is that you are going to have to open the browser,
interact with the page, and then grab the html source.
I would use watir-webdriver but there are other options.

Here is an example:

require 'watir-webdriver' # since renamed to the 'watir' gem

@browser = Watir::Browser.new :chrome
@browser.goto 'https://www.kleyntrucks.com/trucks/tractorunit/'

sleep 2
xpath_matriculation_year = '//*[@id="imprp0"]/div[1]'
@browser.div(xpath: xpath_matriculation_year).click

xpath_beginning_year = '//*[@id="imprp0"]/div[2]/div/div[5]/div[1]/input'
@browser.text_field(xpath: xpath_beginning_year).set '2003'

xpath_ending_year = '//*[@id="imprp0"]/div[2]/div/div[5]/div[2]/input'
@browser.text_field(xpath: xpath_ending_year).set '2005'

# Odd, but needed: if you don't leave the field, the page refresh
# resets the value you just set.
@browser.text_field(xpath: xpath_beginning_year).click

sleep 5
page_html = @browser.html

# Then parse page_html with Nokogiri, e.g. Nokogiri::HTML(page_html)
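
The fixed sleep calls in the example can be fragile: too short and the page has not finished refreshing, too long and every run is slow. A common alternative is to poll for a condition. Watir ships its own wait helpers; the sketch below is only a generic, hypothetical helper (wait_until is not a Watir method) to show the idea in plain Ruby.

```ruby
# Hypothetical helper (not part of Watir): poll a block until it returns
# a truthy value or the timeout expires.
def wait_until(timeout: 10, interval: 0.25)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "condition not met within #{timeout} seconds"
    end
    sleep interval
  end
end

# Possible usage with the browser from the example (assumed, not run here):
#   wait_until { @browser.div(xpath: xpath_matriculation_year).exists? }
#   page_html = @browser.html
```

The block is re-evaluated every interval seconds, so the script continues as soon as the element or result appears instead of always waiting the full delay.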

Michael

my-ruby · February 6, 2014, 2:55pm

If no JavaScript is required, then mechanize is a quick and invisible
alternative to watir. Have you tried that yet?

my-ruby · February 6, 2014, 2:37pm

unknown wrote in post #1135823:

How can I handle this situation ?

My first thought is that you are going to have to open the browser,
interact with the page, and then grab the html source.
I would use watir-webdriver but there are other options.

Yes, ‘selenium-webdriver’ or ‘watir-webdriver’ would help here. But I
am looking for a way to do this without a webdriver.

Would this library be helpful:
Class: Net::HTTP (Ruby 2.1.0)?

Or please tell me which other options you had in mind.

my-ruby · February 8, 2014, 6:57am

Michael Hansen

Hi,

I do this kind of web data aggregation daily, though at the moment I’m
using Python.

This is more of a workflow issue than one solved with code.

What I do is open Firebug (or DevTools, etc.) and look at the network
requests as I interact with the page.

You will have to locate which response brings back the partial
markup/JSON/XML that the page renders in this case. Once you know the
URL that returns the data, you need to look at the request to see the
parameters it passes. Then you’ve got what you need to scrape the data
using the right partial/API call.
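
As a rough illustration of that workflow in Ruby, the sketch below rebuilds a request from values copied out of the network tab. The endpoint, parameter names, and headers are hypothetical placeholders, and the X-Requested-With header is only a common convention for AJAX calls, so whether a given site checks it is an assumption.

```ruby
require 'net/http'
require 'uri'

# Build a GET request that mimics what the browser's XHR sent.
# url, params, and headers are whatever you copied from the network tab.
def build_xhr_request(url, params = {}, headers = {})
  uri = URI.parse(url)
  uri.query = URI.encode_www_form(params) unless params.empty?
  request = Net::HTTP::Get.new(uri)
  # Many sites mark AJAX calls with this header (an assumption for any given site).
  request['X-Requested-With'] = 'XMLHttpRequest'
  headers.each { |name, value| request[name] = value }
  [uri, request]
end

# Send the request and return the response body.
def fetch(uri, request)
  Net::HTTP.start(uri.hostname, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request).body
  end
end

# Hypothetical endpoint and parameters, for illustration only.
uri, request = build_xhr_request(
  'https://example.com/search/results',
  { 'year_from' => 2003, 'year_to' => 2005 },
  { 'Accept' => 'text/html' }
)
puts request.path # path plus query string that would be requested
# body = fetch(uri, request) # uncomment to actually perform the request
```

Keeping the request construction separate from the network call makes it easy to compare the path and headers against what the browser sent before anything goes over the wire.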

Cheers,

Joe

my-ruby · February 8, 2014, 7:48am

The problem is that when I set the search field “Matriculation year” to
2005 to 2014, I see the URL
“https://www.kleyntrucks.com/truck/add-facet-value/field/imprp0/from/2005”
in the Firebug network tab (request), and the response tab shows the
correct HTML, the same as what is displayed on the page.

But I get completely different response HTML when I run the code
below:

require 'net/http'
require 'uri'

uri = URI.parse('https://www.kleyntrucks.com/truck/add-facet-value/field/imprp0/from/2005')

response = Net::HTTP.get_response(uri)

# Save the raw response body for inspection
File.open('/home/kirti/input.txt', 'w') do |file|
  file.puts response.body
end

And that is the main problem. Why am I not getting the same response as
the one shown in the Firebug network tab?
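
One possible explanation, not confirmed anywhere in this thread, is that the browser's request carries a session cookie and extra headers that a bare Net::HTTP call does not send. The sketch below shows how cookies from a first page load could be replayed on the follow-up request. The helper names cookie_header and fetch_with_session are made up for illustration, and it assumes both URLs are on the same host.

```ruby
require 'net/http'
require 'uri'

# Turn the Set-Cookie header values of a response into one Cookie header
# value, keeping only the "name=value" part of each cookie.
def cookie_header(set_cookie_values)
  Array(set_cookie_values).map { |c| c.split(';', 2).first.strip }.join('; ')
end

# Visit the listing page first to obtain a session, then call the XHR URL
# with the same cookies, as the browser would. Whether this site needs
# exactly this is an assumption. Both URLs are assumed to share one host.
def fetch_with_session(page_url, xhr_url)
  page_uri = URI.parse(page_url)
  xhr_uri = URI.parse(xhr_url)
  Net::HTTP.start(page_uri.hostname, page_uri.port,
                  use_ssl: page_uri.scheme == 'https') do |http|
    first = http.request(Net::HTTP::Get.new(page_uri))
    request = Net::HTTP::Get.new(xhr_uri)
    request['Cookie'] = cookie_header(first.get_fields('Set-Cookie'))
    request['X-Requested-With'] = 'XMLHttpRequest'
    request['Referer'] = page_url
    http.request(request).body
  end
end

# Offline demonstration with made-up cookie values:
puts cookie_header(['PHPSESSID=abc123; path=/; HttpOnly', 'lang=en; path=/'])
# prints "PHPSESSID=abc123; lang=en"
```

Comparing the request headers in the Firebug network tab with what the script sends is the quickest way to see which of these, if any, the server actually depends on.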

my-ruby · February 8, 2014, 7:15am

Joseph P. wrote in post #1136006:

You will have to locate which response brings back the partial
markup/json/xml that the page renders in this case. Once you know the
URL that returns the data, you need to look at the request to see the
parameters it passes. Then you’ve got what you need to scrape the data
using the right partial/api call.

Cheers,

Joe

Yes, using Firebug as you said, I got the link
https://www.kleyntrucks.com/truck/add-facet-value/field/imprp0/from/2005
when I set the search field “Matriculation year” to
2005 to 2014. What do I need to do now?