Forum: Ruby Scraping data from a webpage where the data is loaded dynamically

Arup Rakshit (my-ruby)
on 2014-02-06 13:06
Hi

I am parsing a web page where the data is loaded dynamically when a
search criterion is given, but the browser URL does not change; it stays
"https://www.kleyntrucks.com/trucks/tractorunit/". For example, if I set
the search field "Matriculation year" to 2003 to 2005, the URL is still
"https://www.kleyntrucks.com/trucks/tractorunit/". So the code below does
not fetch the data matching the search criterion, and I do not get the
results I expect.

How can I handle this situation?

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open("https://www.kleyntrucks.com/trucks/tractorunit/"))

On the other hand, this website
(http://www.flipkart.com/mobiles/samsung~brand/pr?s...)
behaves well. Suppose I want to scrape the page when **Price** is
set to **5001-10000**: the browser then shows an equivalent URL
(http://www.flipkart.com/mobiles/samsung~brand/pr?p...).


So I can use this URL as below, and my code gets the correct data:

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open("http://www.flipkart.com/mobiles/samsung~brand/pr?p..."))

How should I then proceed in the first case
(https://www.kleyntrucks.com/trucks/tractorunit/)? Is there any way?
unknown (Guest)
on 2014-02-06 13:19
(Received via mailing list)
You could use Mechanize, which will allow you to click buttons, fill in
forms, etc. prior to parsing:
http://mechanize.rubyforge.org/GUIDE_rdoc.html

Or, if JavaScript support is required, you could use Watir to load your
page before parsing with Nokogiri:
http://watirwebdriver.com/
unknown (Guest)
on 2014-02-06 13:48
(Received via mailing list)
>
> How can I handle this situation ?
>
> require 'open-uri'
> doc =
> Nokogiri::HTML(open("https://www.kleyntrucks.com/trucks/tractorunit/"))
>
> On the other hand - the
> website(http://www.flipkart.com/mobiles/samsung~brand/pr?s...)
> seems good. Now suppose I want to scrap the page, when **Price** has
> been selected between **5001-10000**, then I would get also its
> equivalent url from the browser -
> (http://www.flipkart.com/mobiles/samsung~brand/pr?p[]=facets.price_range%255B%255D%3DRs.%2B5001%2B-%2BRs.%2B10000&p[]=sort%3Dfeatured&sid=tyy%2C4io&ref=3de01d19-4e68-4ba9-9b3f-9c43497931b0).
>
>
> Accordingly I can use this url in below, then all my code will get
> correct data :
>
> require 'open-uri'
> doc =
> Nokogiri::HTML(open(http://www.flipkart.com/mobiles/samsung~brand/pr?p[]=facets.price_range%255B%255D%3DRs.%2B5001%2B-%2BRs.%2B10000&p[]=sort%3Dfeatured&sid=tyy%2C4io&ref=3de01d19-4e68-4ba9-9b3f-9c43497931b0))
>
> How to then proceed with the first
> case(https://www.kleyntrucks.com/trucks/tractorunit/) ? Is there anyway
> ?
>

My first thought is that you are going to have to open the browser,
interact with the page, and then grab the html source.
I would use watir-webdriver but there are other options.

Here is an example:


require 'watir-webdriver'

@browser = Watir::Browser.new :chrome
@browser.goto 'https://www.kleyntrucks.com/trucks/tractorunit/'

sleep 2 # crude wait to give the page time to finish loading
xpath_matriculation_year = '//*[@id="imprp0"]/div[1]'
@browser.div(xpath: xpath_matriculation_year).click

xpath_beginning_year = '//*[@id="imprp0"]/div[2]/div/div[5]/div[1]/input'
@browser.text_field(xpath: xpath_beginning_year).set 2003

xpath_ending_year = '//*[@id="imprp0"]/div[2]/div/div[5]/div[2]/input'
@browser.text_field(xpath: xpath_ending_year).set 2005

# Odd, but needed: if you don't leave the field, the page refresh
# resets the value you just set.
@browser.text_field(xpath: xpath_beginning_year).click

sleep 5 # crude wait to give the filtered results time to load
page_html = @browser.html

Then parse page_html with Nokogiri.

Michael
Arup Rakshit (my-ruby)
on 2014-02-06 14:37
unknown wrote in post #1135823:
>>
>> How can I handle this situation ?
>>

>
> My first thought is that you are going to have to open the browser,
> interact with the page, and then grab the html source.
> I would use watir-webdriver but there are other options.

Yes, 'selenium-webdriver' or 'watir-webdriver' would help here. But I am
looking for a way to do this without a webdriver.

Would this library be helpful:
http://ruby-doc.org/stdlib-2.1.0/libdoc/net/http/r... ?

Or please tell me which other options you had in mind.
Joel Pearson (virtuoso)
on 2014-02-06 14:55
If no JavaScript is required, then mechanize is a quick and invisible
alternative to watir. Have you tried that yet?
Joseph Phillips (Guest)
on 2014-02-08 06:57
(Received via mailing list)
>
>
># odd, but needed or the page refresh resets the value of a field you set if you don't leave the field
>@browser.text_field(xpath: xpath_beginning_year).click
>
>sleep 5
>page_html = @browser.html
>
>^ then use the page_html in nokogiri
>
>Michael
>________
>Michael Hansen

Hi,

I do this kind of web data aggregation daily, though at the moment I'm
using Python.

This is more of a workflow issue than one solved with code.

What I do is open Firebug (or Devtools etc) and look at the net requests
as I interact with the page.

You will have to locate which response brings back the partial
markup/json/xml that the page renders in this case. Once you know the
URL that returns the data, you need to look at the request to see the
parameters it passes. Then you've got what you need to scrape the data
using the right partial/api call.
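
The workflow above can be sketched in Ruby roughly as follows. This is
only a sketch under assumptions: the endpoint URL, the parameter names
(facet, from, to) and the helper names build_replay_request and
fetch_partial are hypothetical placeholders, to be replaced with whatever
the browser's network tab actually shows for the request.

```ruby
require 'net/http'
require 'uri'

# Build a GET request that mimics what the browser sent.
# `endpoint` and `params` come from the network tab; the values used
# at the bottom of this sketch are made-up placeholders.
def build_replay_request(endpoint, params, extra_headers = {})
  uri = URI.parse(endpoint)
  uri.query = URI.encode_www_form(params)

  request = Net::HTTP::Get.new(uri)
  # Many sites mark background (XHR) calls with this header.
  request['X-Requested-With'] = 'XMLHttpRequest'
  extra_headers.each { |name, value| request[name] = value }
  [uri, request]
end

# Send the request and return the response body
# (partial HTML, JSON or XML, depending on the site).
def fetch_partial(uri, request)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request).body
  end
end

uri, request = build_replay_request(
  'https://example.com/search/partial', # hypothetical endpoint
  { 'facet' => 'year', 'from' => 2003, 'to' => 2005 }
)
puts uri # the full URL that would be requested
```

Calling fetch_partial(uri, request) would then perform the request; the
body can be handed to a JSON or HTML parser depending on what the site
returns.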

Cheers,

Joe
Arup Rakshit (my-ruby)
on 2014-02-08 07:15
Joseph Phillips wrote in post #1136006:


> You will have to locate which response brings back the partial
> markup/json/xml that the page renders in this case. Once you know the
> URL that returns the data, you need to look at the request to see the
> parameters it passes. Then you've got what you need to scrape the data
> using the right partial/api call.
>
> Cheers,
>
> Joe

Yes, using Firebug as you said, I found that the request goes to
https://www.kleyntrucks.com/truck/add-facet-value/...
when I set the search field "Matriculation year" to
2005 to 2014. What do I need to do now?
Arup Rakshit (my-ruby)
on 2014-02-08 07:48
The problem is this: when I set the search field "Matriculation year" to
2005 to 2014, the Firebug network tab shows a request to
"https://www.kleyntrucks.com/truck/add-facet-value/...",
and the response tab for that request shows the correct HTML, the same
as what the page displays.


But I get a completely different response HTML when I do the
following:

require "net/http"
require "uri"


uri = URI.parse("https://www.kleyntrucks.com/truck/add-facet-value/...")

response = Net::HTTP.get_response(uri)

File.open("/home/kirti/input.txt", 'w') do |file|
  file.puts response.body
end

And that is the main problem. Why am I not getting the same response as
the one shown in the Firebug network tab?
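
One possible explanation (an assumption, not something confirmed in this
thread) is that the endpoint depends on state which the browser sends
along and a bare Net::HTTP.get_response does not: session cookies set by
the first page load, and headers such as X-Requested-With. A minimal
sketch of carrying cookies from one response over to the next request
is below; the helper names cookie_header and fetch_with_session are
hypothetical, and the cookie lines in the example are made up.

```ruby
require 'net/http'
require 'uri'

# Turn the Set-Cookie header lines of one response into a single
# Cookie header value for the next request. Attributes such as
# Path or Expires are dropped; this is not a full cookie jar.
def cookie_header(set_cookie_lines)
  set_cookie_lines.map { |line| line.split(';').first.strip }.join('; ')
end

# Load the normal page first to obtain a session, then call the
# endpoint seen in the network tab with the same cookies.
# Assumes both URLs are on the same host, so one connection is reused.
def fetch_with_session(page_url, xhr_url)
  page_uri = URI.parse(page_url)
  xhr_uri  = URI.parse(xhr_url)

  Net::HTTP.start(page_uri.host, page_uri.port,
                  use_ssl: page_uri.scheme == 'https') do |http|
    first   = http.request(Net::HTTP::Get.new(page_uri))
    cookies = cookie_header(first.get_fields('Set-Cookie') || [])

    request = Net::HTTP::Get.new(xhr_uri)
    request['Cookie'] = cookies unless cookies.empty?
    request['X-Requested-With'] = 'XMLHttpRequest'
    request['Referer'] = page_url
    http.request(request).body
  end
end

# The cookie handling alone, with made-up header lines:
puts cookie_header(['PHPSESSID=abc123; path=/; HttpOnly', 'lang=en; path=/'])
# prints "PHPSESSID=abc123; lang=en"
```

Comparing the request headers shown in the network tab with the ones
sent by the script is the way to find out which of them, if any, the
server actually requires.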