Hi all,

I want to extract some information from a web page. I got the page source with "View Page Source" and saved it as an HTML file on my computer, then used Hpricot to extract the information I need. To my surprise, I can only extract about 20 records from this page. I know I am missing many, because I can see many more records on the page itself; it looks like the saved source contains only about 20 of them. Any suggestions on how to get all the records from this page?

One example of the records I am interested in looks like this:

Mole: signs of trouble ABCDE: Asymmetry Border irregular Colour irregular Diameter usually > 0.5cm Elevation irregular

Thanks,

##############################
require 'hpricot'
require 'open-uri'

file_name = "https://twitter.com/master_usmle"
f = open(file_name)
doc = Hpricot(f)

# get the questions
questions = []
doc.search("/html/body/div//p[@class='js-tweet-text tweet-text']").each do |c|
  c = c.inner_html.to_s
  puts c
  c = c.split(/ |'/).join("\"") if c.match(/ |'/)
  questions << c
end
on 2014-08-06 18:16
on 2014-08-08 13:44
Because they are not in the page source. As you scroll down, more tweets are loaded dynamically. Have you considered using Twitter's API? https://dev.twitter.com/docs/api/1.1
on 2014-08-08 23:24
But I had already scrolled down to the end of the page before I viewed the source. I just checked the link you posted. I see that the maximum number of tweets Twitter will return is about 800. I will try it later. Thanks.
on 2014-08-09 08:04
> But I already scroll down to the end of the page

Right click -> "View Page Source" still shows the original HTML as it was first delivered, not the tweets loaded afterwards. What you can do, if your browser supports it, is switch to "Inspect Element" mode (F12 in Chrome) and save what you get there. In Chrome, right-click the top node (<html>) and select "Copy HTML".
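One way to check which saved file actually contains the extra tweets is to count the tweet paragraphs in each. The sketch below is only a rough sanity check: it uses a regular expression rather than an HTML parser, the helper name count_tweet_paragraphs is made up for illustration, and the two inline strings stand in for the files saved via "View Page Source" and "Copy HTML".

```ruby
# Count opening <p> tags whose class attribute includes the given class name.
# This is a rough regex-based count for a sanity check, not a real HTML parser.
def count_tweet_paragraphs(html, css_class = 'js-tweet-text')
  html.scan(/<p\b[^>]*\bclass\s*=\s*"([^"]*)"/i).count do |(classes)|
    classes.split.include?(css_class)
  end
end

# Tiny inline samples standing in for the two saved files (hypothetical data).
page_source = '<p class="js-tweet-text tweet-text">one</p>'
copied_dom  = '<p class="ProfileTweet-text js-tweet-text u-dir" dir="ltr">one</p>' \
              '<p class="ProfileTweet-text js-tweet-text u-dir" dir="ltr">two</p>'

puts "view source: #{count_tweet_paragraphs(page_source)} tweet paragraphs"
puts "copied DOM:  #{count_tweet_paragraphs(copied_dom)} tweet paragraphs"
```

With real files, the strings would come from something like File.read on each saved file; the file names are up to you.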
on 2014-08-09 18:19
Hi Dansei,

Thank you very much. It works: I extracted all the information from this website by using "Inspect Element" > "Copy HTML", saving the result as a local file, and then running a script to extract everything I need.

Two quick questions:

1) What is the difference between "View Page Source" and "Inspect Element"?

2) The information I want to extract is what is enclosed between <p> and </p>:

<p class="ProfileTweet-text js-tweet-text u-dir" dir="ltr">Brachial plexus organization "The Castrated Dog Turns Rabid": · From lateral to medial: Terminal branches Cords Divisions Trunks Roots</p>

I can search the local file with ("/html/body/div//p[@class='ProfileTweet-text js-tweet-text u-dir']") and get what I want, but I cannot use ("/html/body/div//p[@dir='ltr']"). So what is the difference between the class attribute and the dir attribute here?

Once again, thank you very much.
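On the second question: in standard XPath, class and dir are both ordinary attributes, and an [@name='value'] predicate compares the whole attribute value as a string. The sketch below illustrates this with REXML instead of Hpricot (a swapped-in library, so it does not explain why the dir query failed under Hpricot), on a small hand-written fragment with a shortened XPath. Both the fragment and the shortened path are assumptions for illustration.

```ruby
require 'rexml/document'

# A small well-formed stand-in for the saved page (hypothetical content).
html = <<~HTML
  <html><body><div>
    <p class="ProfileTweet-text js-tweet-text u-dir" dir="ltr">Brachial plexus organization</p>
    <p class="other">Unrelated paragraph</p>
  </div></body></html>
HTML

doc = REXML::Document.new(html)

# Match on the full class string: the predicate is an exact string comparison,
# so a partial value such as 'js-tweet-text' alone would not match here.
by_class = REXML::XPath.match(doc, "//p[@class='ProfileTweet-text js-tweet-text u-dir']")

# Match on the dir attribute in exactly the same way.
by_dir = REXML::XPath.match(doc, "//p[@dir='ltr']")

puts "by class: #{by_class.map(&:text).inspect}"
puts "by dir:   #{by_dir.map(&:text).inspect}"
```

In this fragment both queries select the same single paragraph, which suggests the two attributes are not treated differently by XPath itself.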
on 2014-08-09 19:38
on 2014-08-09 20:30
I got about 847 hits. After removing the duplicates, I have about 247 unique hits, which is what I want. Once again, thank you so much.
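The post does not say how the duplicates were removed. A minimal sketch of one way to do it in Ruby, assuming that two hits count as duplicates when their text is identical after whitespace is collapsed (the helper name unique_hits and the sample strings are made up for illustration):

```ruby
# Collapse runs of whitespace so copies that differ only in spacing compare
# equal, drop empty strings, then keep the first occurrence of each text.
def unique_hits(hits)
  hits.map { |text| text.gsub(/\s+/, ' ').strip }.reject(&:empty?).uniq
end

# Hypothetical extracted texts; the second differs from the first only in spacing.
hits = [
  'Mole: signs of trouble ABCDE',
  "Mole: signs of  trouble ABCDE\n",
  'Brachial plexus organization'
]

unique = unique_hits(hits)
puts "#{hits.size} hits, #{unique.size} unique"
```

Array#uniq keeps the original order, so the unique list stays in the order the tweets were extracted.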