Forum: Ruby - Why can't I extract all the info from a Twitter webpage?

Li CN (alex-osu3)
on 2014-08-06 18:16
Hi all,

I want to extract some info from a webpage. I get the page source with
"View Page Source" and save it as an HTML file on my computer. Then I
use Hpricot to extract the info I need. To my surprise, I am only able
to extract about 20 records from this page. I know I am missing many,
because I can see many more records on the page itself. It looks like
the source I get only contains about 20 records. Any suggestions on how
to get all the records from this page?

Here is one example of a record I am interested in getting from this page:

Mole: signs of trouble
ABCDE:
Asymmetry
Border irregular
Colour irregular
Diameter usually > 0.5cm
Elevation irregular

Thanks,


##############################
require 'hpricot'
require 'open-uri'

file_name = "https://twitter.com/master_usmle"

# Fetch the page and parse it with Hpricot
f = open(file_name)
doc = Hpricot(f)

# Get the questions (one <p> per tweet)
questions = []
doc.search("/html/body/div//p[@class='js-tweet-text tweet-text']").each do |c|
  puts c = c.inner_html.to_s
  # Replace newlines and single quotes with double quotes
  c = c.split(/\n|'/).join("\"") if c.match(/\n|'/)
  questions << c
end
Dansei Yuuki (blutorange)
on 2014-08-08 13:44
Because it is not there. As you scroll down, more tweets are loaded
dynamically. Have you considered using twitter's api?

https://dev.twitter.com/docs/api/1.1
Li CN (alex-osu3)
on 2014-08-08 23:24
But I already scrolled down to the end of the page before I viewed the
source code.

I just checked the link you posted. I see that the maximum number of
tweets it can return is about 800. I will try it later.

Thanks.
Dansei Yuuki (blutorange)
on 2014-08-09 08:04
> But I already scrolled down to the end of the page

Right click -> "View Page Source" still shows the original HTML, without
the tweets that were loaded later. What you can do, if your browser
supports it, is switch to the "Inspect Element" mode (F12 in Chrome) and
save what you get there. In Chrome, right click on the top node (<html>)
and select "Copy HTML".
Li CN (alex-osu3)
on 2014-08-09 18:19
Hi Dansei,

Thank you very much. It works: I extracted all the info from this
website by using "Inspect Element" > "Copy HTML", saving the result as a
local file, and then running a script to extract the info I need.


Two quick questions: 1) What is the difference between "View Page
Source" and "Inspect Element"?

2) The info I would like to extract is what is included between <p> and </p>:


<p class="ProfileTweet-text js-tweet-text u-dir" dir="ltr">Brachial
plexus organization
"The Castrated Dog Turns Rabid":
· From lateral to medial:
Terminal branches
Cords
Divisions
Trunks
Roots</p>

I can use this line
("/html/body/div//p[@class='ProfileTweet-text js-tweet-text u-dir']")

to search the local file and get what I want but cannot use

("/html/body/div//p[@dir='ltr']")

So what is the difference between the class attribute and the dir
attribute here?


Once again, thank you very much.
Dansei Yuuki (blutorange)
on 2014-08-09 19:38
Attachment: rbforum.rb (561 Bytes)
Check your code; it works for me. Try the attached Ruby script, run on
this file https://www.sendspace.com/file/7cvuzc (extracted from
https://twitter.com/master_usmle).

> ruby rbforum.rb <html file>

It produces the following output:

> with </html/body/div//p[@dir='ltr']>:
>
> 889 hits
>
> Brachial plexus organization
> "The Castrated Dog Turns Rabid":
> · From lateral to medial:
> Terminal branches
> Cords
> Divisions
> Trunks
> Roots
> --------------------
> with </html/body/div//p[@class='ProfileTweet-text js-tweet-text u-dir']>:
>
> 833 hits
>
> Brachial plexus organization
> "The Castrated Dog Turns Rabid":
> · From lateral to medial:
> Terminal branches
> Cords
> Divisions
> Trunks
> Roots


There are some nodes with dir='ltr' you may not want, such as
> {elem <p class="ProfileCard-bio u-dir" dir="ltr"> "Right above it" </p>}
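To make the difference in hit counts concrete, here is a minimal sketch. It assumes REXML (from Ruby's standard distribution) as a stand-in for Hpricot, and the SAMPLE markup is hand-written for illustration, not real Twitter HTML. The helper name hits_for is hypothetical. The point is only that [@dir='ltr'] matches every <p> with that attribute, while [@class='...'] matches only those whose class attribute is exactly that string.

```ruby
require 'rexml/document'

# Hand-written sample standing in for the saved page (an assumption,
# much smaller and cleaner than the real markup).
SAMPLE = <<~XML
  <html><body><div>
    <p class="ProfileCard-bio u-dir" dir="ltr">Right above it</p>
    <p class="ProfileTweet-text js-tweet-text u-dir" dir="ltr">Brachial plexus organization</p>
    <p class="ProfileTweet-text js-tweet-text u-dir" dir="ltr">Mole: signs of trouble</p>
  </div></body></html>
XML

DIR_XPATH   = "/html/body/div//p[@dir='ltr']"
CLASS_XPATH = "/html/body/div//p[@class='ProfileTweet-text js-tweet-text u-dir']"

# Hypothetical helper: returns the text of every node matching the XPath.
def hits_for(xml, xpath)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, xpath).map(&:text)
end

puts "dir:   #{hits_for(SAMPLE, DIR_XPATH).size} hits"
puts "class: #{hits_for(SAMPLE, CLASS_XPATH).size} hits"
```

The dir query also picks up the profile bio paragraph, which is why it reports more hits than the class query.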

> difference between "View Page Source" and "Inspect Element"?

"View page source" just gives you the original HTML. "Inspect element"
lets you, well, inspect the page in detail. It shows the HTML after
JavaScript, CSS, etc. have been applied, i.e. the page as it looks right now.
https://www.webkit.org/blog/197/web-inspector-redesign/

Also,

> I see that the maximum number of tweets it can return is about 800.

https://dev.twitter.com/docs/api/1.1/get/statuses/...
> This method can only return up to 800 tweets. See Working with Timelines
> for instructions on traversing timelines.

https://dev.twitter.com/docs/working-with-timelines
> Such timelines can grow very large, so there are limits to how much of a
> timeline a client application may fetch in a single request. Applications
> must therefore iterate through timeline results in order to build a more
> complete list.
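
Iterating through a timeline like that can be sketched roughly as follows. This is only an illustration of the max_id idea from the timelines guide: FAKE_TIMELINE and fetch_page are hypothetical stand-ins for the real, authenticated API request, and fetch_all is a made-up helper name.

```ruby
# Hypothetical in-memory timeline, newest tweet first (ids 23 down to 1).
FAKE_TIMELINE = (1..23).map { |id| { id: id, text: "tweet #{id}" } }.reverse.freeze

# Hypothetical stand-in for one API request: returns up to `count` tweets
# whose id is <= max_id (or the newest ones when max_id is nil).
def fetch_page(count:, max_id: nil)
  pool = max_id ? FAKE_TIMELINE.select { |t| t[:id] <= max_id } : FAKE_TIMELINE
  pool.first(count)
end

# Walk backwards through the timeline until a page comes back empty.
def fetch_all(count: 10)
  tweets = []
  max_id = nil
  loop do
    page = fetch_page(count: count, max_id: max_id)
    break if page.empty?
    tweets.concat(page)
    # max_id is inclusive, so subtract 1 to skip the oldest tweet already seen.
    max_id = page.last[:id] - 1
  end
  tweets
end

puts fetch_all(count: 10).size
```

Subtracting 1 from the oldest id avoids fetching the same tweet twice at each page boundary.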
Li CN (alex-osu3)
on 2014-08-09 20:30
I get about 847 hits. After removing the duplicates, I get about 247
unique hits, which are what I want.
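
For reference, the duplicate removal can be sketched like this. The sample strings are made up, unique_hits is a hypothetical helper name, and stripping whitespace before comparing is an assumption on my part about what should count as a duplicate.

```ruby
# Hypothetical helper: normalise whitespace at the ends, drop empty
# entries, and keep the first occurrence of each remaining text.
def unique_hits(hits)
  hits.map(&:strip).reject(&:empty?).uniq
end

# Made-up sample; the real list had about 847 entries.
sample = [
  "Mole: signs of trouble\n",
  "Brachial plexus organization",
  "Mole: signs of trouble",
  ""
]

puts unique_hits(sample).size
```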


Once again thank you so much.