Why can't I extract all the info from a Twitter webpage?

addis_a · August 6, 2014, 6:16pm

Hi all,

I want to extract some info from a webpage. I get the page's source
using “View Page Source” and save it as an HTML file on my computer.
Then I use Hpricot to extract the info I need. To my surprise, I am only
able to extract about 20 records from this webpage. I know I am missing
many, because I can see many more records on the webpage that I just
cannot extract. It looks like the page source I get only contains about
20 records. Any suggestions on how to get all the records from this
webpage?

One example of what I am interested in getting from this page looks like this:

Mole: signs of trouble
ABCDE:
Asymmetry
Border irregular
Colour irregular
Diameter usually > 0.5cm
Elevation irregular

Thanks,

##############################
require 'hpricot'
require 'open-uri'

file_name = "https://twitter.com/master_usmle"

# On Ruby 3.0+, use URI.open(file_name); Kernel#open no longer accepts URLs.
f = open(file_name)
doc = Hpricot(f)

# Get the questions
questions = []
doc.search("/html/body/div//p[@class='js-tweet-text tweet-text']").each do |c|
  c = c.inner_html.to_s
  puts c
  c = c.split(/ |'/).join('"') if c.match(/ |'/)
  questions << c
end

alex-osu3 · August 8, 2014, 1:44pm

Because it is not there. As you scroll down, more tweets are loaded
dynamically. Have you considered using Twitter’s API?

https://dev.twitter.com/docs/api/1.1

alex-osu3 · August 8, 2014, 11:24pm

But I had already scrolled down to the end of the page before I viewed
the source code.

I just checked the link you posted. I see that the maximum number of
tweets Twitter can return is about 800. I will try it later.

Thanks.

alex-osu3 · August 9, 2014, 8:04am

But I already scroll down to the end of the page

Right click -> “View Page Source” still shows the original HTML. What
you can do, if your browser supports it, is switch to the “Inspect
Element” mode (F12 in Chrome) and save what you get there. For Chrome,
right click on the top node (<html>) and select “Copy HTML”.

alex-osu3 · August 9, 2014, 6:19pm

Hi Dansei,

Thank you very much. It works: I extracted all the info from this
website by using “Inspect Element” > “Copy HTML”, saving the result as a
local file, and then running a script to extract all the info I need.

Two quick questions:

1) What is the difference between “View Page Source” and “Inspect
Element”?

2) The info I would like to extract is the text between the opening and
closing <p> tags, for example:

Brachial plexus organization "The Castrated Dog Turns Rabid": · From lateral to medial: Terminal branches Cords Divisions Trunks Roots

I can use this expression

("/html/body/div//p[@class='ProfileTweet-text js-tweet-text u-dir']")

to search the local file and get what I want, but I cannot use

("/html/body/div//p[@dir='ltr']")

So what is the difference between the class attribute and the dir
attribute here?

Once again, thank you very much.

alex-osu3 · August 9, 2014, 8:30pm

I get about 847 hits. After removing the duplicates, I get about 247
unique hits, which are what I want.

Once again thank you so much.
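For anyone following along, removing the duplicates can be sketched roughly as below. The sample strings are a made-up stand-in for the extracted hits, and normalizing whitespace before comparing is an assumption about what should count as a duplicate, not something stated in the thread.

```ruby
# Hypothetical sample standing in for the extracted hits.
hits = [
  "Mole: signs of trouble",
  "Brachial plexus organization",
  "Mole: signs of trouble ",
  "Brachial plexus organization"
]

# Trim and collapse whitespace so copies that differ only in spacing compare
# equal, then keep the first occurrence of each text (Array#uniq keeps order).
unique_hits = hits.map { |text| text.strip.gsub(/\s+/, ' ') }.uniq

puts "#{hits.size} hits, #{unique_hits.size} unique"
```

If exact string equality is what you want, drop the map step and call uniq directly.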

alex-osu3 · August 9, 2014, 7:38pm

Check your code, it works for me. Try the attached Ruby script, run on
this file (extracted from https://twitter.com/master_usmle).

ruby rbforum.rb

It produces the following output:

with </html/body/div//p[@dir='ltr']>:

889 hits

Brachial plexus organization
“The Castrated Dog Turns Rabid”:
· From lateral to medial:
Terminal branches
Cords
Divisions
Trunks
Roots

with </html/body/div//p[@class='ProfileTweet-text js-tweet-text u-dir']>:

833 hits

Brachial plexus organization
“The Castrated Dog Turns Rabid”:
· From lateral to medial:
Terminal branches
Cords
Divisions
Trunks
Roots

There are some nodes with dir='ltr' you may not want, such as

{elem

“Right above it”
}
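The difference in hit counts can be reproduced on a small scale. This is only a sketch: it uses REXML from Ruby's standard distribution rather than Hpricot, the markup is a simplified, well-formed stand-in for the saved page, and the class name ProfileHeaderCard-bio on the third paragraph is invented for illustration.

```ruby
require 'rexml/document'

# Hypothetical, simplified stand-in for the saved page (not real Twitter markup).
html = <<~HTML
  <html><body><div>
    <p class="ProfileTweet-text js-tweet-text u-dir" dir="ltr">Brachial plexus organization</p>
    <p class="ProfileTweet-text js-tweet-text u-dir" dir="ltr">Mole: signs of trouble</p>
    <p class="ProfileHeaderCard-bio u-dir" dir="ltr">Right above it</p>
  </div></body></html>
HTML

doc = REXML::Document.new(html)

# @dir='ltr' matches every paragraph marked as left-to-right text.
by_dir = REXML::XPath.match(doc, "/html/body/div//p[@dir='ltr']")

# @class='...' compares the whole attribute string, so only tweet paragraphs match.
by_class = REXML::XPath.match(
  doc, "/html/body/div//p[@class='ProfileTweet-text js-tweet-text u-dir']"
)

puts "dir:   #{by_dir.size} hits"
puts "class: #{by_class.size} hits"

# Paragraphs that the dir query returns but the class query does not.
extra = by_dir.reject { |node| by_class.include?(node) }
extra.each { |node| puts "only matched by dir: #{node.text}" }
```

So dir describes text direction and is shared by many unrelated elements, while the class value identifies the tweet text more narrowly.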

difference between “View Page Source” and “Inspect Element”?

“View Page Source” just gives you the original HTML as the server sent
it. “Inspect Element” lets you, well, inspect the page in detail. It
shows the HTML after JavaScript, CSS, etc. have been applied, i.e. the
page as it looks right now.
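To make the distinction concrete, here is a minimal sketch that uses REXML as a stand-in for the browser's document tree. The markup, the timeline id, and the loop playing the role of the page's JavaScript are all made up for illustration; a real browser and real Twitter markup behave in far more complicated ways.

```ruby
require 'rexml/document'

# What the server sends: this string is what "View Page Source" shows.
original_source = <<~HTML
  <html><body><div id="timeline">
    <p class="tweet">tweet 1</p>
    <p class="tweet">tweet 2</p>
  </div></body></html>
HTML

# The browser parses the source into a live tree (the DOM).
dom = REXML::Document.new(original_source)
timeline = REXML::XPath.first(dom, "//div[@id='timeline']")

# Stand-in for the page's JavaScript: scrolling appends tweets to the tree only.
(3..5).each do |n|
  paragraph = timeline.add_element('p', 'class' => 'tweet')
  paragraph.text = "tweet #{n}"
end

# The source string is unchanged; the tree now holds more paragraphs.
source_count = original_source.scan(/<p class="tweet">/).size
dom_count = REXML::XPath.match(dom, "//p[@class='tweet']").size

puts "View Page Source: #{source_count} tweets"
puts "Inspect Element:  #{dom_count} tweets"
```

Saving the tree, which is roughly what “Copy HTML” does, is why the local file ends up containing everything that was loaded while scrolling.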

Also,

I see the max twitter can be return is about 800.

https://dev.twitter.com/docs/api/1.1/get/statuses/mentions_timeline

This method can only return up to 800 tweets. See Working with Timelines
for instructions on traversing timelines.

Such timelines can grow very large, so there are limits to how much of a
timeline a client application may fetch in a single request. Applications
must therefore iterate through timeline results in order to build a more
complete list.
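Iterating through a timeline usually means asking for one page, noting the oldest tweet ID seen, and requesting the next page with max_id set just below it. The sketch below shows only that loop. ALL_TWEETS, fetch_page, and fetch_all are hypothetical names, and the in-memory array replaces the authenticated HTTP request a real client would make.

```ruby
# Hypothetical in-memory timeline: tweet IDs 1..25, newest first.
ALL_TWEETS = (1..25).map { |id| { id: id, text: "tweet #{id}" } }.reverse

# Hypothetical fetcher mimicking a timeline endpoint: returns at most `count`
# tweets whose id is <= max_id, newest first.
def fetch_page(count:, max_id: nil)
  candidates = max_id ? ALL_TWEETS.select { |t| t[:id] <= max_id } : ALL_TWEETS
  candidates.first(count)
end

# Walk backwards through the timeline one page at a time.
def fetch_all(page_size: 10)
  collected = []
  max_id = nil
  loop do
    page = fetch_page(count: page_size, max_id: max_id)
    break if page.empty?

    collected.concat(page)
    # Next request: only tweets strictly older than the oldest one seen so far.
    max_id = page.last[:id] - 1
  end
  collected
end

tweets = fetch_all
puts "fetched #{tweets.size} tweets in total"
```

Subtracting one from the oldest ID avoids fetching the same tweet twice at page boundaries.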
