Get the whole site tree with Ruby

Detlef_R · May 12, 2015, 7:43am

Hello!

I need to grab all of a site’s data along with its full tree structure.
Every page has links to its child pages. How do I build the site tree
with Nokogiri? It has to visit pages recursively and scrape all the
directory links, but I can’t work out the full algorithm. How do I do
that?
P.S. I don’t need to “save the whole site to disk with HTTrack”. The
data will be processed and copied to the new, redesigned version of the
original site.

S_SSSSSSSSSSS · May 12, 2015, 12:10pm

At which point did you get stuck?

Simply GET the index page, parse it with Nokogiri, select the tags you
are interested in, extract the URLs from their href attributes, and do a
recursive GET on those URLs.
Each page type should have its own function that performs the GET and
the parsing.

If you have to fetch a large number of pages, you need to store your
crawl state somewhere, such as a database. For example, keep a separate
table of URLs to be parsed (with the URL as a unique key), and mark rows
as “to be parsed” or “already parsed”. Of course, you need to normalize
all URLs to avoid duplicates in the table.
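
A minimal sketch of the two ideas above, using only the standard
library. The names normalize_url and CrawlState are hypothetical, and
the Hash is only an in-memory stand-in for the database table (URL as
the unique key, status as the "to be parsed" / "already parsed" mark).
What counts as "the same URL" is an assumption here: fragments are
dropped and scheme and host are lowercased, nothing more.

```ruby
require 'uri'

# Hypothetical helper: resolve href against the page it was found on and
# reduce it to one canonical string. Returns nil for links we do not
# crawl (mailto:, javascript:, malformed hrefs).
def normalize_url(base, href)
  uri = URI.join(base, href.to_s.strip)
  return nil unless uri.is_a?(URI::HTTP) # URI::HTTPS is a subclass of URI::HTTP
  uri.fragment = nil                     # "#top" points at the same page
  uri.path = '/' if uri.path.empty?
  uri.normalize.to_s                     # lowercases scheme and host
rescue URI::Error
  nil
end

# In-memory stand-in for the table: url => :pending or :parsed.
class CrawlState
  def initialize
    @status = {}
  end

  # Adds the URL as "to be parsed" unless it is already known.
  def enqueue(url)
    return false if url.nil? || @status.key?(url)
    @status[url] = :pending
    true
  end

  # First URL still marked :pending, or nil when the crawl is done.
  def next_pending
    @status.key(:pending)
  end

  def mark_parsed(url)
    @status[url] = :parsed
  end
end

state = CrawlState.new
state.enqueue(normalize_url('http://Example.com/a/b.html', '../c.html#top'))
state.enqueue(normalize_url('http://example.com/a/b.html', '/c.html')) # duplicate, ignored
```

With a real database the same contract would probably map to a unique
index on the url column plus a status column, so a crawl that stops
halfway can resume from the rows still marked as pending.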

You could also have asked in ror2ru.

S_SSSSSSSSSSS · May 12, 2015, 12:20pm

Some time ago I solved a similar problem (though I needed continuous
grabbing) by organizing several workers:
https://medium.com/@vladimir_vg/dsl-74d0fcf03cae
(in Russian)
You probably don’t need anything that complex, but you may get some
ideas from it.

S_SSSSSSSSSSS · May 13, 2015, 2:37am

On May 12, 2015, at 5:21 PM, Роман Ярыгин [email protected] wrote:

I’m stuck exactly on the recursive algorithm. I can’t figure out how to
build that recursive function.

It’s recursion, you call it again…

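require 'nokogiri'
require 'open-uri'
require 'set'

ROOT = 'http://example.com/'  # placeholder root; the original sketch started from '/'
VISITED = Set.new             # the table of visited links

def start
  VISITED << ROOT
  get_subtree(ROOT)
end

def get_subtree(url)
  page = Nokogiri::HTML(URI.open(url))                     # fetch the page and parse it
  page.css('a[href]').each do |a|                          # for each link
    link = URI.join(url, a['href']).to_s.sub(/#.*\z/, '')  # normalize the link
    next unless link.start_with?(ROOT)                     # stay on this site
    next unless VISITED.add?(link)                         # skip if already visited, else record it
    get_subtree(link)
  rescue URI::Error, OpenURI::HTTPError => e
    warn "skipping #{a['href']}: #{e.message}"
  end
end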
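
Since the original question asks for the tree itself and not just a
visit to every page, here is a hedged variant of the same recursion that
returns a nested Hash. To keep it runnable without a network, the page
fetch is replaced by links_for, a hypothetical callable that returns the
already normalized child URLs of a page; the site Hash below is made-up
sample data standing in for a real site.

```ruby
require 'set'

# Returns { child_url => { grandchild_url => ... } } for the given page.
# visited is shared across the whole recursion, so every page appears
# once, under the first parent that reaches it (depth-first order).
def build_tree(url, links_for, visited = Set.new)
  visited << url
  children = {}
  links_for.call(url).each do |link|
    next if visited.include?(link) # breaks cycles such as /about -> /
    children[link] = build_tree(link, links_for, visited)
  end
  children
end

# Made-up link structure used instead of real HTTP requests.
site = {
  '/'             => ['/about', '/docs'],
  '/about'        => ['/'],
  '/docs'         => ['/docs/install', '/about'],
  '/docs/install' => []
}
links_for = ->(url) { site.fetch(url, []) }

tree = { '/' => build_tree('/', links_for) }
# tree == { '/' => { '/about' => {}, '/docs' => { '/docs/install' => {} } } }
```

In a real crawl, links_for would fetch the page and extract its links
with Nokogiri. On a deep site the recursion can get deep as well, so an
explicit queue may be the safer choice there.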

–
Scott R.
[email protected]
http://www.elevated-dev.com/
https://www.linkedin.com/in/scottribe/
(303) 722-0567 voice

S_SSSSSSSSSSS · May 13, 2015, 1:22am

I’m stuck on exactly that recursive function. I can’t figure out how to
write it.

On Tuesday, May 12, 2015 at 20:09:51 UTC+10, Vladimir Gordeev wrote:

S_SSSSSSSSSSS · May 13, 2015, 3:46am

Yeah, thanks. I figured it out. Now I’m stuck on a million other
problems, but that’s another topic =)

On Wednesday, May 13, 2015 at 10:36:34 UTC+10, Scott R. wrote: