Is there a link extractor or similar HTML processing lib for Ruby?


#1

Hi,

In Perl there are a bunch of cool libraries that process HTML files and
help you when you need to extract info. I remember hearing of something
for Ruby as well; if someone has experience with this, it would help if
they could point me in the right direction. Basically I need to extract
links and info from HTML pages.

Thanks.

Zeljko Dakic


#2

On Wed, 2006-03-08 at 02:23 +0900, Desireco wrote:

Hi,

In Perl there are a bunch of cool libraries that process HTML files and
help you when you need to extract info. I remember hearing of something
for Ruby as well; if someone has experience with this, it would help if
they could point me in the right direction. Basically I need to extract
links and info from HTML pages.

Maybe try:

http://www.crummy.com/software/RubyfulSoup/

#3

Desireco wrote:

Thanks.

Zeljko Dakic
http://www.dakic.com

You mean something like this? (quite dirty, but it works)

puts open("some.html").read.scan(/<a href="?(.+?)"?>/)

lopex


#4

Thank you guys. RubyfulSoup looks like what I am after.

Zeljko


#5

James Edward G. II wrote:

Yep, I realized that after seeing the xerces sources :smiley:

lopex


#6

On Mar 7, 2006, at 11:38 AM, Marcin Mielżyński wrote:

You mean something like this? (quite dirty, but it works)

puts open("some.html").read.scan(/<a href="?(.+?)"?>/)

No, it doesn’t, trust me. :wink: Toss a simple "\n" in there and you’re
sunk:
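A minimal illustration of the failure mode (the input string here is my own, purely for demonstration):

```ruby
# The one-liner's pattern expects "<a href" with a literal space,
# so an anchor whose attributes wrap across lines slips through unmatched.
pattern = /<a href="?(.+?)"?>/

html = %Q{<a\nhref="http://example.com/">Example</a>}

p html.scan(pattern)  # => [] -- the link is silently missed
```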

Parsing HTML is hard and you don’t want to use regular expressions to
do it.

James Edward G. II


#7

James G. wrote:

Parsing HTML is hard and you don’t want to use regular expressions to
do it.

Rubyful Soup looks great! I’m going to give it a whirl. And I’ve been
doing it the “hard and you don’t want to use regexp” way all this time!
:slight_smile: Relatively successfully, mind you, but this looks even better.

Gentoo users: I made some renegade ebuilds for Rubyful Soup:

http://www.ebuildexchange.org/catview.php?sh_cat_f=dev-ruby

Pistos


#8

On Mar 7, 2006, at 1:36 PM, Bill K. wrote:

Parsing HTML is hard and you don’t want to use regular expressions
to do it.

Hi, not trying to be argumentative, just surprised. I thought
parsing HTML with regexps was pretty easy. Well, lexing HTML into
tokens, I mean.

There’s a lot of pretty darn ugly HTML out there, my friend. Here’s a
semi-paranoid attempt to grab just the start of an anchor tag:

/<\s*a[^>]+?href\s*=\s*(['"]?)[^'"]+\1?[^>]*>/i

Am I getting close yet? No, the quotes are all wrong. That would
fail to match an extremely common link like:

I would try to fix that, but my brain has already melted and leaked
out my ear. :slight_smile: I’m sure I made other mistakes too.

If you want to capture the name of the link too, this gets much worse!

James Edward G. II


#9

From: “James Edward G. II” removed_email_address@domain.invalid

Parsing HTML is hard and you don’t want to use regular expressions to
do it.

Hi, not trying to be argumentative, just surprised. I thought parsing
HTML with regexps was pretty easy. Well, lexing HTML into tokens, I
mean.

Since there are no recursive structures (that I know of) in the syntax
for an open or closing tag, it seemed reasonably well suited to regexps
to me.
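The lexing Bill describes can indeed be sketched quickly (this is an illustrative toy, not his 19-line lexer): a single alternation splits markup into tag and text tokens.

```ruby
# A toy HTML lexer: split input into tag tokens and text tokens.
# It makes no attempt at the ugly cases James mentions (quotes
# containing ">", etc.) -- it only shows the basic shape.
def lex_html(html)
  html.scan(/<[^>]+>|[^<]+/).map do |tok|
    tok.start_with?("<") ? [:tag, tok] : [:text, tok]
  end
end

p lex_html('<p>Hello <b>world</b></p>')
# => [[:tag, "<p>"], [:text, "Hello "], [:tag, "<b>"],
#     [:text, "world"], [:tag, "</b>"], [:tag, "</p>"]]
```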

. . . Heheh, or maybe the passage of time has given the memories a
rosy glow. I just looked up the last HTML lexer I wrote, 5 years ago,
and it’s 19 lines of regexp. Admittedly it’s a very clean 19 lines, but
still, lengthier than I remembered… :slight_smile:

Regards,

Bill


#10

James G. wrote:

Am I getting close yet? No, the quotes are all wrong. That would
fail to match an extremely common link like:

If you want to capture the name of the link too, this gets much worse!

I see what you’re getting at: If you’re trying to do
generally-applicable parsing, I suppose you’re headed for a world of
hurt. But all I’ve ever done is page- or site-specific scraping, and
never really considered it a big deal. A few regexps here, a few .scans
there, and you’re done…

Pistos


#11

On Mar 7, 2006, at 2:12 PM, Pistos C. wrote:

A few regexps here, a few .scans there, and you’re done…

Or you can load RubyfulSoup and call find() a few times. About the
same effort, but a lot safer, eh? :wink:

James Edward G. II


#12

Desireco wrote:

Thanks.

Zeljko Dakic
http://www.dakic.com

class String
  def xtag(s)
    result = []
    scan( %r!
          < #{s} (?: \s+ ( [^>]* ) )? / >
          |
          < #{s} (?: \s+ ( [^>]* ) )? >
          ( .*? ) </ #{s} >
          !mix ) { |unpaired, attr, data|
      h = { }
      ( unpaired || attr || "" ).
        scan( %r{ ( \w+ ) \s*
                  = \s*
                  (?: ( ["'] ) ( .*? ) \2 | ( \S+ ) )
                }x ) { |k,q,v,v2|
          h[k.downcase] = (v || v2) }
      block_given? ? ( yield [ h, data ] ) : result << [ h, data ]
    }
    result
  end
end

DATA.read.xtag('a'){|atr,txt| puts atr['href'], txt }

__END__
foo bar
foo bar

upcoming HTML 3.2 reference. All the
is A, with the attribute HREF.
Help |


#13

gem install mechanize

require 'mechanize'
browser = WWW::Mechanize.new
url = "http://www.eineseite.de"
page = browser.get url
page.links.each do |link|
  puts "#{url}#{link.href}"
end

Also take a look at the html tokenizer gem.

Desireco wrote:


#14

Gregor K. wrote:

take a look at the html tokenizer gem

or do a gem search html :wink:


#15

On Mar 7, 2006, at 7:48 PM, William J. wrote:

    ( unpaired || attr || "" ).

DATA.read.xtag('a'){|atr,txt| puts atr['href'], txt }

__END__
foo bar
foo bar

upcoming HTML 3.2 reference. All the
is A, with the attribute HREF.
Help |

Javascript Link

James Edward G. II


#16

James Edward G. II wrote:

Javascript Link

class String
  def xtag(str)
    result = [] ; re =
      %r{ < #{str} (?: \s+ ( (?> [^>"/]* (?> "[^"]*" )? )* ) )? }xi
    scan( %r{ #{re} / > | #{re} > ( .*? ) </ #{str} >
            }mix ) { |unpaired, attr, data|
      h = { }
      ( unpaired || attr || "" ).
        scan( %r{ ( \w+ ) \s*
                  = \s*
                  (?: ( ["'] ) ( .*? ) \2 | ( \S+ ) )
                }mx ) { |k,q,v,v2|
          h[k.downcase] = (v || v2) }
      block_given? ? ( yield [ h, data ] ) : result << [ h, data ]
    }
    result
  end
end

DATA.read.xtag('a'){|atr,txt| puts "-"*9; p atr['href']; puts txt }

__END__
foo bar
foo bar

upcoming HTML 3.2 reference. All the
is A, with the attribute HREF.
Help |
Javascript Link
<a name = "foo-bar"
href = "if (foo_bar > 14)
{ fluct() }"

Javascript “circumlocutory” Link


#17

“Bill K.” removed_email_address@domain.invalid writes:

Parsing HTML is hard and you don’t want to use regular expressions
to do it.

Hi, not trying to be argumentative, just surprised. I thought parsing
HTML with regexps was pretty easy. Well, lexing HTML into tokens, I
mean.

Lex, yes. Scrape in general, no.

(And those who think that’s BS, please have a look at REXML.)
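For well-formed markup, the structured route really is short with the standard library's REXML (a minimal sketch with made-up input; on real-world tag soup REXML will raise a parse error, which is exactly why forgiving parsers like Rubyful Soup exist):

```ruby
require 'rexml/document'

# Extract all href attributes from a well-formed (X)HTML fragment.
# REXML is an XML parser, so this only works on valid markup.
xhtml = '<html><body><a href="http://example.com/">Example</a>' \
        '<a href="/docs">Docs</a></body></html>'

doc   = REXML::Document.new(xhtml)
links = REXML::XPath.match(doc, '//a').map { |a| a.attributes['href'] }
p links  # => ["http://example.com/", "/docs"]
```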