Hi,
in cool Perl there are a bunch of libraries that process html files and
help you when you need to extract info. I remember hearing something
for Ruby as well, if someone had experience with this, it would help me
if he could point me in right direction. Basically I need to extract
links and info from html pages.
Thanks.
Zeljko Dakic
On Wed, 2006-03-08 at 02:23 +0900, Desireco wrote:
Maybe try:
http://www.crummy.com/software/RubyfulSoup/
Desireco wrote:
You meant something like this? (quite dirty but works)
puts open("some.html").read.scan(/<a href="?(.+?)"?>/)
lopex
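For a remote page rather than a local file, the same one-liner works through open-uri (a sketch; the address is just a placeholder):
require 'open-uri'
puts URI.open("http://example.com/").read.scan(/<a href="?(.+?)"?>/)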
Thank you guys. RubyfulSoup looks like what I am after.
Zeljko
James Edward G. II wrote:
Yep, I realized that after seeing the Xerces sources.
lopex
On Mar 7, 2006, at 11:38 AM, Marcin Mielżyński wrote:
You meant something like this? (quite dirty but works)
puts open(“some.html”).read.scan(/<a href="?(.+?)"?>/)
No, it doesn't, trust me. Toss a simple "\n" in there and you're sunk:
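A minimal illustration (assuming an anchor whose attributes wrap onto a second line):
html = %Q{<a\nhref="some.html">a link</a>}   # attribute pushed to its own line
p html.scan(/<a href="?(.+?)"?>/)            # => [] -- the pattern insists on a literal space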
Parsing HTML is hard and you don’t want to use regular expressions to
do it.
James Edward G. II
James G. wrote:
Parsing HTML is hard and you don’t want to use regular expressions to
do it.
Rubyful Soup looks great! I’m going to give it a whirl. And I’ve been
doing it the “hard and you don’t want to use regexp” way all this time!
Relatively successfully, mind you, but this looks even better.
Gentoo users: I made some renegade ebuilds for Rubyful Soup:
http://www.ebuildexchange.org/catview.php?sh_cat_f=dev-ruby
Pistos
On Mar 7, 2006, at 1:36 PM, Bill K. wrote:
Hi, not trying to be argumentative, just surprised. I thought
parsing HTML with regexps was pretty easy. Well, lexing HTML into
tokens, I mean.
There's a lot of pretty darn ugly HTML out there, my friend. Here's a
semi-paranoid attempt to grab just the start of an anchor tag:
/<\s*a[^>]*?href\s*=\s*(['"]?)[^'"]*\1?[^>]*>/i
Am I getting close yet? No, the quotes are all wrong. That would
fail to match an extremely common link like:
I would try to fix that, but my brain has already melted and leaked
out my ear. I’m sure I made other mistakes too.
If you want to capture the name of the link too, this gets much worse!
James Edward G. II
From: “James Edward G. II” [email protected]
Parsing HTML is hard and you don’t want to use regular expressions to
do it.
Hi, not trying to be argumentative, just surprised. I thought parsing
HTML with regexps was pretty easy. Well, lexing HTML into tokens, I mean.
Since there are no recursive structures (that I know of) in the syntax for
an open or closing tag, it seemed reasonably well suited to regexps to me.
... Heheh, or maybe the passage of time has given the memories a
rosy glow. I just looked up the last HTML lexer I wrote, 5 years ago,
and it's 19 lines of regexp. Admittedly it's a very clean 19 lines, but still,
lengthier than I remembered…
Regards,
Bill
James G. wrote:
Am I getting close yet? No, the quotes are all wrong. That would
fail to match an extremely common link like:
If you want to capture the name of the link too, this gets much worse!
I see what you’re getting at: If you’re trying to do
generally-applicable parsing, I suppose you’re headed for a world of
hurt. But all I’ve ever done is page- or site-specific scraping, and
never really considered it a big deal. A few regexps here, a few .scans
there, and you’re done…
Pistos
On Mar 7, 2006, at 2:12 PM, Pistos C. wrote:
A few regexps here, a few .scans there, and you’re done…
Or you can load RubyfulSoup and call find() a few times. About the
same effort, but a lot safer, eh?
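Something along these lines (a rough sketch; it assumes the rubyful_soup gem and its BeautifulSoup class, which mirror the Python Beautiful Soup API):
require 'rubygems'
require 'rubyful_soup'            # assumed gem / require name

soup = BeautifulSoup.new(File.read("some.html"))
soup.find_all('a').each do |tag|  # every anchor, however messy the markup
  puts tag['href']
end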
James Edward G. II
Desireco wrote:
class String
  def xtag(s)
    result = []
    scan( %r!
      < #{s} (?: \s+ ( [^>]* ) )? / >
      |
      < #{s} (?: \s+ ( [^>]* ) )? >
      ( .*? ) </ #{s} >
    !mix ) { |unpaired, attr, data|
      h = { }
      # pick apart the attribute string into a name => value hash
      ( unpaired || attr || "" ).
        scan( %r{ ( \w+ ) \s* = \s*
                  (?: ( ["'] ) ( .*? ) \2 | ( \S+ ) )
                }x ) { |k, q, v, v2| h[k.downcase] = (v || v2) }
      block_given? ? ( yield [ h, data ] ) : result << [ h, data ]
    }
    result
  end
end

DATA.read.xtag('a') { |atr, txt| puts atr['href'], txt }

__END__
foo bar
foo bar
upcoming HTML 3.2 reference . All the
is A , with the attribute HREF.
Help |
gem install mechanize
require 'mechanize'
browser = WWW::Mechanize.new
url = "http://www.eineseite.de"
page = browser.get url
page.links.each do |link|
  puts "#{url}#{link.href}"
end
Also take a look at the html tokenizer from gems.
Desireco wrote:
On Mar 7, 2006, at 7:48 PM, William J. wrote:
( unpaired || attr || "" ).
DATA.read.xtag('a'){|atr,txt| puts atr['href'], txt }
__END__
foo bar
foo bar
upcoming HTML 3.2 reference . All the
is A , with the attribute HREF.
Help |
Javascript Link
James Edward G. II
James Edward G. II wrote:
!mix ) \
end
is A , with the attribute HREF.
Help |
Javascript Link
class String
  def xtag(str)
    result = []
    # attribute chunk: an atomic group that lets quoted values contain '>' or '/'
    re = %r{ < #{str} (?: \s+ ( (?> [^>"/]* (?> "[^"]*" )? )* ) )? }xi
    scan( %r{ #{re} / > | #{re} > ( .*? ) </ #{str} > }mix ) { |unpaired, attr, data|
      h = { }
      ( unpaired || attr || "" ).
        scan( %r{ ( \w+ ) \s* = \s*
                  (?: ( ["'] ) ( .*? ) \2 | ( \S+ ) )
                }mx ) { |k, q, v, v2| h[k.downcase] = (v || v2) }
      block_given? ? ( yield [ h, data ] ) : result << [ h, data ]
    }
    result
  end
end

DATA.read.xtag('a') { |atr, txt| puts "-" * 9; p atr['href']; puts txt }

__END__
foo bar
foo bar
upcoming HTML 3.2 reference . All the
is A , with the attribute HREF.
Help |
Javascript Link
<a name = "foo-bar"
   href = "if (foo_bar > 14)
     { fluct() }"
>Javascript "circumlocutory" Link</a>
“Bill K.” [email protected] writes:
Hi, not trying to be argumentative, just surprised. I thought parsing
HTML with regexps was pretty easy. Well, lexing HTML into tokens, I mean.
Lex, yes. Scrape in general, no.
(And those who think that’s BS, please have a look at REXML.)
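If the pages happen to be well-formed XHTML, a strict parser like REXML can pull the links out directly; feed it typical tag soup and it refuses instead. A small sketch with a made-up snippet:
require 'rexml/document'

xhtml = '<p>See the <a href="spec.html">upcoming HTML 3.2 reference</a> and <a href="help.html">Help</a>.</p>'
doc = REXML::Document.new(xhtml)   # raises REXML::ParseException on malformed markup
REXML::XPath.each(doc, "//a") { |a| puts a.attributes["href"] }
# => spec.html
#    help.html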