Forum: Ruby Is there link extractor or similar html processing libs for

Desireco (Guest)
on 2006-03-07 18:26
(Received via mailing list)
Hi,

In cool Perl there are a bunch of libraries that process HTML files and
help you when you need to extract info. I remember hearing about
something similar for Ruby. If anyone has experience with this, it
would help me if they could point me in the right direction. Basically,
I need to extract links and info from HTML pages.

Thanks.


Zeljko Dakic
http://www.dakic.com
Ross Bamford (Guest)
on 2006-03-07 18:35
(Received via mailing list)
On Wed, 2006-03-08 at 02:23 +0900, Desireco wrote:
> Hi,
>
> in cool Perl there are a bunch of libraries that process html files and
> help you when you need to extract info. I remember hearing something
> for Ruby as well, if someone had experience with this, it would help me
> if he could point me in right direction. Basically I need to extract
> links and info from html pages.

Maybe try:

	http://www.crummy.com/software/RubyfulSoup/
Marcin Mielżyński (Guest)
on 2006-03-07 18:41
(Received via mailing list)
Desireco wrote:
> Thanks.
>
>
> Zeljko Dakic
> http://www.dakic.com
>

Did you mean something like this? (Quite dirty, but it works.)

puts open("some.html").read.scan(/<a href="?(.+?)"?>/)

lopex
James Gray (bbazzarrakk)
on 2006-03-07 18:47
(Received via mailing list)
On Mar 7, 2006, at 11:38 AM, Marcin Mielżyński wrote:

>
> You meant something like this ? (quite dirty but works)
>
> puts open("some.html").read.scan(/<a href="?(.+?)"?>/)

No, it doesn't, trust me.  ;)  Toss a simple "\n" in there and you're
sunk:

<a
  href="whatever">

Parsing HTML is hard and you don't want to use regular expressions to
do it.

James Edward Gray II
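
The newline problem is easy to reproduce. The sketch below runs the
quick pattern from the earlier post against the same tag written on one
line and on two; the constant names are only for illustration.

```ruby
# The quick pattern posted earlier in the thread.
LINK_RE = /<a href="?(.+?)"?>/

ONE_LINE  = '<a href="whatever">'
TWO_LINES = "<a\n  href=\"whatever\">"

p ONE_LINE.scan(LINK_RE)   # => [["whatever"]]
p TWO_LINES.scan(LINK_RE)  # => [] (the pattern insists on "<a " plus "href")
```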
Desireco (Guest)
on 2006-03-07 19:36
(Received via mailing list)
Thank you guys. RubyfulSoup looks like what I am after.

Zeljko
Marcin Mielżyński (Guest)
on 2006-03-07 20:00
(Received via mailing list)
James Edward Gray II wrote:

> James Edward Gray II
>
>
>


Yep, I realized that after seeing xerces sources :D

lopex
Bill Kelly (Guest)
on 2006-03-07 20:37
(Received via mailing list)
From: "James Edward Gray II" <james@grayproductions.net>
>
> Parsing HTML is hard and you don't want to use regular expressions to
> do it.

Hi, not trying to be argumentative, just surprised.  I thought parsing
HTML with regexps was pretty easy.  Well, lexing HTML into tokens, I
mean.

Since there are no recursive structures (that I know of) in the syntax
for an open or closing tag, it seemed reasonably well suited to regexps
to me.

 . . . . Heheh, or maybe the passage of time has given the memories a
rosy glow.  I just looked up the last HTML lexer I wrote, 5 years ago,
and it's 19 lines of regexp.  Admittedly it's a very clean 19 lines,
but still, lengthier than I remembered....  :)


Regards,

Bill
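
For a rough idea of what lexing HTML into tokens with regexps can look
like, here is a minimal sketch using StringScanner from the standard
library. It is not the 19-line lexer mentioned above; the names TAG_RE
and lex_html are made up for this example, and it makes no attempt to
handle comments, CDATA or script bodies.

```ruby
require 'strscan'

# Hypothetical tag pattern: an opening or closing tag whose quoted
# attribute values may themselves contain '>'.
TAG_RE = %r{</?[a-zA-Z][a-zA-Z0-9]*(?:"[^"]*"|'[^']*'|[^'">])*>}

# Splits html into [:tag, string] and [:text, string] tokens.
def lex_html(html)
  scanner = StringScanner.new(html)
  tokens  = []
  until scanner.eos?
    if (tag = scanner.scan(TAG_RE))
      tokens << [:tag, tag]
    elsif (text = scanner.scan(/[^<]+/))
      tokens << [:text, text]
    else
      # A '<' that does not start a recognizable tag is kept as text.
      tokens << [:text, scanner.getch]
    end
  end
  tokens
end

p lex_html('<a href="if (x > 5) { f() }">JS</a>')
```

This only produces a flat token stream. Deciding which tokens belong
together, and coping with unclosed or misnested tags, is the part a
real parser takes care of.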
Pistos Christou (pistos)
on 2006-03-07 21:03
James Gray wrote:
> Parsing HTML is hard and you don't want to use regular expressions to
> do it.

Rubyful Soup looks great!  I'm going to give it a whirl.  And I've been
doing it the "hard and you don't want to use regexp" way all this time!
:)  Relatively successfully, mind you, but this looks even better.

Gentoo users: I made some renegade ebuilds for Rubyful Soup:

http://www.ebuildexchange.org/catview.php?sh_cat_f=dev-ruby

Pistos
James Gray (bbazzarrakk)
on 2006-03-07 21:07
(Received via mailing list)
On Mar 7, 2006, at 1:36 PM, Bill Kelly wrote:

>> to  do it.
>
> Hi, not trying to be argumentative, just surprised.  I thought
> parsing HTML with regexps was pretty easy.  Well, lexing HTML into
> tokens, I mean.

There's a lot of pretty darn ugly HTML out there, my friend.  Here's a
semi-paranoid attempt to grab just the start of an anchor tag:

/<\s*a[^>]*?href\s*=\s*(['"]?)[^'"]*\1?[^>]*>/i

Am I getting close yet?  No, the quotes are all wrong.  That would
mishandle an extremely common link like:

<a href="alert('You broke it!')">

I would try to fix that, but my brain has already melted and leaked
out my ear.  :)  I'm sure I made other mistakes too.

If you want to capture the name of the link too, this gets *much* worse!

James Edward Gray II
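
One way to see the quoting problem is to add a capture group around the
attribute value. That second group is an addition for illustration
only; the pattern in the post captures nothing but the opening quote.
In Ruby the tag as a whole still seems to match, since `\1?` is
optional, but the captured value stops at the first inner quote.

```ruby
# The pattern from the post, plus a second capture group around the
# value (the extra group is an assumption added for this example).
ANCHOR_RE = /<\s*a[^>]*?href\s*=\s*(['"]?)([^'"]*)\1?[^>]*>/i

TAG = %q{<a href="alert('You broke it!')">}

match = ANCHOR_RE.match(TAG)
p match[2]  # => "alert("
```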
Pistos Christou (pistos)
on 2006-03-07 21:12
James Gray wrote:
> Am I getting close yet?  No, the quotes are all wrong.  That would
> fail to match an extremely common link like:
>
> If you want to capture the name of the link too, this gets *much* worse!

I see what you're getting at: If you're trying to do
generally-applicable parsing, I suppose you're headed for a world of
hurt.  But all I've ever done is page- or site-specific scraping, and
never really considered it a big deal.  A few regexps here, a few .scans
there, and you're done...

Pistos
James Gray (bbazzarrakk)
on 2006-03-07 21:20
(Received via mailing list)
On Mar 7, 2006, at 2:12 PM, Pistos Christou wrote:

> A few regexps here, a few .scans there, and you're done...

Or you can load RubyfulSoup and call find() a few times.  About the
same effort, but a *lot* safer, eh?  ;)

James Edward Gray II
William James (Guest)
on 2006-03-08 02:51
(Received via mailing list)
Desireco wrote:
> Thanks.
>
>
> Zeljko Dakic
> http://www.dakic.com

class String
  def xtag(s)
    result = []
    scan( %r!
              < #{s}  (?: \s+ (  [^>]*  )  )? / >
              |
              < #{s}  (?: \s+ (  [^>]*  )  )? >
              ( .*? )  </ #{s} >
          !mix )  \
      { |unpaired, attr, data|   h = { }
        ( unpaired || attr || "" ).
        scan( %r{  ( \w+ ) \s* = \s*
                   (?: ( ["'] ) ( .*? ) \2  |  ( \S+ )  )
                }x ) { |k,q,v,v2|
          h[k.downcase] = (v || v2) }
        block_given? ? ( yield [ h, data ] ) : result << [ h, data ]
      }
    result
  end
end

DATA.read.xtag('a'){|atr,txt| puts atr['href'], txt }

__END__
  <a
  href = "alert('Junior broke it!')" >foo bar</a>
  <a
  href = www.foo.bar >foo bar
  </a>
upcoming <A HREF="./">HTML 3.2 reference</A>. All the
is <A HREF='./special/a.html'>A</A>, with the attribute HREF.
	<a target="_blank" href="/support?hl=en">Help</a> |
Gregor Kopp (Guest)
on 2006-03-08 10:06
(Received via mailing list)
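gem install mechanize

require 'mechanize'
# Newer Mechanize releases name this class Mechanize (no WWW:: prefix).
browser = WWW::Mechanize.new
url = "http://www.eineseite.de"
page = browser.get url
# Prefixing url only gives a usable link for hrefs relative to the site root.
page.links.each do |link|
   puts "#{url}#{link.href}"
end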


Also take a look at the html tokenizer from gems.

Desireco wrote:
Gregor Kopp (Guest)
on 2006-03-08 10:09
(Received via mailing list)
Gregor Kopp wrote:

>
> take also a look at html tokenizer from gems
>

Or run: gem search html ;)
James Gray (bbazzarrakk)
on 2006-03-08 15:03
(Received via mailing list)
On Mar 7, 2006, at 7:48 PM, William James wrote:

>         ( unpaired || attr || "" ).
> DATA.read.xtag('a'){|atr,txt| puts atr['href'], txt }
>
> __END__
>   <a
>   href = "alert('Junior broke it!')" >foo bar</a>
>   <a
>   href = www.foo.bar >foo bar
>   </a>
> upcoming <A HREF="./">HTML 3.2 reference</A>. All the
> is <A HREF='./special/a.html'>A</A>, with the attribute HREF.
> 	<a target="_blank" href="/support?hl=en">Help</a> |

<a href="if (my_var > 5) { whatever() }">Javascript Link</a>

James Edward Gray II
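
The reason this link is a problem: the attribute part of the quoted
code is matched with `[^>]*`, so it ends at the first `>` even when
that character sits inside a quoted value. A reduced sketch of just
that piece (NAIVE_ATTRS is a simplified stand-in, not the full xtag
pattern):

```ruby
JS_LINK = '<a href="if (my_var > 5) { whatever() }">Javascript Link</a>'

# Attribute text taken as "everything up to the first >".
NAIVE_ATTRS = /<a\s+([^>]*)>/i

p JS_LINK[NAIVE_ATTRS, 1]  # => "href=\"if (my_var "
```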
Christian Neukirchen (Guest)
on 2006-03-08 18:37
(Received via mailing list)
"Bill Kelly" <billk@cts.com> writes:

>> to  do it.
>
> Hi, not trying to be argumentative, just surprised.  I thought parsing
> HTML with regexps was pretty easy.  Well, lexing HTML into tokens, I
> mean.

Lex, yes.  Scrape in general, no.

(And those who think that's BS, please have a look at REXML.)
William James (Guest)
on 2006-03-10 09:35
(Received via mailing list)
James Edward Gray II wrote:
> >           !mix )  \
> > end
> > is <A HREF='./special/a.html'>A</A>, with the attribute HREF.
> > 	<a target="_blank" href="/support?hl=en">Help</a> |
>
> <a href="if (my_var > 5) { whatever() }">Javascript Link</a>

class String
  def xtag(str)
    result = []  ;  re =
     %r{ < #{str} (?: \s+ (  (?> [^>"/]* (?> "[^"]*" )? )*  ) )? }xi
    scan( %r{   #{re} / >   |   #{re}   >   ( .*? )  </ #{str} >
            }mix )  \
      { |unpaired, attr, data|   h = { }
        ( unpaired || attr || "" ).
        scan( %r{  ( \w+ ) \s* = \s*
                   (?: ( ["'] ) ( .*? ) \2  |  ( \S+ )  )
                }mx ) { |k,q,v,v2|
          h[k.downcase] = (v || v2) }
        block_given? ? ( yield [ h, data ] ) : result << [ h, data ]
      }
    result
  end
end

DATA.read.xtag('a'){|atr,txt| puts "-"*9; p atr['href']; puts txt }

__END__
  <a
  href = "alert('Junior broke it!')" >foo bar</a>
  <a
  href = www.foo.bar >foo bar
  </a>
upcoming <A HREF="./">HTML 3.2 reference</A>. All the
is <A HREF="./special/a.html">A</A>, with the attribute HREF.
	<a target="_blank" href="/support?hl=en">Help</a> |
<a href="if (my_var > 5) { whatever() }">Javascript Link</a>
<a   name = "foo-bar"
  href = "if (foo_bar > 14)
    { fluct() }"
  >Javascript "circumlocutory" Link</a>