Forum: Ruby Hpricot elem index/position

henryturnerlists@googlemail.com (Guest)
on 2008-10-06 18:12
(Received via mailing list)
Hey,

I'm trying to find the string index of an Hpricot::Elem within the
source of its doc. For example:

doc = Hpricot("<a>bob</a><a>james</a><a>dan</a>")
elem = doc.search("a")[1]
elem.start #=> 10 (the first '<' of the second <a> tag)

and eventually the following would be good..

elem.length #=> 12
elem.end #=> 21

Any thoughts appreciated!
Henners
Mark Thomas (Guest)
on 2008-10-06 22:55
(Received via mailing list)
On Oct 6, 10:19 am, "henryturnerli...@googlemail.com"
<henryturnerli...@googlemail.com> wrote:
>
> elem.length #=> 12
> elem.end #=> 21
>
> Any thoughts appreciated!
> Henners

My first thought is: Why do you want that information? Character
position is meaningless in an XML or HTML DOM. Whitespace can change
character positions without affecting the DOM at all.

-- Mark.
henryturnerlists@googlemail.com (Guest)
on 2008-10-07 10:00
(Received via mailing list)
Hi Mark,

I'm writing a broken-link reporting tool. When I find a dodgy tag
I'd like to be able to relay the character position and/or line number
to the user. Useful for debugging.

Thanks -h
Mark Thomas (Guest)
on 2008-10-07 15:57
(Received via mailing list)
On Oct 7, 3:58 am, "henryturnerli...@googlemail.com"
<henryturnerli...@googlemail.com> wrote:
> Hi Mark,
>
> I'm writing a broken link reporting type tool. When I find a dodgy tag
> I'd like to be able to relay the character position and or line number
> to the user. Useful for debugging.

So, are you really interested in broken *links* (as in a GET does not
return a 200 result code) or broken HTML? I have done the former via
AJAX (jQuery sends links to a backend Rails action, and if a link is
broken it changes the class of the link to display a red background).
The latter can probably be done with libxml, which reports the
character position of broken input.

-- Mark.
henryturnerlists@googlemail.com (Guest)
on 2008-10-07 16:30
(Received via mailing list)
Well, I suppose there are incorrectly formatted links too... I was
talking about correctly formatted links that point to a 400+ status
code resource. Something libxml would not pick up since I guess you're
talking about its syntax checking bit.

Since the entire document is accessible from the Hpricot::Elem it
seems plausible to count the characters up to and after the element. A
15min look at the source didn't reveal anything obvious.. Have a nasty
feeling that this type of thing would have to be done in the compiled
C section of it..
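
A minimal sketch of that counting idea, working on the raw HTML string
rather than through Hpricot. The helper name tag_positions is
hypothetical and not part of any library, and the regex assumes
well-formed, non-nested tags, so treat it as an illustration rather
than a robust parser:

```ruby
# Hypothetical helper: find every <tag ...>...</tag> element in a raw
# HTML string and report where it sits in the source text.
# Assumption: tags are well formed and not nested inside themselves.
def tag_positions(html, tag = 'a')
  name = Regexp.escape(tag)
  pattern = /<#{name}\b[^>]*>.*?<\/#{name}>/mi
  positions = []
  html.scan(pattern) do
    m = Regexp.last_match
    start = m.begin(0)
    positions << {
      start: start,                           # index of the opening '<'
      length: m[0].length,                    # characters in the whole element
      end: m.end(0) - 1,                      # index of the closing '>'
      line: html[0...start].count("\n") + 1   # 1-based line number
    }
  end
  positions
end

second = tag_positions("<a>bob</a><a>james</a><a>dan</a>")[1]
puts "start=#{second[:start]} length=#{second[:length]} end=#{second[:end]}"
# prints "start=10 length=12 end=21"
```

The offsets describe the source text only. As noted earlier in the
thread, they say nothing about the DOM, and reformatting the markup
changes them.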
Mark Thomas (Guest)
on 2008-10-07 19:40
(Received via mailing list)
On Oct 7, 10:28 am, "henryturnerli...@googlemail.com"
<henryturnerli...@googlemail.com> wrote:
> Well, I suppose there are incorrectly formatted links too... I was
> talking about correctly formatted links that point to a 400+ status
> code resource. Something libxml would not pick up since I guess you're
> talking about its syntax checking bit.

Well, libxml stores the line number of every element. So you can
extract all links, check them, and print out element.line_num for each
one that fails the check.

Here's some starter code:

#----------------------------------------------

require 'rubygems'
require 'xml'

XML::Parser.default_line_numbers = true

html = <<END_HTML
  <html>
  <head><title>test</title></head>
  <body>
    Here is a <a href="http://brok.en">broken link.</a>
  </body>
  </html>
END_HTML

parser = XML::Parser.string html
doc = parser.parse

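# Placeholder check: treats every link as broken so the reporting
# loop below can be tried out. Replace the body with a real test.
def broken?(link)
  true
end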

doc.find("//a[@href]").each do |link|
  if broken?(link)
    puts "Broken link to #{link['href']} on line #{link.line_num}"
  end
end
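
One way the placeholder check might be filled in, sketched with Ruby's
standard Net::HTTP. The names broken_url? and HEAD_STATUS are
hypothetical, the 5-second timeouts are an arbitrary choice, and
redirects are not followed. The status lookup is passed in as a
callable so the logic can be exercised without touching the network:

```ruby
require 'net/http'
require 'uri'

# Hypothetical default lookup: send a HEAD request and return the
# numeric status code. Redirects are not followed in this sketch.
HEAD_STATUS = lambda do |url|
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https',
                  open_timeout: 5, read_timeout: 5) do |http|
    http.head(uri.request_uri).code.to_i
  end
end

# A URL counts as broken when the status is 400 or above, or when the
# request fails outright (DNS failure, refused connection, timeout,
# or a URL that is not HTTP/HTTPS at all).
def broken_url?(url, fetch_status = HEAD_STATUS)
  fetch_status.call(url) >= 400
rescue StandardError
  true
end

# Stubbed lookup, so this line does not hit the network:
puts broken_url?("http://brok.en", ->(_url) { 404 })
# prints "true"
```

With this in place, the stub above could delegate to
broken_url?(link['href']). Some servers reject HEAD requests, so a
fallback to GET may be needed in practice.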
Mark Thomas (Guest)
on 2008-10-08 05:00
(Received via mailing list)
On Oct 7, 1:36 pm, I wrote:
> Well, libxml stores the line number of every element. So you can
> extract all links, check them, and print out element.line_num for each
> one that fails the check.

Oops, my example mistakenly used the XML parser, so replace that with
XML::HTMLParser since you are parsing HTML.

-- Mark.
henryturnerlists@googlemail.com (Guest)
on 2008-10-08 21:08
(Received via mailing list)
Thanks for the pointer to libxml-ruby! I didn't even know it
existed. I can't see anything for character position, but I'm very
happy with line numbers. Will have a go at implementing the character
position myself when possible.

cheers
-h