Forum: Ruby Hpricot elem index/position

(Guest)
on 2008-10-06 20:12
(Received via mailing list)
Hey,

I'm trying to find the string index of an Hpricot::Elem within its document.
For example:

doc = Hpricot("<a>bob</a><a>james</a><a>dan</a>")
elem = doc.search("a")[1]
elem.start #=> 10 (the first '<' of the second a tag)

Eventually the following would be good too:

elem.length #=> 12
elem.end #=> 21

Any thoughts appreciated!
Henners
Mark T. (Guest)
on 2008-10-07 00:55
(Received via mailing list)
On Oct 6, 10:19 am, "removed_email_address@domain.invalid"
<removed_email_address@domain.invalid> wrote:
>
> elem.length #=> 12
> elem.end #=> 21
>
> Any thoughts appreciated!
> Henners

My first thought is: Why do you want that information? Character
position is meaningless in an XML and HTML DOM. Whitespace can change
character positions without affecting the DOM at all.

-- Mark.
(Guest)
on 2008-10-07 12:00
(Received via mailing list)
Hi Mark,

I'm writing a broken-link reporting tool. When I find a dodgy tag
I'd like to be able to relay the character position and/or line number
to the user. That would be useful for debugging.

Thanks -h
Mark T. (Guest)
on 2008-10-07 17:57
(Received via mailing list)
On Oct 7, 3:58 am, "removed_email_address@domain.invalid"
<removed_email_address@domain.invalid> wrote:
> Hi Mark,
>
> I'm writing a broken-link reporting tool. When I find a dodgy tag
> I'd like to be able to relay the character position and/or line number
> to the user. That would be useful for debugging.

So, are you really interested in broken *links* (as in a GET does not
return a 200 result code) or broken HTML? I have done the former via
AJAX (jQuery sends links to a backend Rails action, and if a link is
broken, changes its class to display a red background). The
latter may be possible with libxml, which reports the character
position of broken input.

-- Mark.
(Guest)
on 2008-10-07 18:30
(Received via mailing list)
Well, I suppose there are incorrectly formatted links too... I was
talking about correctly formatted links that point to a 400+ status
code resource. Something libxml would not pick up since I guess you're
talking about its syntax checking bit.

Since the entire document is accessible from the Hpricot::Elem, it
seems plausible to count the characters up to and after the element. A
15-minute look at the source didn't reveal anything obvious. I have a
nasty feeling that this type of thing would have to be done in the
compiled C section of it.
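
A rough sketch of the "count the characters" idea, working on the raw source string rather than on the parsed document. This is not Hpricot API: element_span is a hypothetical helper, and the regular expression is only an approximation that assumes well-formed, non-nested tags of the given name. It reports the start offset, length, last offset and line number of the nth matching element.

```ruby
# Hypothetical helper, not part of Hpricot: find the nth (0-based)
# <tag ...>...</tag> in the raw source and report where it sits.
# Assumes tags of this name are well formed and not nested.
def element_span(source, tag, n)
  name = Regexp.escape(tag)
  pattern = /<#{name}\b[^>]*>.*?<\/#{name}>/m
  count = 0
  source.scan(pattern) do
    match = Regexp.last_match
    if count == n
      first = match.begin(0)
      return {
        start: first,                                # offset of the opening '<'
        length: match[0].length,                     # characters in the whole element
        stop: match.end(0) - 1,                      # offset of the closing '>'
        line: source[0...first].count("\n") + 1      # 1-based line number
      }
    end
    count += 1
  end
  nil # fewer than n + 1 matching elements
end

span = element_span("<a>bob</a><a>james</a><a>dan</a>", "a", 1)
puts "start=#{span[:start]} length=#{span[:length]} stop=#{span[:stop]}"
```

For the example document from the first post this gives start 10, length 12 and stop 21, matching the values asked for there. Because it never consults the parser, it can disagree with the DOM whenever the markup is malformed.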
Mark T. (Guest)
on 2008-10-07 21:40
(Received via mailing list)
On Oct 7, 10:28 am, "removed_email_address@domain.invalid"
<removed_email_address@domain.invalid> wrote:
> Well, I suppose there are incorrectly formatted links too... I was
> talking about correctly formatted links that point to a 400+ status
> code resource. Something libxml would not pick up since I guess you're
> talking about its syntax checking bit.

Well, libxml stores the line number of every element. So you can
extract all links, check them, and print out element.line_num for each
one that fails the check.

Here's some starter code:

#----------------------------------------------

require 'rubygems'
require 'xml'

XML::Parser.default_line_numbers = true

html = <<END_HTML
  <html>
  <head><title>test</title></head>
  <body>
    Here is a <a href="http://brok.en">broken link.</a>
  </body>
  </html>
END_HTML

parser = XML::Parser.string html
doc = parser.parse

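# Placeholder check: always reports the link as broken.
# Replace the body with a real HTTP status check.
def broken?(link)
  true
end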

doc.find("//a[@href]").each do |link|
  if broken?(link)
    puts "Broken link to #{link['href']} on line #{link.line_num}"
  end
end
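
The broken?(link) stub above always returns true. One possible way to fill it in, using only Ruby's standard library, is sketched below. The names broken_status? and the timeout parameter are assumptions for illustration, and this version takes the href string rather than the node. It treats any status of 400 or above as broken, as described earlier in the thread, and also counts unparseable, non-HTTP or unreachable URLs as broken.

```ruby
require 'net/http'
require 'uri'

# Any 4xx/5xx status counts as broken (the "400+" rule from the thread).
def broken_status?(code)
  code.to_i >= 400
end

# Hypothetical replacement for the stub; takes the href string.
# Sends a HEAD request and inspects the status code. Redirects (3xx)
# are not followed and are not reported as broken.
def broken?(href, timeout: 5)
  uri = URI.parse(href)
  # Simplification: anything that is not http(s) is reported as broken.
  return true unless uri.is_a?(URI::HTTP) && uri.host

  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == 'https',
                  open_timeout: timeout, read_timeout: timeout) do |http|
    broken_status?(http.head(uri.request_uri).code)
  end
rescue URI::InvalidURIError, SocketError, SystemCallError, Timeout::Error
  true # malformed URL, DNS failure, refused connection or timeout
end
```

With this definition the loop would call broken?(link['href']) instead of broken?(link). Some servers reject HEAD requests, so falling back to GET may be needed in practice.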
Mark T. (Guest)
on 2008-10-08 07:00
(Received via mailing list)
On Oct 7, 1:36 pm, I wrote:
> Well, libxml stores the line number of every element. So you can
> extract all links, check them, and print out element.line_num for each
> one that fails the check.

Oops, my example mistakenly used the XML parser, so replace that with
XML::HTMLParser since you are parsing HTML.

-- Mark.
(Guest)
on 2008-10-08 23:08
(Received via mailing list)
Thanks for the hint towards libxml-ruby! I didn't even know it
existed. I can't see anything for character position, but I'm very happy
indeed. I will have a go at implementing it myself when possible.

cheers
-h