Hpricot elem index/position

casper_the_ghost · October 6, 2008, 6:12pm

Hey,

Trying to find the String index of an Hpricot::Elem within its doc.
For example…

doc = Hpricot("<a>bob</a><a>james</a><a>dan</a>")
elem = doc.search("a")[1]
elem.start #=> 10 (the first '<' of the second a tag)

and eventually the following would be good…

elem.length #=> 12
elem.end #=> 21

Any thoughts appreciated!
Henners

casper_the_ghost · October 6, 2008, 10:55pm

On Oct 6, 10:19 am, “[email protected]”
[email protected] wrote:

> elem.length #=> 12
> elem.end #=> 21
>
> Any thoughts appreciated!
> Henners

My first thought is: Why do you want that information? Character
position is meaningless in an XML and HTML DOM. Whitespace can change
character positions without affecting the DOM at all.

– Mark.

casper_the_ghost · October 7, 2008, 10:00am

Hi Mark,

I’m writing a broken-link reporting tool. When I find a dodgy tag,
I’d like to be able to relay the character position and/or line number
to the user, which is useful for debugging.

Thanks -h
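
For the reporting side, a character offset into the raw HTML can be turned into a line and column with plain string operations. This is a minimal sketch, assuming the tool still holds the source string it fetched; `line_and_column` is a hypothetical helper, not part of Hpricot or libxml.

```ruby
# Hypothetical helper: convert a 0-based character offset into a
# 1-based [line, column] pair.
def line_and_column(source, offset)
  before = source[0, offset]            # text preceding the offset
  line = before.count("\n") + 1         # newlines seen so far, 1-based
  last_newline = before.rindex("\n")    # nil while still on the first line
  column = last_newline ? offset - last_newline : offset + 1
  [line, column]
end

html = "<p>\n  <a href=\"/missing\">gone</a>\n</p>\n"
offset = html.index("<a ")              # 0-based offset of the link tag
line, column = line_and_column(html, offset)
puts "link at offset #{offset}, line #{line}, column #{column}"
# prints "link at offset 6, line 2, column 3"
```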

casper_the_ghost · October 7, 2008, 4:30pm

Well, I suppose there are incorrectly formatted links too… I was
talking about correctly formatted links that point to a 400+ status
code resource. Something libxml would not pick up since I guess you’re
talking about its syntax checking bit.

Since the entire document is accessible from the Hpricot::Elem, it
seems plausible to count the characters up to and after the element. A
15-minute look at the source didn’t reveal anything obvious… I have a
nasty feeling that this type of thing would have to be done in the
compiled C section of it…
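
One way to count characters without touching Hpricot's internals is to scan the raw source separately. The sketch below is a workaround under stated assumptions rather than an Hpricot feature: `anchor_spans` is a hypothetical helper built on Ruby's standard `StringScanner` and a simple regex, so it ignores nested anchors, comments, and malformed markup. The sample string assumes the document from the first post was three adjacent a elements.

```ruby
require "strscan"

# Hypothetical helper: return [start, length, end] (0-based, end inclusive)
# for every <a>...</a> element found in the raw source string.
def anchor_spans(source)
  scanner = StringScanner.new(source)
  spans = []
  while scanner.scan_until(/<a\b[^>]*>.*?<\/a>/mi)
    length = scanner.matched.length     # characters in the matched element
    start = scanner.charpos - length    # charpos sits just past the match
    spans << [start, length, start + length - 1]
  end
  spans
end

spans = anchor_spans("<a>bob</a><a>james</a><a>dan</a>")
p spans[1]  # => [10, 12, 21]
```

Working on the source string keeps the offsets tied to what the user sees in their editor, which a parsed tree cannot guarantee once whitespace has been normalised.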

casper_the_ghost · October 7, 2008, 3:57pm

On Oct 7, 3:58 am, “[email protected]”
[email protected] wrote:

> Hi Mark,
>
> I’m writing a broken link reporting type tool. When I find a dodgy tag
> I’d like to be able to relay the character position and or line number
> to the user. Useful for debugging.

So, are you really interested in broken links (as in, a GET does not
return a 200 status code) or broken HTML? I have done the former via
AJAX (jQuery sends links to a backend Rails action and, if a link is
broken, changes the class of the link to display a red background). The
latter can probably be done with libxml, which reports the character
position of broken input.

– Mark.

casper_the_ghost · October 7, 2008, 7:40pm

On Oct 7, 10:28 am, “[email protected]”
[email protected] wrote:

> Well, I suppose there are incorrectly formatted links too… I was
> talking about correctly formatted links that point to a 400+ status
> code resource. Something libxml would not pick up since I guess you’re
> talking about its syntax checking bit.

Well, libxml stores the line number of every element. So you can
extract all links, check them, and print out element.line_num for each
one that fails the check.

Here’s some starter code:

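#----------------------------------------------

require 'rubygems'
require 'xml'

XML::Parser.default_line_numbers = true

# Sample document; the exact markup and the href value are placeholders.
html = <<END_HTML
<html>
  <head><title>test</title></head>
  <body>
    <p>Here is a <a href="http://example.com/missing">broken link</a>.</p>
  </body>
</html>
END_HTML

parser = XML::Parser.string html
doc = parser.parse

# Stub: replace with a real check of the link target.
def broken?(link)
  true
end

doc.find("//a[@href]").each do |link|
  if broken?(link)
    puts "Broken link to #{link['href']} on line #{link.line_num}"
  end
end

#----------------------------------------------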
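
The `broken?` stub above always returns true. One possible way to fill it in is sketched here with Ruby's standard `net/http`. The names `broken_status?` and `broken_url?`, the use of a HEAD request, and the 5-second timeouts are illustrative assumptions, not something taken from the thread.

```ruby
require "net/http"
require "uri"

# Treat any status of 400 or above as broken (the "400+" rule in this thread).
def broken_status?(code)
  code.to_i >= 400
end

# Hypothetical replacement for the stub: takes an absolute URL string,
# issues a HEAD request, and reports unreachable or unparsable URLs as broken.
def broken_url?(url)
  uri = URI.parse(url)
  # Not an absolute HTTP(S) link (mailto:, relative path, ...): nothing to check.
  return false unless uri.is_a?(URI::HTTP)

  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == "https",
                             open_timeout: 5, read_timeout: 5) do |http|
    http.head(uri.request_uri)
  end
  broken_status?(response.code)
rescue URI::InvalidURIError, SocketError, SystemCallError, Timeout::Error
  true
end
```

In the loop this could be called as `broken_url?(link['href'])`. Relative hrefs would need resolving against the page URL first, and some servers answer HEAD differently from GET, so a GET fallback may be worth adding.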

casper_the_ghost · October 8, 2008, 5:00am

On Oct 7, 1:36 pm, I wrote:

> Well, libxml stores the line number of every element. So you can
> extract all links, check them, and print out element.line_num for each
> one that fails the check.

Oops, my example mistakenly used the XML parser, so replace that with
XML::HTMLParser since you are parsing HTML.

– Mark.

casper_the_ghost · October 8, 2008, 9:08pm

Thanks for the hint towards libxml-ruby! I didn’t even know it
existed. I can’t see anything for character position, but I’m very
happy indeed. Will have a go at implementing it myself when poss…

cheers
-h