Hey,
Trying to find the String index of an Hpricot::Elem within its doc.
For example…
doc = Hpricot("<a>bob</a><a>james</a><a>dan</a>")
elem = doc.search("a")[1]
elem.start #=> 10 (the first '<' of the second a tag)
and eventually the following would be good…
elem.length #=> 12
elem.end #=> 21
Any thoughts appreciated!
Henners
On Oct 6, 10:19 am, [email protected] wrote:
> elem.length #=> 12
> elem.end #=> 21
> Any thoughts appreciated!
> Henners
My first thought is: Why do you want that information? Character
position is meaningless in an XML and HTML DOM. Whitespace can change
character positions without affecting the DOM at all.
– Mark.
Hi Mark,
I’m writing a broken link reporting type tool. When I find a dodgy tag
I’d like to be able to relay the character position and/or line number
to the user. Useful for debugging.
Thanks -h
Well, I suppose there are incorrectly formatted links too… I was
talking about correctly formatted links that point to a 400+ status
code resource. Something libxml would not pick up since I guess you’re
talking about its syntax checking bit.
Since the entire document is accessible from the Hpricot::Elem it
seems plausible to count the characters up to and after the element. A
15min look at the source didn’t reveal anything obvious… Have a nasty
feeling that this type of thing would have to be done in the compiled
C section of it…
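For what it's worth, the "count the characters" idea can be roughed out without touching Hpricot at all, by scanning the raw source string. This is only a sketch under assumptions: `anchor_spans` and `AnchorSpan` are made-up names, the sample string is an assumed stand-in for the example in the first post, and a regex like this will miss nested or malformed markup that a real parser would handle.

```ruby
# Hypothetical helper, not part of Hpricot: locate <a> elements in raw HTML
# by scanning the source text, and record where each one starts and ends.
AnchorSpan = Struct.new(:start_pos, :char_count, :end_pos, :line)

def anchor_spans(source)
  spans = []
  # Non-greedy match from an opening <a ...> tag to the nearest </a>.
  source.scan(%r{<a\b[^>]*>.*?</a>}mi) do
    match = Regexp.last_match
    start_pos = match.begin(0)                   # character offset of '<'
    char_count = match[0].length                 # length of the whole element
    line = source[0...start_pos].count("\n") + 1 # 1-based line number
    spans << AnchorSpan.new(start_pos, char_count, start_pos + char_count - 1, line)
  end
  spans
end

html = "<a>bob</a><a>james</a><a>dan</a>"
second = anchor_spans(html)[1]
puts "start=#{second.start_pos} length=#{second.char_count} " \
     "end=#{second.end_pos} line=#{second.line}"
# prints: start=10 length=12 end=21 line=1
```

The offsets refer to the text exactly as it was read, so they only make sense if the same string is what gets shown to the user.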
On Oct 7, 3:58 am, [email protected] wrote:
> Hi Mark,
> I’m writing a broken link reporting type tool. When I find a dodgy tag
> I’d like to be able to relay the character position and/or line number
> to the user. Useful for debugging.
So, are you really interested in broken links (as in, a GET does not
return a 200 result code) or in broken HTML? I have done the former via
AJAX (jQuery sends each link to a backend Rails action, and if the link
is broken, changes its class to display a red background). The
latter can probably be done with libxml, which reports the character
position of broken input.
– Mark.
On Oct 7, 10:28 am, [email protected] wrote:
> Well, I suppose there are incorrectly formatted links too… I was
> talking about correctly formatted links that point to a 400+ status
> code resource. Something libxml would not pick up since I guess you’re
> talking about its syntax checking bit.
Well, libxml stores the line number of every element. So you can
extract all links, check them, and print out element.line_num for each
one that fails the check.
Here’s some starter code:
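#----------------------------------------------
require 'rubygems'
require 'xml'

XML::Parser.default_line_numbers = true

# NOTE: the markup of this sample was lost from the post; the tags and the
# href below are placeholders reconstructed around the surviving text.
html = <<END_HTML
<html>
<head><title>test</title></head>
<body>
<p>Here is a
<a href="http://example.com/missing">broken link</a>.</p>
</body>
</html>
END_HTML

parser = XML::Parser.string html
doc = parser.parse

# Stub: replace with a real check of the link target.
def broken?(link)
  true
end

doc.find("//a[@href]").each do |link|
  if broken?(link)
    puts "Broken link to #{link['href']} on line #{link.line_num}"
  end
end
#----------------------------------------------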
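The broken?(link) stub above always returns true. One possible way to fill it in, using only Ruby's standard library, is sketched here. The names `broken_status?` and `broken_url?`, the use of a HEAD request, and the 5-second timeout are assumptions for illustration, not something from this thread. It takes the href string rather than the node, so it would be called as broken_url?(link['href']).

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: a status code of 400 or above counts as broken.
def broken_status?(code)
  code.to_i >= 400
end

# Hypothetical helper: true if the URL cannot be parsed, cannot be reached,
# or answers with a 400+ status. Relative and non-HTTP links (mailto:, etc.)
# are skipped and reported as not broken, since they cannot be checked here.
def broken_url?(href, timeout: 5)
  uri = URI.parse(href)
  return false unless uri.is_a?(URI::HTTP) && uri.host

  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == 'https',
                             open_timeout: timeout,
                             read_timeout: timeout) do |http|
    http.head(uri.request_uri) # HEAD avoids downloading the body
  end
  broken_status?(response.code)
rescue StandardError
  # Unparseable URL, DNS failure, refused connection, timeout, TLS error...
  true
end
```

Some servers reject HEAD requests, so falling back to a GET before reporting a link as broken may be worth considering.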
On Oct 7, 1:36 pm, I wrote:
> Well, libxml stores the line number of every element. So you can
> extract all links, check them, and print out element.line_num for each
> one that fails the check.
Oops, my example mistakenly used the XML parser, so replace XML::Parser
with XML::HTMLParser since you are parsing HTML.
– Mark.
Thanks for the hint towards libxml-ruby! I didn’t even know it
existed. Can’t see anything for character position but very happy
indeed. Will have a go at implementing it myself when poss…
cheers
-h