Htmltools incorrectly parsing HTML containing server-side tags

I’m using HTML Tools 1.09 to parse HTML that contains tags that are to
be processed by the web server. For example, here’s an image tag:

this is a
seperator

The <$DCGallery$> will be replaced by some text when returned to the
browser by the web server.

What I’m noticing is that HTML Tools doesn’t handle tags that contain
such an embedded tag. It seems to try to correct what it sees as
invalid HTML. So the above tag, after going through the parser and
having a new class added, using:

element.add_attribute('class', ' wide_content')

results in the following tag:

<$DCGallery$>Separators/gtabseps.gif" alt="this is
a seperator">

The image tag is closed after the new class attribute and the
server-side tag is duplicated and contains the alt attribute from the
original image tag. Has anyone encountered such behavior?

I know that HTML Tools probably wasn’t built to handle HTML with
embedded server-side tags, but for this project I need to process HTML
before being served up by the web server. Shouldn’t HTML Tools ignore
tags found within the quotes of the src attribute’s value? Is there an
option or patch that might get HTML Tools to ignore tags found within
the values of tag attributes?

[email protected] wrote:

I’m using HTML Tools 1.09 to parse HTML that contains tags that are to
be processed by the web server. For example, here’s an image tag:

this is a
seperator

Is this valid html? From another thread:

On 7/25/06, [email protected] [email protected] wrote:

I’m using HTML Tools 1.09 to parse HTML that contains tags that are to
be processed by the web server. For example, here’s an image tag:

this is a seperator

I’m not familiar with HTML Tools, but it’s probably not reasonable to
expect an HTML tool to parse bad HTML like that.

My lazy-arse solution would probably be to replace server tags
<\$([^>]*)\$>
with something insanely unlikely, like {{{{{$1}}}}}. Then do whatever
you want with HTML Tools, and once you’re done replace the {{{{{ }}}}}
with angle brackets again.

Assuming curly brackets are ok with HTML Tools :)

;Daniel
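
A minimal sketch of that placeholder swap in plain Ruby, for anyone who wants to try it. The method names protect_server_tags and restore_server_tags are made up for illustration, the sample tag is assumed from the description at the top of the thread, and the HTML Tools step is only a comment because its API is not shown here. It assumes server tags never contain ">" and that "{{{{{" never occurs in the real markup.

```ruby
# Hypothetical helpers; these names are not part of HTML Tools.
SERVER_TAG  = /<\$([^>]*)\$>/
PLACEHOLDER = /\{\{\{\{\{(.*?)\}\}\}\}\}/

# Swap each <$...$> for {{{{{...}}}}} so the parser sees no nested brackets.
def protect_server_tags(html)
  html.gsub(SERVER_TAG) { '{{{{{' + Regexp.last_match(1) + '}}}}}' }
end

# Put the original <$...$> tags back after parsing.
def restore_server_tags(html)
  html.gsub(PLACEHOLDER) { '<$' + Regexp.last_match(1) + '$>' }
end

# Assumed sample input, based on the tag described in the first post.
html = '<img src="<$DCGallery$>Separators/gtabseps.gif" alt="this is a separator">'

safe = protect_server_tags(html)
# ... hand `safe` to the parser here and take its output instead ...
restored = restore_server_tags(safe)

puts safe
puts restored
```

The block form of gsub is used so that nothing in the replacement text is treated as a backslash escape.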

William J. wrote:

[email protected] wrote:

I’m using HTML Tools 1.09 to parse HTML that contains tags that are to
be processed by the web server. For example, here’s an image tag:

this is a
seperator

Is this valid html?

Thank you for this information. I did a bit more research and now
believe that this is not valid HTML. Read on…

Unfortunately, escaping is not an option since the HTML files that are
being parsed are being output from another closed system.

The question is: can HTML Tools be told to ignore “<” and “>” inside of
attribute values? Or is there another HTML parser for Ruby that would
handle this?

Alternatively, is there some method for finding these characters within
attribute values and escaping them before parsing by Ruby and then
un-escaping them after parsing (so that the server can perform the
required processing of these PHP-like tags).

sutch wrote:

Thank you for this information. I did a bit more research and now

$ echo ‘’ | xmllint -
Alternatively, is there some method for finding these characters within
attribute values and escaping them before parsing by Ruby and then
un-escaping them after parsing (so that the server can perform the
required processing of these PHP-like tags).

Perhaps this will work.

# Sample input: the image tag from the top of the thread (assumed).
str = <<HERE
<img src="<$DCGallery$>Separators/gtabseps.gif" alt="this is a separator">
HERE

# We will split the html string into an array of strings.
# Each member of the array will be an html comment, an
# html tag, or plain text.

re = %r{ ( <!--.*?--> |
  < (?:
      [^<>"] +
    |
      " (?: \\. | [^\\"]+ ) * "
    ) *
  >
) }xm

str.split( re ).each { |x|
  if "<" == x[0,1] && "<!" != x[0,2]
    # Since > is o.k., change only <.
    x[1..-2] = x[1..-2].gsub( /</, "&lt;" )
  end

  print x
}
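
The snippet above only covers the escaping half. Below is a rough sketch of the full round trip, escaping before parsing and un-escaping afterwards, using the same pattern. The helper names (map_tag_bodies, escape_server_tags, unescape_server_tags) and the sample tag are hypothetical. It also assumes the parser writes attribute values back unchanged and that no attribute legitimately contains the text "&lt;$"; neither has been checked against HTML Tools.

```ruby
# Matches an HTML comment or a tag, allowing < and > inside quoted values.
TAG_RE = %r{ ( <!--.*?--> |
               < (?: [^<>"]+ | " (?: \\. | [^\\"]+ )* " )* >
             ) }xm

# Hypothetical helper: passes the inside of every non-comment tag to the
# block and rebuilds the tag from whatever the block returns.
def map_tag_bodies(html)
  html.gsub(TAG_RE) do |tag|
    next tag if tag.start_with?("<!")   # leave comments and doctypes alone
    "<" + yield(tag[1..-2]) + ">"
  end
end

# Before parsing: hide every < that sits inside a tag.
def escape_server_tags(html)
  map_tag_bodies(html) { |body| body.gsub("<", "&lt;") }
end

# After parsing: bring back only the escaped server-tag openers.
def unescape_server_tags(html)
  map_tag_bodies(html) { |body| body.gsub("&lt;$", "<$") }
end

# Assumed sample input, based on the tag described in the first post.
html = '<img src="<$DCGallery$>Separators/gtabseps.gif" alt="this is a separator">'

escaped = escape_server_tags(html)
# ... run the parser on `escaped` here and take its output instead ...
restored = unescape_server_tags(escaped)

puts escaped
puts restored
```

Restricting both passes to text inside tags means a "&lt;" in ordinary page text is left alone.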