Crazy gsub/regex scheme - can this be done better?

weyus · August 11, 2006, 9:47pm

All,

I have a method (that I believe to be working) that will take arbitrary
HTML and quote all of the non-quoted attributes (so href=junk would
become href=“junk”).

The method is below. As you can see it’s a gsub within a gsub, where
the first gsub regex basically identifies any tag that has at least one
unquoted attribute, and then the inner gsub fixes ALL of the unquoted
attributes.

QUESTION: Is there a way to do this with one gsub, or is this scheme
really the only valid way to handle it?

Thanks,
Wes

# Make sure that every tag attribute is contained within either single or double quotes.
# The initial regex is to find at least one "bad" attribute-value pair.
# The "inner" regex is to actually fix ALL of the "bad" attribute-value pairs.
private
def ensure_quoted_attributes
  @html.gsub!(/<(?!!)[a-zA-Z0-9]+\s+           # Non-comment tag name, followed by whitespace
    (?:[a-zA-Z0-9]+?=(['"])(.*?)\1\s*)*?       # Any number of valid attribute-value pairs (attribute="value"), not-greedy
    [a-zA-Z0-9]+?=[^"'\s>]+\s*?                # An unquoted attribute-value pair (attribute=value)
    .*?>                                       # Rest of tag
    /mix) { |s|
    s.gsub(/(\s+[a-zA-Z0-9]+?=)([^"'\s>]+)(\s*?)/) { |sub_s| "#{$1}\"#{$2}\"#{$3}" }
  }
end
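
For anyone who wants to try the scheme outside the class, here is a minimal sketch of the same nested gsub as a standalone function. The names quote_attrs_v1, TAG_WITH_UNQUOTED and UNQUOTED_ATTR are made up for illustration, the regexes are collapsed onto one line without the /x comments, the trailing whitespace capture is dropped for brevity, and the sample strings are assumptions rather than anything from real pages.

```ruby
# Hypothetical standalone version of the nested-gsub scheme: it takes a
# string argument instead of mutating @html.

# Outer pattern: a non-comment tag containing at least one unquoted attribute.
TAG_WITH_UNQUOTED = /<(?!!)[a-zA-Z0-9]+\s+(?:[a-zA-Z0-9]+?=(['"])(.*?)\1\s*)*?[a-zA-Z0-9]+?=[^"'\s>]+\s*?.*?>/mi

# Inner pattern: one name=value pair whose value does not start with a quote.
UNQUOTED_ATTR = /(\s+[a-zA-Z0-9]+?=)([^"'\s>]+)/

def quote_attrs_v1(html)
  html.gsub(TAG_WITH_UNQUOTED) do |tag|
    # Within the matched tag, wrap every unquoted value in double quotes.
    tag.gsub(UNQUOTED_ATTR) { "#{$1}\"#{$2}\"" }
  end
end

puts quote_attrs_v1('<a href=junk class="x">link</a>')
# <a href="junk" class="x">link</a>

puts quote_attrs_v1('<a href="ok">link</a>')
# <a href="ok">link</a>   (no unquoted attribute, so the tag is left alone)
```

The outer gsub only decides which tags are worth touching; all of the actual rewriting happens in the inner gsub.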

weyus · August 11, 2006, 11:17pm

Wes G. wrote:

All,

I have a method (that I believe to be working) that will take arbitrary
HTML and quote all of the non-quoted attributes (so href=junk would
become href=“junk”).

You might want to just look into using Tidy, hpricot
or something that should fix broken HTML to compliant
XHTML. They would probably do this for you.


weyus · August 11, 2006, 11:37pm

The problem with those kinds of parsers (I’m using Rubyful Soup to some
degree) is that they try to “fix” the HTML for you and sometimes cause
it to be rendered incorrectly compared to the original “incorrect”
implementation.

WG

weyus · August 12, 2006, 1:49am

[ INSANE COMMENT: I just want to say that the black magic that is
regexes is so powerful and alluring that I can’t resist it and at the
same time so repulsive that I never want to do it again. ]

Update - my original scheme would fail when there was an attribute like

content=“text/html; charset=UTF-8”

because the latter half would be seen as needing to be charset=“UTF-8”.

Thus, I became intimate with negative zero-width lookahead.
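
The failure is easy to reproduce in isolation. The sketch below is only an illustration: NAIVE_ATTR is the inner regex from the first version (minus the trailing whitespace group), GUARDED_ATTR adds one negative lookahead that rejects a value running straight into a closing quote, and both constant names and the sample strings are assumptions.

```ruby
# Inner regex from the first scheme: any name=value whose value is unquoted.
NAIVE_ATTR = /(\s+[a-zA-Z0-9]+?=)([^"'\s>]+)/

# Same idea, but refuse to match when the "value" is followed by a quote and
# then whitespace or '>' (i.e. it is the tail of an already-quoted value).
GUARDED_ATTR = /(\s+[a-zA-Z-]+=)(?![^'"]*?['"][\s>])([^'"\s>]+)/

meta = '<meta content="text/html; charset=UTF-8">'

broken = meta.gsub(NAIVE_ATTR) { "#{$1}\"#{$2}\"" }
puts broken
# <meta content="text/html; charset="UTF-8"">

fixed = meta.gsub(GUARDED_ATTR) { "#{$1}\"#{$2}\"" }
puts fixed
# <meta content="text/html; charset=UTF-8">   (unchanged)

plain = '<a href=junk>'.gsub(GUARDED_ATTR) { "#{$1}\"#{$2}\"" }
puts plain
# <a href="junk">   (a genuinely unquoted value is still fixed)
```

The lookahead consumes nothing, so when it lets a match through, the value is still captured in full by the group that follows it.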

Here’s what I believe to be a more correct solution (I apologize for the
formatting but I wanted to leave the comments in here).

Wes

# Make sure that every tag attribute is contained within either single or double quotes.
# The initial regex is to find at least one "bad" attribute-value pair.
# The "inner" regex is to actually fix ALL of the "bad" attribute-value pairs.
private
def ensure_quoted_attributes
  @html.gsub!(/<(?!!)[a-zA-Z0-9]+\s+         # Non-comment tag name, followed by whitespace
    (?:[a-zA-Z-]+?=(['"]).*?\1\s*)*?          # Any number of valid attribute-value pairs (attribute="value"), not-greedy
    [a-zA-Z-]+?=[^"'\s>]+\s*?                 # An unquoted attribute-value pair (attribute=value)
    .*?>                                      # Rest of tag
    /mix) { |s|                               # For each tag gotten from the first regex, globally substitute into it based on...
    s.gsub(/(\s+[a-zA-Z-]+=)                  # Attribute name
      (?!(['"])[^'"]*?\2[\s>])                # If the value looks like "stuff", then don't match, it's fine
      (?![^'"]*?['"][\s>])                    # If the value looks like stuff", then don't match, it must be the tail end of another attribute-value pair
      ([^'"\s>]+)                             # Get the no-whitespace, no-'>', no-quote text
      /mix) { |sub_s| "#{$1}\"#{$3}\"" }      # Substitute attribute name="attribute value"
  }
end
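
On the original question of whether one gsub could do the job, here is one possible sketch, offered as an alternative to try rather than a proven replacement. Instead of lookaheads it lets the regex consume quoted values whole, so text such as charset=UTF-8 inside a quoted value is never examined separately. A trailing lookahead approximates "we are inside a tag" by requiring a '>' before any further '<' or '>'. The names SINGLE_PASS_ATTR and quote_attrs_single_pass are hypothetical, and the approach assumes quoted values contain no angle brackets.

```ruby
# Group 1: attribute name plus '='.
# Group 2: an already-quoted value (left untouched).
# Group 3: an unquoted value (to be wrapped in double quotes).
# The final lookahead requires a closing '>' with no '<' or '>' in between,
# which is a rough stand-in for "inside a tag".
SINGLE_PASS_ATTR = /(\s[a-zA-Z-]+=)(?:("[^"]*"|'[^']*')|([^"'\s>]+))(?=[^<>]*>)/

def quote_attrs_single_pass(html)
  html.gsub(SINGLE_PASS_ATTR) do
    # If group 2 matched, the value was already quoted: return it unchanged.
    $2 ? $& : "#{$1}\"#{$3}\""
  end
end

puts quote_attrs_single_pass('<meta http-equiv=Content-Type content="text/html; charset=UTF-8">')
# <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

puts quote_attrs_single_pass('<p>set x=1 here</p>')
# <p>set x=1 here</p>   (text outside a tag is not touched)
```

The trade-off is that the "inside a tag" check is only a heuristic, whereas the nested scheme matches the whole tag explicitly before rewriting anything in it.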