Html special characters. h() failure

ndavey · January 27, 2006, 9:21am

I was trying to convert some text containing the ® character so that
the character \xAE was replaced with &reg;.

h(@item.description) didn’t do anything. I needed to use
@item.description.gsub(/\xAE/, '&reg;') for it to work.

I think the h() function should be able to do all the codes that are
available.

Regards Neil.
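
A minimal sketch of the workaround described above, in plain Ruby: escape the markup characters first, then map the extra characters to named entities. The helper name h_with_entities and the EXTRA_ENTITIES table are hypothetical, not part of Rails or ERB, and the sketch assumes the text is UTF-8 (so ® is written "\u00AE" rather than the single byte \xAE).

```ruby
require 'erb'

# Hypothetical table of extra characters to turn into named entities.
EXTRA_ENTITIES = {
  "\u00AE" => "&reg;",  # registered sign
  "\u00A9" => "&copy;"  # copyright sign
}.freeze

# Hypothetical helper: run the standard escape first, then substitute
# the extra characters. Escaping first matters; otherwise the "&" in
# "&reg;" would itself be escaped.
def h_with_entities(text)
  escaped = ERB::Util.html_escape(text)
  escaped.gsub(Regexp.union(EXTRA_ENTITIES.keys), EXTRA_ENTITIES)
end

puts h_with_entities("Acme\u00AE <b>")
```

If the page is served as UTF-8 the literal ® displays fine on its own, so a substitution like this is only needed when the output charset cannot represent the character.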

ndavey · January 27, 2006, 10:15am

Hi
I have text that I get from the user that is stored in the database
after escaping the html.

I want to display this text in the view with the markup (this is easy),
but I also want to display it in the alt attribute of an image, where I
would like all markup stripped out.

I’m hoping someone can point me in the direction of an existing function
or helper method so I don’t have to reinvent the wheel.

Thanks in advance,

Francois

ndavey · January 27, 2006, 10:24am

http://railsmanual.org/module/ActionView::Helpers::TextHelper/strip_tags

Bob S.
http://www.railtie.net/
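
For readers without Rails at hand, the idea behind stripping markup for an alt attribute can be sketched in plain Ruby. This is a naive regex version under the assumption that the input is reasonably well-formed; it is not how the strip_tags helper is implemented, and the name strip_markup is made up for illustration.

```ruby
# Hypothetical helper: remove anything that looks like a tag, then
# collapse the leftover whitespace into single spaces.
def strip_markup(html)
  html.gsub(/<[^>]*>/, '').gsub(/\s+/, ' ').strip
end

puts strip_markup("<p>Hello <b>world</b></p>")
```

In a Rails view the linked strip_tags helper is the better choice; the result should still go through h() before being placed in an attribute.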

ndavey · January 27, 2006, 5:39pm

Yeah, someone posted yesterday that html_escape only replaces “<”, “>”,
and “&”. I couldn’t believe that but went and verified it in the ERB
sourcecode. Seems a might bit naive to me… it doesn’t even replace
quotes (note to self: never use ERB to replace attribute values).

Anyway, the html_escape method is just a chained gsub… you could just
override that and add a bunch more chars to the chain… and then share
it with us all!

b

ndavey · January 27, 2006, 11:06am

RedCloth’s HTML filter is very capable. You can strip all HTML tags,
or define which tags and attributes (like alt, src etc.) can remain.
But it’s a private RedCloth function, so either you make it static and
public or you use RedCloth filters. Or just use the fragment below that
I extracted from RedCloth. I think it’s self-explanatory: the tags in
the BASIC_TAGS hash will be kept, all others will be removed.

(this is an extension to the String class)

class String

  # Tags to keep, mapped to the attributes allowed on each tag
  # (nil or [] means no attributes are kept).
  BASIC_TAGS = {
    'a' => ['href', 'title'],
    'img' => ['src', 'alt', 'title', 'align', 'width', 'height', 'border', 'class'],
    'br' => [],
    'i' => nil,
    'u' => nil,
    'b' => nil,
    'pre' => nil,
    'kbd' => nil,
    'code' => ['lang'],
    'cite' => nil,
    'strong' => nil,
    'em' => nil,
    'ins' => nil,
    'sup' => nil,
    'sub' => nil,
    'del' => nil,
    'table' => nil,
    'tr' => nil,
    'td' => ['colspan', 'rowspan'],
    'th' => nil,
    'ol' => nil,
    'ul' => nil,
    'li' => nil,
    'p' => nil,
    'h1' => nil,
    'h2' => nil,
    'h3' => nil,
    'h4' => nil,
    'h5' => nil,
    'h6' => nil,
    'blockquote' => ['cite']
  }

  # Cleans text in place: tags not listed in tags are replaced by a space.
  def self.clean_html!( text, tags = BASIC_TAGS )
    text.gsub!( /<!\[CDATA\[/, '' )
    text.gsub!( /<(\/*)(\w+)([^>]*)>/ ) do
      raw = $~
      tag = raw[2].downcase
      if tags.has_key? tag
        pcs = [tag]
        # add nofollow to opening <a> tags only, not to </a>
        pcs << "rel=\"nofollow\"" if tag == 'a' and raw[1].empty?
        tags[tag].each do |prop|
          # try double-quoted, single-quoted, then unquoted attribute values
          ['"', "'", ''].each do |q|
            q2 = ( q != '' ? q : '\s' )
            if raw[3] =~ /#{prop}\s*=\s*#{q}([^#{q2}]+)#{q}/i
              attrv = $1
              next if tag != 'img' and prop == 'src' and attrv !~ /^http/
              # use attrv here: the !~ test above can reset $1
              pcs << "#{prop}=\"#{attrv.gsub('"', '\"')}\""
              break
            end
          end
        end if tags[tag]
        "<#{raw[1]}#{pcs.join " "}>"
      else
        " "
      end
    end
  end

  # Non-destructive version: returns a cleaned copy of text.
  def self.clean_html( text, tags = BASIC_TAGS )
    str = text.dup
    clean_html!(str, tags)
    str
  end

  # Instance-level convenience; note that it cleans the text passed in,
  # not the receiver.
  def clean_html( text, tags = BASIC_TAGS )
    self.class.clean_html!(text, tags)
  end

end

ndavey · January 28, 2006, 12:39am

Francois B. wrote:

2006/1/27, Ben M.:

Anyway, the html_escape method is just a chained gsub… you could just override that and
add a bunch more chars to the chain… and then share it with us all!

Hmm, that would be a bad idea. The purpose of html_escape is to
ESCAPE bad characters, not do translations. If you want that, look
into the textilize helper method, or textilize_without_paragraph.

I don’t follow you. I’m not talking about “translations”. I’m saying
that there are a bunch more potentially “bad” characters than just gt,
lt, and amp. The purpose of the html_escape method is to escape any
characters in the input text to their appropriate x/html versions.

I’m simply saying that whoever wrote that method should be at least
escaping quotes… and probably apostrophes. Most everything else one
could live without, but as the OP pointed out, it would be nice to have
another version of (or an option passed to) html_escape to do things
like copyright (c), registered (r), etc. That might be more
textilize-territory, but well, we’d probably need to get into wrassling
mode then. (that’s Amurican for we’d need to argue some more)

For that matter, I would propose that the html_escape method be removed.
Instead, the default behavior of ERB should be to replace any and all
potentially problematic characters with the appropriate entities. If,
for some reason, the user does not desire this, then they should use
something like a “no_escape” (“no”??) method to override the default
escaping. It would also be good to have an “override for this file”
method so that you can just turn it off for e.g. email templates.

I find it very amusing that the agile book counsels that you should
almost always have that “h” in your erb outs… it’s easy to miss… make
sure you don’t forget it! Doesn’t sound particularly DRY to me.

But actually, I’m thinking I don’t really want to stick with ERB too
long anyway… templating is so nineties… I’m planning on spending some
quality time with rexml, markaby, and xx.

b

ndavey · January 28, 2006, 6:39pm

Ben M. wrote:

I don’t follow you. I’m not talking about “translations”. I’m saying
that there are a bunch more potentially “bad” characters than just
gt, lt, and amp.

No there aren’t - the only other potentially bad character is ",
and that’s only ever (potentially) a problem in attribute values. If
you’re having problems with any other character, there’s a problem
with character set mismatches somewhere in your application.

The purpose of the html_escape method is to escape any characters
in the input text to their appropriate x/html versions.

Which it does, with the arguable exception of ".

Think about what would be needed for it to do any more than it does. In
order to be able to translate any of the other characters meaningfully
to the HTML escaped equivalent, you need to know which character set
you’re coming from, so you need to do a conversion to an unambiguous
base set anyway. For example: Á is the capital A acute letter.
In latin1, it’s 0xC1. In UTF-8, it’s 0xC381. If you thought you were
in latin1, but your data was actually utf-8, you’d end up with the
rather nice sequence &Atilde;&#129;. You could hypothetically do:

def new_html_escape(str, charset)
  # Iconv.iconv(to, from, *strs) returns an array of converted strings
  h(Iconv.iconv('utf-8', charset, str).first)
end

But if you’ve got enough information to make that work, why not just
arrange for the data to be in the right character set in the first
place, and avoid overcomplicating what only needs to be a simple method?
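
The character set mismatch described above can be reproduced in a few lines. This sketch assumes Ruby 1.9 or later, where String#encode does the job of the Iconv call (Iconv was later dropped from the standard library); the helper name to_utf8 is hypothetical.

```ruby
# Hypothetical helper: interpret raw bytes in the given charset and
# convert them to UTF-8.
def to_utf8(bytes, charset)
  bytes.pack('C*').force_encoding(charset).encode('UTF-8')
end

# Capital A acute is the single byte 0xC1 in latin1 ...
a_acute = to_utf8([0xC1], 'ISO-8859-1')

# ... and the two bytes 0xC3 0x81 in UTF-8.
utf8_bytes = a_acute.bytes

# Reading those UTF-8 bytes as if they were latin1 yields two characters:
# A tilde (U+00C3) followed by the control character U+0081.
misread = to_utf8(utf8_bytes, 'ISO-8859-1')

puts utf8_bytes.map { |b| format('0x%02X', b) }.join(' ')
puts misread.codepoints.map { |c| format('U+%04X', c) }.join(' ')
```

The point of the post still stands: once the charset is known well enough to make this conversion work, the data can simply be stored and served in one consistent encoding.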

Instead, the default behavior of ERB should be to replace any and all
potentially problematic characters with the appropriate entities. If,
for some reason, the user does not desire this, then they should use
something like a “no_escape” (“no”??) method to override the default
escaping.

Just… no. There are just as many cases where you don’t want
escaping to happen as those where you do. Think of all those <%= render
:partial => … %> and <%= link_to … %> that you’d have to turn
escaping off for. Just as non-DRY.

ndavey · January 28, 2006, 7:00pm

Ok, you make valid points… I take it all back… except that
html_escape should do " too. We agree on that. And actually, I think '
would be good too, since that is a valid char for enclosing attributes.

b

ndavey · January 28, 2006, 12:09am

Hi !

2006/1/27, Ben M.:

Anyway, the html_escape method is just a chained gsub… you could just override that and
add a bunch more chars to the chain… and then share it with us all!

Hmm, that would be a bad idea. The purpose of html_escape is to
ESCAPE bad characters, not do translations. If you want that, look
into the textilize helper method, or textilize_without_paragraph.

Hope that helps,
François

ndavey · January 29, 2006, 1:07am

OMFG… I looked right at that but the gsub(/"/, "&quot;") bit was
temporarily invisible… I plead brain damage… I’m just going to crawl
back in my hole now…

:-#

b

ndavey · January 29, 2006, 12:13am

Ben M. wrote:

Yeah, someone posted yesterday that html_escape only replaces “<”, “>”,
and “&”. I couldn’t believe that but went and verified it in the ERB
sourcecode. Seems a might bit naive to me… it doesn’t even replace
quotes (note to self: never use ERB to replace attribute values).

Which version of ERB are you looking at? My copy (Ruby 1.8.2) does
replace quotes:

def html_escape(s)
  s.to_s.gsub(/&/, "&amp;").gsub(/"/, "&quot;").
    gsub(/>/, "&gt;").gsub(/</, "&lt;")
end

According to the Ruby CVS [1], html_escape has been unchanged for over
three years.

[1] http://www.ruby-lang.org/cgi-bin/cvsweb.cgi/ruby/lib/erb.rb

Phil

–
Philip R.
http://tzinfo.rubyforge.org/ – DST-aware timezone library for Ruby
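
Since the thread's worry was attribute values, here is a small sketch showing that the quote replacement is what keeps an attribute intact. It uses ERB::Util.html_escape from the standard library directly; the helper name img_tag_with_alt is hypothetical and only for illustration.

```ruby
require 'erb'

# Hypothetical helper: build an <img> tag, escaping both attribute values
# so that quotes or angle brackets in the data cannot end the attribute.
def img_tag_with_alt(src, alt)
  escaped_src = ERB::Util.html_escape(src)
  escaped_alt = ERB::Util.html_escape(alt)
  %(<img src="#{escaped_src}" alt="#{escaped_alt}" />)
end

puts img_tag_with_alt("/logo.png", 'The "best" <logo>')
```

Because the double quotes come out as &quot;, the alt value stays inside its attribute as long as the attribute itself is written with double quotes.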