Forum: Ruby on Rails html special characters. h() failure.

Neil D. (Guest)
on 2006-01-27 10:21
(Received via mailing list)
I was trying to convert some text containing the (r) character so that
character \xAE is replaced with &reg;

h(@item.description) didn't do anything.  I need to use
@item.description.gsub(/\xAE/,'&reg;') for it to work.
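
A minimal sketch of that workaround as a reusable helper, assuming a current Ruby where strings are UTF-8 and the registered sign is the code point U+00AE rather than the raw byte \xAE. The names NAMED_ENTITIES and h_with_entities are made up for illustration and are not part of Rails or ERB:

```ruby
require 'erb'

# Hypothetical lookup table; extend it with whatever symbols you need.
NAMED_ENTITIES = {
  "\u00AE" => '&reg;',   # registered sign (byte 0xAE in Latin-1)
  "\u00A9" => '&copy;',  # copyright sign
  "\u2122" => '&trade;'  # trade mark sign
}.freeze

# Escape the markup characters first, then swap the listed symbols for
# named entities, so the "&" of each entity is not escaped a second time.
def h_with_entities(text)
  escaped = ERB::Util.html_escape(text.to_s)
  escaped.gsub(Regexp.union(NAMED_ENTITIES.keys)) { |ch| NAMED_ENTITIES[ch] }
end
```

The order matters: running the substitution before the escape would turn &reg; into &amp;reg;.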

I think the h() function should be able to do all the codes that are
available.

Regards Neil.
Francois P. (Guest)
on 2006-01-27 11:15
(Received via mailing list)
Hi
I have text that I get from the user that is stored in the database
after escaping the html.

I want to display this text in the view with the markup (this is easy),
but I also want to display it in the alt attribute of an image, where I
would like all markup stripped out.

I'm hoping someone can point me in the direction of an existing function
or helper method so I don't have to reinvent the wheel.

Thanks in advance,

Francois
Bob S. (Guest)
on 2006-01-27 11:24
(Received via mailing list)
Onur T. (Guest)
on 2006-01-27 12:06
(Received via mailing list)
RedCloth's HTML filter is very capable. You can strip all HTML tags,
or define which tags and attributes (like alt, src etc.) may remain.
But it's a private RedCloth method, so you either have to make it
public or go through the RedCloth filters. Or just use the fragment
below, which I extracted from RedCloth. I think it's self-explanatory:
the tags in the BASIC_TAGS hash will be kept, all others will be removed.

(this is an extension to the String class)

class String

BASIC_TAGS = {
        'a' => ['href', 'title'],
        'img' => ['src', 'alt', 'title', 'align', 'width', 'height', 'border', 'class'],
        'br' => [],
        'i' => nil,
        'u' => nil,
        'b' => nil,
        'pre' => nil,
        'kbd' => nil,
        'code' => ['lang'],
        'cite' => nil,
        'strong' => nil,
        'em' => nil,
        'ins' => nil,
        'sup' => nil,
        'sub' => nil,
        'del' => nil,
        'table' => nil,
        'tr' => nil,
        'td' => ['colspan', 'rowspan'],
        'th' => nil,
        'ol' => nil,
        'ul' => nil,
        'li' => nil,
        'p' => nil,
        'h1' => nil,
        'h2' => nil,
        'h3' => nil,
        'h4' => nil,
        'h5' => nil,
        'h6' => nil,
        'blockquote' => ['cite']
    }

    def self.clean_html!( text, tags = BASIC_TAGS )
        text.gsub!( /<!\[CDATA\[/, '' )
        text.gsub!( /<(\/*)(\w+)([^>]*)>/ ) do
            raw = $~
            tag = raw[2].downcase
            if tags.has_key? tag
                pcs = [tag]
                pcs << "rel=\"nofollow\"" if tag=='a'
                tags[tag].each do |prop|
                    ['"', "'", ''].each do |q|
                        q2 = ( q != '' ? q : '\s' )
                        if raw[3] =~ /#{prop}\s*=\s*#{q}([^#{q2}]+)#{q}/i
                            attrv = $1
                            next if tag != 'img' and prop == 'src' and attrv !~ /^http/
                            # use attrv here: $1 is reset whenever the !~ test above runs
                            pcs << "#{prop}=\"#{attrv.gsub('"', '\\"')}\""
                            break
                        end
                    end
                end if tags[tag]
                "<#{raw[1]}#{pcs.join " "}>"
            else
                " "
            end
        end
    end

    def self.clean_html( text, tags = BASIC_TAGS)
      str = text.dup
      clean_html!(str,tags)
      str
    end

    # Instance form: returns a cleaned copy and leaves the receiver untouched.
    def clean_html( tags = BASIC_TAGS )
      self.class.clean_html(self, tags)
    end
end
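
For the alt-text case in the question, where every tag should go rather than a whitelist being kept, a much smaller sketch may be enough. This is not taken from RedCloth; plain_alt_text is a hypothetical helper name, and the regex is a naive one that treats anything between "<" and ">" as a tag, so a real HTML parser would be more robust on messy input:

```ruby
require 'erb'

# Hypothetical helper: drop anything that looks like a tag, collapse the
# leftover whitespace, then escape the result for use inside an attribute.
def plain_alt_text(html)
  text = html.to_s.gsub(/<[^>]*>/, ' ').gsub(/\s+/, ' ').strip
  ERB::Util.html_escape(text)
end
```

In a view this could be used as alt="<%= plain_alt_text(@item.description) %>", assuming @item.description holds the stored markup.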
Ben M. (Guest)
on 2006-01-27 18:39
(Received via mailing list)
Yeah, someone posted yesterday that html_escape only replaces "<", ">",
and "&". I couldn't believe that but went and verified it in the ERB
sourcecode. Seems a might bit naive to me.... it doesn't even replace
quotes (note to self: never use ERB to replace attribute values).

Anyway, the html_escape method is just a chained gsub... you could just
override that and add a bunch more chars to the chain... and then share
it with us all! ;-)

b
Francois B. (Guest)
on 2006-01-28 01:09
(Received via mailing list)
Hi !

2006/1/27, Ben M. <removed_email_address@domain.invalid>:
> Anyway, the html_escape method is just a chained gsub... you could just
> override that and add a bunch more chars to the chain... and then share
> it with us all! ;-)

Hmm, that would be a bad idea.  The purpose of html_escape is to
ESCAPE bad characters, not do translations.  If you want that, look
into the textilize helper method, or textilize_without_paragraph.

Hope that helps,
François
Ben M. (Guest)
on 2006-01-28 01:39
(Received via mailing list)
Francois B. wrote:
> 2006/1/27, Ben M. <removed_email_address@domain.invalid>:
>>Anyway, the html_escape method is just a chained gsub... you could just
>>override that and add a bunch more chars to the chain... and then share
>>it with us all! ;-)
>
> Hmm, that would be a bad idea.  The purpose of html_escape is to
> ESCAPE bad characters, not do translations.  If you want that, look
> into the textilize helper method, or textilize_without_paragraph.
>

I don't follow you. I'm not talking about "translations". I'm saying
that there are a bunch more potentially "bad" characters than just gt,
lt, and amp. The purpose of the html_escape method is to *escape* any
characters in the input text to their appropriate x/html versions.

I'm simply saying that whoever wrote that method should be *at least*
escaping quotes... and probably apostrophes. Most everything else one
could live without, but as the OP pointed out, it would be nice to have
another version of (or an option passed to) html_escape to do things
like copyright (c), registered (r), etc. That might be more
textilize-territory, but well, we'd probably need to get into wrassling
mode then. (that's Amurican for we'd need to argue some more)

For that matter, I would propose that the html_escape method be removed.
Instead, the default behavior of ERB should be to replace any and all
potentially problematic characters with the appropriate entities. If,
for some reason, the user does not desire this, then they should use
something like a "no_escape" ("no"??) method to override the default
escaping. It would also be good to have an "override for this file"
method so that you can just turn it off for e.g. email templates.

I find it very amusing that the agile book counsels that you should
almost always have that "h" in your erb outs... it's easy to miss...
make sure you don't forget it! Doesn't sound particularly DRY to me.

But actually, I'm thinking I don't really want to stick with ERB too
long anyway... templating is so nineties... I'm planning on spending
some quality time with rexml, markaby, and xx.

b
Alex Y. (Guest)
on 2006-01-28 19:39
(Received via mailing list)
Ben M. wrote:
> I don't follow you. I'm not talking about "translations". I'm saying
>  that there are a bunch more potentially "bad" characters than just
> gt, lt, and amp.
No there aren't - the only other potentially bad character is &quot;,
and that's only ever (potentially) a problem in attribute values.  If
you're having problems with *any other* character, there's a problem
with character set mismatches somewhere in your application.
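
To make the attribute-value case concrete, here is a small sketch. Both helpers are hypothetical and written only for this comparison; neither is the actual ERB source:

```ruby
# Hypothetical helper that escapes only &, < and >.
def escape_without_quotes(s)
  s.to_s.gsub(/&/, '&amp;').gsub(/>/, '&gt;').gsub(/</, '&lt;')
end

# Hypothetical helper that additionally escapes the double quote.
def escape_with_quotes(s)
  escape_without_quotes(s).gsub(/"/, '&quot;')
end

# User input containing a double quote, placed inside a quoted attribute.
value = 'x" onmouseover="alert(1)'

unsafe = %(<img alt="#{escape_without_quotes(value)}">)
safe   = %(<img alt="#{escape_with_quotes(value)}">)

puts unsafe  # the quote ends the alt value early and a new attribute appears
puts safe    # the whole input stays inside the alt value
```

In element content the unescaped quote is harmless. It only matters once the value sits between attribute quotes.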

> The purpose of the html_escape method is to *escape* any characters
> in the input text to their appropriate x/html versions.
Which it does, with the arguable exception of &quot;.

Think about what would be needed for it to do any more than it does.  In
order to be able to translate any of the other characters meaningfully
to the HTML escaped equivalent, you need to know which character set
you're coming from, so you need to do a conversion to an unambiguous
base set anyway.  For example:  &Aacute; is the capital A acute letter.
In latin1, it's the byte 0xC1.  In UTF-8, it's the two bytes 0xC3 0x81.
If you thought you were in latin1, but your data was actually utf-8,
you'd end up with the rather nice sequence &Atilde; followed by the
control character 0x81.  You could hypothetically do:

def new_html_escape(str, charset)
   # Iconv.iconv(to, from, *strs) returns an array of converted strings
   h( Iconv.iconv('utf-8', charset, str).join )
end

But if you've got enough information to make that work, why not just
arrange for the data to be in the right character set in the first
place, and avoid overcomplicating what only needs to be a simple method?
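
The mismatch described above can be reproduced with a short sketch. It assumes a current Ruby, where Iconv is no longer in the standard library and String#encode does the transcoding. escape_from_charset is a hypothetical name used only here:

```ruby
require 'erb'

# Hypothetical helper: label the raw bytes with the charset we believe
# they are in, transcode to UTF-8, then escape the markup characters.
def escape_from_charset(bytes, charset)
  utf8 = bytes.dup.force_encoding(charset).encode('UTF-8')
  ERB::Util.html_escape(utf8)
end

latin1_bytes = "\xC1".b      # capital A acute as stored in Latin-1
utf8_bytes   = "\xC3\x81".b  # the same letter as stored in UTF-8

# Correct guess: one byte in, one character out (U+00C1).
right = escape_from_charset(latin1_bytes, 'ISO-8859-1')

# Wrong guess: the UTF-8 bytes are read as two Latin-1 characters,
# A tilde (U+00C3) followed by the control character U+0081.
wrong = escape_from_charset(utf8_bytes, 'ISO-8859-1')

puts right.codepoints.inspect
puts wrong.codepoints.inspect
```

The escaping step is identical in both calls. Only the charset guess differs, which is why the fix belongs where the data enters the application rather than in the escape method.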

> Instead, the default behavior of ERB should be to replace any and all
> potentially problematic characters with the appropriate entities. If,
> for some reason, the user does not desire this, then they should use
> something like a "no_escape" ("no"??) method to override the default
> escaping.

Just...  no.  There are just as many cases where you *don't* want
escaping to happen as those where you do.  Think of all those <%= render
:partial => ... %> and <%= link_to ... %> that you'd have to turn
escaping off for.  Just as non-DRY.
Ben M. (Guest)
on 2006-01-28 20:00
(Received via mailing list)
Ok, you make valid points... I take it all back... except that
html_escape should do &quot; too. We agree on that. :-) And actually, I
think &apos; would be good too, since that is a valid char for enclosing
attributes.
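
Something like the following would cover both quote characters in one pass. It is only a sketch with made-up names (ESCAPES, html_escape_all_quotes), and it emits the numeric reference &#39; for the apostrophe because &apos; is an XML entity that HTML 4 does not define:

```ruby
# Hypothetical escape table covering the markup characters and both quotes.
ESCAPES = {
  '&' => '&amp;',
  '<' => '&lt;',
  '>' => '&gt;',
  '"' => '&quot;',
  "'" => '&#39;'
}.freeze

# A single gsub over a character class avoids the ordering concerns of a
# chained gsub, where "&" has to be handled before the other replacements.
def html_escape_all_quotes(s)
  s.to_s.gsub(/[&<>"']/) { |ch| ESCAPES[ch] }
end
```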

b
Philip R. (Guest)
on 2006-01-29 01:13
(Received via mailing list)
Ben M. wrote:
> Yeah, someone posted yesterday that html_escape only replaces "<", ">",
> and "&". I couldn't believe that but went and verified it in the ERB
> sourcecode. Seems a might bit naive to me.... it doesn't even replace
> quotes (note to self: never use ERB to replace attribute values).

Which version of ERB are you looking at? My copy (Ruby 1.8.2) does
replace quotes:

def html_escape(s)
   s.to_s.gsub(/&/, "&amp;").gsub(/\"/, "&quot;").
     gsub(/>/, "&gt;").gsub(/</, "&lt;")
end

According to the Ruby CVS [1], html_escape has been unchanged for over
three years.

   1. http://www.ruby-lang.org/cgi-bin/cvsweb.cgi/ruby/lib/erb.rb

Phil

--
Philip R.
http://tzinfo.rubyforge.org/ -- DST-aware timezone library for Ruby
Ben M. (Guest)
on 2006-01-29 02:07
(Received via mailing list)
OMFG.... I looked right at that but the gsub(/\"/, "&quot;") bit was
temporarily invisible... I plead brain damage... I'm just going to crawl
back in my hole now...

:-#

b