Ruby "htmlentities" replacement: code review please!


#1

Hi Railers,

For some time now I’ve been looking for a decent Rails equivalent of
PHP’s “htmlentities” function, because ERB’s html_escape (more commonly
called as just “h”, e.g. <%=h @somevariable %>) just doesn’t go far
enough for me.

Back in PHP land, I actually had an extended version of the htmlentities
command to deal with all kinds of crazy characters that appear if you
copy and paste into a CMS from Word. So anyway, given the apparent lack
of a function to do this in Rails, and because I’m not particularly
impressed with ERB’s html_escape, I decided to do something about it and
roll my own.

Here’s my code: http://rafb.net/paste/results/9lI1hc62.html

It defines an htmlentities function, designed to be included in a helper
or in your environment (that’s what I’m doing with it). Then it
overrides the h function to call htmlentities instead of ERB’s
html_escape, which hopefully means you can just drop this in and your
app will start using it.
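The pasted code is not reproduced in this thread, so the following is only a minimal sketch of the approach described above, not the original implementation. The module name `EntityHelper` is hypothetical, and the sketch assumes the input is valid UTF-8: it escapes the HTML special characters with ERB’s html_escape, then rewrites every non-ASCII character as a numeric character reference.

```ruby
require "erb"

# Hypothetical helper module; the method names follow the post,
# but the body is an assumption, not the pasted code.
module EntityHelper
  # Escape &, <, > and quotes first, then turn each non-ASCII
  # character into a numeric reference such as &#8220;.
  # Assumes text is a valid UTF-8 string.
  def htmlentities(text)
    escaped = ERB::Util.html_escape(text.to_s)
    escaped.gsub(/[^[:ascii:]]/) { |char| format("&#%d;", char.ord) }
  end

  # Override h so existing templates pick up the new behaviour.
  def h(text)
    htmlentities(text)
  end
end

helper = Object.new.extend(EntityHelper)
puts helper.h("\u201CTom & Jerry\u201D")
```

Because the output is pure ASCII, it displays the same way whatever charset the page declares, which is the main attraction of this route.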

I’m offering this up for several reasons:

  1. Because I know I’m not the only person who wants such functionality,
    and I can’t seem to find anyone else who has written this yet.

  2. For code review - is there anything wrong with it? Anything missing?
    Anything that could be done more efficiently? I’m hardly a Ruby / Rails
    guru, so would really appreciate some second opinions here!

I haven’t used this on a production site yet (I only wrote it this
morning), but my thinking is that this code coupled with Rails’ caching
should solve the problem in a nice and efficient manner. This probably
isn’t the ideal solution, since for sites in non-Western alphabets (e.g.
Japanese, Hebrew, Arabic etc.) it probably doesn’t help, but the
thinking is that this command should be at least as good as PHP’s
htmlentities command for most Western alphabet users… I hope! :)

So, any comments? Feel free to use the code anywhere you like, but
possibly just wait until a few other folks have looked it over, just in
case there’s something heinously wrong with it! It comes with no
guarantee of suitability for any purpose whatsoever, and you use it
entirely at your own risk.

Regardless, if folks think it’s useful, I’ll put it online somewhere
more permanent than RAFB NoPaste!

Cheers,

~Dave

Dave S.
Rent-A-Monkey Website Development
Web: http://www.rentamonkey.com/


#2

On 1/18/06, Dave S. removed_email_address@domain.invalid wrote:

ERB’s html_escape, I decided to do something about it and roll my own.


You might want to post this to rails-core.


Kyle M.
Chief Technologist
E Factor Media // FN Interactive
removed_email_address@domain.invalid
1-866-263-3261


#3

On 18.1.2006, at 17:05, Dave S. wrote:

For some time now I’ve been looking for a decent Rails equivalent of
PHP’s “htmlentities” command, because ERB’s html_escape (or more
commonly called as just “h”, eg. <%=h @somevariable %> ) just doesn’t go
far enough for me.

Why not? If you have your encodings set up correctly in web pages or
HTTP headers, everything should just work. What doesn’t work?

Besides, your solution assumes one specific single-byte encoding
(there are many single-byte western encodings). This is very wrong…

izidor
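To illustrate the point about encodings, here is a small sketch, assuming the page is served as UTF-8 (the header value below is illustrative, not taken from the thread). With a matching charset declared, html_escape only has to deal with the HTML special characters; curly quotes and other non-ASCII characters can pass through untouched.

```ruby
require "erb"

# Assumed header value for a page served as UTF-8.
content_type = "text/html; charset=utf-8"

text = "\u201CTom & Jerry\u201D <b>"
escaped = ERB::Util.html_escape(text)

# Only &, < and > are rewritten; the curly quotes are left alone.
puts content_type
puts escaped
```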


#4

On Thursday 19 Jan 2006 11:05, Izidor J. wrote:

Why not? If you have your encodings set-up correctly in web pages or
http headers, everything should just work. What doesn’t work?
Besides, your solution assumes one specific single-byte encoding
(there are many single-byte western encodings). This is very wrong…

OK, fair enough, I do see your point - maybe this is the wrong approach
to the problem, and a throwback to my days of PHP.

I nearly always use iso-8859-1 for my sites, which means that when
people paste text into any kind of CMS from Word (with its curly quotes,
long hyphens etc.), the page contains invalid characters unless I strip
those out or replace them.

I guess I should just use UTF-8 instead and ditch my allegiance to
iso-8859-1.
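For anyone making the same switch, here is a rough sketch of converting existing text to UTF-8. It assumes the stored bytes are really Windows-1252 (which is what Word-style curly quotes in an "iso-8859-1" database usually are); the function name `cp1252_to_utf8` is made up for the example.

```ruby
# Hypothetical helper: reinterpret raw bytes as Windows-1252 and
# transcode them to UTF-8. Raises if a byte is undefined in CP1252.
def cp1252_to_utf8(bytes)
  bytes.b.encode("UTF-8", "Windows-1252")
end

# 0x93/0x94 are curly double quotes, 0x96 an en dash and
# 0x92 a curly apostrophe in Windows-1252.
raw = "\x93Hello\x94 \x96 it\x92s".b
converted = cp1252_to_utf8(raw)

puts converted
puts converted.encoding
```

Once the data is UTF-8 and the pages declare it, the plain h helper is enough for escaping.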

See, this is why I asked for the code review… thanks for the shove in
the right direction, and d’oh that for some stupid reason I didn’t think
of that sooner, guess I was just stuck in my old ways there! :D

~Dave

Dave S.
Rent-A-Monkey Website Development
Web: http://www.rentamonkey.com/


#5

Dave S. wrote:

I guess I should just use UTF-8 instead and ditch my allegiance to iso-8859-1.

You’d probably be better off shifting to CP1252 (or 1250, come to think
of it) - it’s pretty much ISO-8859-1 with extra printable characters
(curly quotes, long dashes and so on) in the 0x80-0x9F range, and it is
still single-byte, so your existing data should be OK.
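A quick way to check how CP1252 and ISO-8859-1 relate is to decode the same bytes both ways. This is only a sketch using Ruby’s built-in transcoding, with sample bytes chosen for illustration: ASCII and accented Latin letters come out identical, while the 0x80-0x9F range holds control characters in ISO-8859-1 and printable punctuation in CP1252.

```ruby
# Sample bytes chosen for illustration.
accented = "caf\xE9".b      # 0xE9 is e-acute in both encodings
word_quotes = "\x93hi\x94".b # 0x93/0x94 differ between the two

same_for_accents =
  accented.encode("UTF-8", "ISO-8859-1") ==
  accented.encode("UTF-8", "Windows-1252")

as_latin1 = word_quotes.encode("UTF-8", "ISO-8859-1")
as_cp1252 = word_quotes.encode("UTF-8", "Windows-1252")

puts same_for_accents
# Show the code points each decoding produces.
puts as_latin1.codepoints.map { |c| format("U+%04X", c) }.join(" ")
puts as_cp1252.codepoints.map { |c| format("U+%04X", c) }.join(" ")
```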