This is Rails 2.3 on Ruby 1.8.7.
Someone before me coded a bunch of character sanitizations in our app.
These had no documentation and no unit tests.
Our app is in English and all our clients use the Latin alphabet.
The sanitization is for the infamous MS Office “smart quotes”, that is:
left + right single quote ( ‘ ) ( ’ ) => ascii single quote ' ( ' )
left + right double quote ( “ ) ( ” ) => ascii quotation mark " ( " )
(“quote” Unicode Characters, Symbols & Entities Search | AmpWhat)
The sanitization code looks like this:
str.gsub! "\342\200\230", "'"
str.gsub! "\342\200\231", "'"
str.gsub! "\342\200\234", '"'
str.gsub! "\342\200\235", '"'
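A small sketch for looking inside one of those strings, assuming the escapes are meant to be the UTF-8 bytes of the curly quotes. The variable names are made up for illustration, and it only uses String#unpack, which exists on 1.8.7:

```ruby
# Each "\NNN" escape is a single byte written in octal (base 8).
left_single = "\342\200\230"

# unpack("C*") lists the raw bytes as unsigned integers.
bytes = left_single.unpack("C*")

# The same bytes written in hexadecimal.
hex = bytes.map { |b| "%02X" % b }

# unpack("U*") decodes the bytes as UTF-8 and returns code points.
code_points = left_single.unpack("U*")

puts bytes.inspect                                   # [226, 128, 152]
puts hex.inspect                                     # ["E2", "80", "98"]
puts code_points.map { |cp| "U+%04X" % cp }.inspect  # ["U+2018"]
```

U+2018 is the left single quotation mark, so the three escapes together spell one character in UTF-8.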
My knowledge of encodings and such is above elementary, but in this
field that really means nothing.
Question 1.
I don’t understand what "\342\200\230" is.
From online research I found that it is the “raw ASCII Octal
representation”.
What does that mean?
Question 2.
A user enters text into a form and submits it. It goes through the
controller, the model, and eventually the DB. At which stage is a
Unicode character converted to the above backslash representation? Can
this be tweaked for Rails 2.3 apps?
Question 3.
How can I write a test for the above sanitizations? If I directly paste
the Unicode characters into my UTF-8 encoded sanitation_test.rb, the
assertions fail. They seem to expect the backslash sequences instead.
This feels weird, and it means my assertion must look like this:
assert "Bob\342\200\230s house is red".replace_smart_quotes == "Bob's house is red"
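A minimal sketch of such a check without any test framework. The helper below is a hypothetical stand-in built from the four gsub! calls above, not the app's actual replace_smart_quotes method:

```ruby
# Hypothetical stand-in for the app's sanitizer (assumption: it does
# nothing beyond the four gsub! calls shown earlier).
def replace_smart_quotes(str)
  out = str.dup                    # leave the caller's string untouched
  out.gsub!("\342\200\230", "'")   # left single quote
  out.gsub!("\342\200\231", "'")   # right single quote
  out.gsub!("\342\200\234", '"')   # left double quote
  out.gsub!("\342\200\235", '"')   # right double quote
  out
end

# Writing the input with escapes keeps the test independent of how the
# editor happens to save the file.
input    = "Bob\342\200\231s \342\200\234red\342\200\235 house"
expected = "Bob's \"red\" house"
result   = replace_smart_quotes(input)

puts result == expected
```

The same comparison could sit inside an assert_equal in sanitation_test.rb.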
Something about this doesn’t feel right. I guess it is acceptable if the
answer to 2. is reliable and permanent.
I understand that the answers to these questions may be long and
complex. I will happily read any online resource you send my way that
might help me learn more about the subject.
Thanks,
P.S.
Upgrading to a later Ruby is not an option.