[Repost, with Formatting] Trying to understand unicode character entry, goes into postgres DB backi

Steven_GSHarms · January 5, 2008, 9:19pm

Apologies for the unformatted post earlier; I hit Enter and the web UI
posted, to my chagrin.

Examine the Unicode standard’s code chart collection for “Latin
small letter a with macron”.
This nets U0100.pdf.
“Latin small letter a with macron” appears on the chart as 0101. This
is a hexadecimal number, which gives U+0101 as its code point.
Converting 0101 to decimal gets you 257, which is the same as the HTML
entity code.
The HTML code point is 257. That is, &#257; gives you ā, and 257 != 325
(the number I arrive at below). OK, so I can link this guy back to the
Unicode source. But here’s the question: what’s up with the two broken
values?
Put the &#257; character into a view via Rails that is back-ended by a
Postgres database.
Using script/console, write the collection of models that contain
this accented character to a YAML file.
“Latin small letter a with macron” is stored in a YAML dump of
accented characters as: \xC4\x81
Hm, OK, that’s a start. Somehow 0101 or 257 is linked to C4 81. Let’s
convert those two to decimal and see if a correlation becomes clear (I
know, BTW, that the database that holds that entry is in UTF-8).
C4: 196
81: 129
196+129=325 != 0101. Hm, look at documentation.
Be stumped.
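
For anyone following along, here is a minimal Ruby sketch of the arithmetic
in the steps above. It assumes the database and the dump really are UTF-8,
as noted, and utf8_two_bytes is a hypothetical helper written only for
illustration, not anything from Rails or Ruby itself. It suggests the two
bytes are not meant to be added together: UTF-8 spreads the bits of the
code point across both bytes.

```ruby
# Code point from the chart, written in hexadecimal.
code_point = 0x0101
puts code_point   # 257, the number used in the HTML entity

# Hypothetical helper: UTF-8 encode a code point in the two-byte range
# (U+0080..U+07FF) by hand, using the bit layout 110xxxxx 10xxxxxx.
def utf8_two_bytes(cp)
  raise ArgumentError, "outside two-byte range" unless (0x80..0x7FF).cover?(cp)
  lead  = 0xC0 | (cp >> 6)    # top 5 bits go behind the 110 prefix
  trail = 0x80 | (cp & 0x3F)  # low 6 bits go behind the 10 prefix
  [lead, trail]
end

by_hand = utf8_two_bytes(code_point)
puts by_hand.map { |b| format("%02X", b) }.join(" ")   # C4 81

# Cross-check against Ruby's built-in UTF-8 packing.
built_in = [code_point].pack("U").unpack("C*")
puts built_in.inspect   # [196, 129]
```

Ruby's own pack("U") gives the same two bytes, which is a useful
cross-check on the hand-rolled version.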

I’m working up an application that deals with foreign languages and
I’m trying to make it easy to enter accented characters. I saved some
base data that I entered as a fixture (so that I could re-load it as
a sample when needed) and I noticed that in this YAML file my
accented characters are in this unusual \x##\x## format that bears
little resemblance to the code points that I’ve seen before in code
point charts.
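
Going the other direction, here is a small sketch (again assuming the
dumped bytes are UTF-8) of how the \x##\x## form in a fixture might be
turned back into the code points from the charts. The variable names are
made up for illustration.

```ruby
# The escaped form from the YAML dump, as raw bytes.
dumped = "\xC4\x81"

# Decode the bytes as UTF-8 and list the code points they encode.
code_points = dumped.unpack("U*")

# Format each code point the way the Unicode charts and HTML write it.
labels   = code_points.map { |cp| format("U+%04X", cp) }
entities = code_points.map { |cp| format("&#%d;", cp) }.join

puts labels.inspect   # ["U+0101"]
puts entities         # &#257;
```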

I’ve always been scared to jump into the “How does Unicode work,
really” discussion, but maybe it’s time that I try to sort it out a
bit.

Doubtless people from a more multi-lingual environment
understand this much better than those of us in North America, so I’m
hoping this is a lot easier than I think!