On Aug 6, 2009, at 11:44 AM, Brian C. wrote:
Hmm. Could you try replacing ‘LANG’ with ‘LC_ALL’ globally? A
reread of the setlocale(3) manpage under Linux shows that LANG is only
tried as a last resort, so perhaps your Mac has a higher-priority
environment variable set.
I bet the issue is this line in my .bashrc:
export LC_CTYPE=en_US.UTF-8
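That would explain it. As I read setlocale(3), the lookup order for the LC_CTYPE category is roughly the following (a sketch of the documented precedence, not the libc source):

```ruby
# Rough sketch of the lookup order setlocale(3) describes for the
# LC_CTYPE category: LC_ALL, if set, wins; then the specific LC_*
# variable; LANG is only tried as a last resort.
def effective_ctype(env)
  env["LC_ALL"] || env["LC_CTYPE"] || env["LANG"] || "C"
end

effective_ctype("LC_CTYPE" => "en_US.UTF-8")  # => "en_US.UTF-8"
effective_ctype("LC_CTYPE" => "en_US.UTF-8",
                "LC_ALL"   => "POSIX")        # => "POSIX"
```

So an exported LC_CTYPE would shadow LANG on the Mac even though LANG is unset.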
7-bit), but will fail if they are 8-bit.
Maybe this could be fixed by making the ASCII-8BIT encoding be
compatible with everything, and always give an ASCII-8BIT result. But
that would be saying, in essence, an ASCII-8BIT String is one class of
object, and everything else is another class.
I think I understand what you are saying here. You have a good point
that it would be annoying to have the Encoding of the JPEG you are
building up change from ASCII-8BIT to UTF-8.
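To make that concrete, here is how I understand the current 1.9 concatenation rules to play out (a sketch of observed behavior, not a proposal; the JPEG bytes are made up):

```ruby
# Sketch of how I understand the 1.9 concatenation rules.
jpeg = "\xFF\xD8".force_encoding("ASCII-8BIT")  # pretend JPEG magic bytes

# Appending 7-bit UTF-8 data to a BINARY buffer leaves it BINARY:
jpeg << "comment"
jpeg.encoding                      # => #<Encoding:ASCII-8BIT>

# An *empty* BINARY buffer, though, silently picks up the other
# operand's Encoding, which is the annoyance in question:
buf = "".force_encoding("ASCII-8BIT")
(buf + "héllo").encoding           # => #<Encoding:UTF-8>

# Once the buffer holds 8-bit bytes, mixing in non-ASCII UTF-8 raises:
begin
  jpeg << "héllo"
rescue Encoding::CompatibilityError => e
  e  # incompatible character encodings: ASCII-8BIT and UTF-8
end
```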
- Working with other people’s libraries.
Take REXML as an example. Suppose I decide I want to do this:
doc = REXML::Document.new(src)
Under 1.8, I could do this without worrying.
Really?
What did it do under Ruby 1.8 when fed an XML document that was UTF-16
encoded? Will it read it? When I do searches for content, will it
hand me UTF-16 or UTF-8? These are just some questions that jump to
my mind.
As you’ve said, about the best I can think of is to test it and find
out, only this is Ruby 1.8 I’m talking about here.
Let’s see how it works:
$ ruby -r rexml/document -e 'REXML::Document.new(ARGF.read)' utf16_with_bom.xml
/usr/local/lib/ruby/1.8/rexml/parsers/treeparser.rb:92:in `parse': #<Iconv::InvalidCharacter: "\340\250\274\347\215\257\346\265\245\347\221\241\346\234\276\345\215\257\346\265\245\342\201\203\346\275\256\347\221\245\346\271\264\343\260\257\347\215\257\346\265\245\347\221\241\346\234\276", ["\n"]> (REXML::ParseException)
/usr/local/lib/ruby/1.8/rexml/encodings/ICONV.rb:7:in `conv'
/usr/local/lib/ruby/1.8/rexml/encodings/ICONV.rb:7:in `decode'
/usr/local/lib/ruby/1.8/rexml/source.rb:57:in `encoding='
/usr/local/lib/ruby/1.8/rexml/parsers/baseparser.rb:213:in `pull'
/usr/local/lib/ruby/1.8/rexml/parsers/treeparser.rb:22:in `parse'
/usr/local/lib/ruby/1.8/rexml/document.rb:227:in `build'
/usr/local/lib/ruby/1.8/rexml/document.rb:43:in `initialize'
-e:1:in `new'
-e:1
...
"\n"
Line:
Position:
Last 80 unconsumed characters:
<sometag>Some Content</sometag>
from /usr/local/lib/ruby/1.8/rexml/document.rb:227:in `build'
from /usr/local/lib/ruby/1.8/rexml/document.rb:43:in `initialize'
from -e:1:in `new'
from -e:1
Ah, it just tells me my data is invalid. It’s not though:
$ iconv -f UTF-16BE -t UTF-8 < utf16_with_bom.xml
<?xml version="1.0" encoding="UTF-16BE"?>
<sometag>Some Content</sometag>
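Incidentally, for anyone following along at home, a test file like my utf16_with_bom.xml can be recreated from Ruby itself. A quick sketch (the filename and content just mirror the transcript above):

```ruby
xml = "<?xml version='1.0' encoding='UTF-16BE'?><sometag>Some Content</sometag>"

File.open("utf16_with_bom.xml", "wb") do |f|
  f << "\xFE\xFF".force_encoding("ASCII-8BIT")  # the UTF-16BE byte order mark
  f << xml.encode("UTF-16BE")                   # transcode the UTF-8 literal
end
```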
Ruby 1.9 can read it:
$ ruby_dev -r rexml/document -e 'puts REXML::Document.new(ARGF.read.force_encoding("BINARY")).to_s' utf16_with_bom.xml
<?xml version='1.0' encoding='UTF-16BE'?>
<sometag>Some Content</sometag>
It looks like it’s supposed to work in Ruby 1.8 too and I’ve just hit a
bug. At least, if I’m reading the source right. I had to check.
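As far as I can tell, the fix boils down to checking the byte order mark on the raw bytes before any transcoding happens. A hypothetical sniffer (my sketch; not REXML's actual code, and the method name is made up):

```ruby
BOM_BE = "\xFE\xFF".force_encoding("ASCII-8BIT")
BOM_LE = "\xFF\xFE".force_encoding("ASCII-8BIT")

# Hypothetical BOM check; REXML's real detection lives in its Source
# and encoding support code.
def sniff_encoding(data)
  raw = data.dup.force_encoding("ASCII-8BIT")
  if    raw.start_with?(BOM_BE) then "UTF-16BE"
  elsif raw.start_with?(BOM_LE) then "UTF-16LE"
  else  "UTF-8"  # no BOM: trust the XML declaration, or default
  end
end

sniff_encoding("\xFE\xFF\x00<")         # => "UTF-16BE"
sniff_encoding("<?xml version='1.0'?>") # => "UTF-8"
```

Which is why forcing the input to BINARY first, as in the 1.9 command above, is the safe move: the check has to see bytes, not characters.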
Anyway, the point of all this is that it really isn’t any easier, for
me, to reason about Ruby 1.8 encoding behavior. Ruby 1.9 didn’t
invent character encodings, it just started paying attention to them
as we all should have been doing all along. That’s all my opinion, of
course.
James Edward G. II