YAML, UTF-8, TextMate, Notepad

This is not a question but a report on the difficulties I had and the
solution I found with respect to UTF-8, YAML::load, and Ruby/Rails.

Comments are appreciated.


I had been struggling for two days to get UTF-8 working in my Rails app.

I had/have a localization file, lib\locale\de.yml, that had iso-8859-1
encoding. I could not get that to display properly.

Marnen, quite correctly, suggested that I transition to UTF-8. Of course,
I had tried to do that but I could not get the YAML localization file to
load.

What I had done was load the ANSI (i.e. iso-8859-1) localization file
into Notepad, convert it to UTF-8, and save that file.
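
The same conversion can be done without an editor. The following is only a sketch, assuming Ruby 1.9 or later; the helper name and the file paths are hypothetical. It reads the raw bytes, labels them as ISO-8859-1, transcodes to UTF-8, and writes the result with no extra leading bytes.

```ruby
# Hypothetical helper: convert an ISO-8859-1 (Latin-1) file to UTF-8.
# Nothing is added at the start of the output file.
def latin1_to_utf8(source_path, target_path)
  # Read the raw bytes and label them as ISO-8859-1.
  text = File.binread(source_path).force_encoding("ISO-8859-1")
  # Transcode to UTF-8 and write the resulting bytes unchanged.
  File.binwrite(target_path, text.encode("UTF-8"))
end

# Example call (paths are assumptions, not taken from a real project):
# latin1_to_utf8("lib/locale/de.yml", "lib/locale/de.utf8.yml")
```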

Then all my German (de.yml) localizations failed.

It turns out that Notepad places “\xEF\xBB\xBF” at the beginning of the
file to indicate that this is a YAML file.

These three bytes appear to screw up YAML::load.
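
To check whether a given file starts with those three bytes, something like the following could be used. It is a minimal sketch, assuming Ruby 2.0 or later for `String#b`; the constant and method names are hypothetical.

```ruby
# The three bytes found at the start of the Notepad-saved file.
UTF8_LEADING_BYTES = "\xEF\xBB\xBF".b

# Hypothetical helper: true if the file begins with those three bytes.
def starts_with_utf8_marker?(path)
  # Read at most three bytes in binary mode so no decoding takes place.
  File.binread(path, 3) == UTF8_LEADING_BYTES
end
```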

Gimme a break!

Not only does Notepad put in these indicator bytes … so does
TextMate.

In fact, TextMate will happily determine that your non-"\xEF\xBB\xBF"
file is a UTF-8 file and will automatically reinsert the indicator
bytes. I find this rather hysterical (not in a good way) since in
http://blog.macromates.com/2005/handling-encodings-utf-8/ one of the
authors of TextMate wrote “Property 3 turns out to be attractive because
it means we can heuristically recognize UTF-8 with a near 100% certainty
by checking if the file is valid. Some software think it’s a good idea
to embed a BOM (byte order mark) in the beginning of an UTF-8 file, but
it is not, because the file can already be recognized, and placing a BOM
in the beginning of a file means placing three bytes in the beginning of
the file which a program that use the file may not expect…”.

How thoughtful that TextMate does what the article says it should not
do. If there is a way to turn off that behavior, I can’t find it.
Maybe there’s a TextMate bundle … who knows?

In order to get YAML::load to load the localization, I have to remove
the three indicator bytes. Yuck!

Once I did that, YAML loads happily.
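
An alternative that leaves the file on disk untouched is to skip the three bytes while reading. This is a sketch only, assuming Ruby 1.9 or later, where the "BOM|UTF-8" read mode is available; the helper name is hypothetical, and it is not what the Rails I18n backend does on its own.

```ruby
require "yaml"

# Hypothetical helper: load a YAML file, skipping the three leading
# bytes EF BB BF if they are present.
def load_yaml_ignoring_bom(path)
  # "r:BOM|UTF-8" reads the file as UTF-8 and drops a leading marker.
  text = File.open(path, "r:BOM|UTF-8") { |f| f.read }
  YAML.load(text)
end
```

Files without the leading bytes load the same way, so the helper can be used for every locale file.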


If you store your locales in lib/locale and you use the
AVAILABLE_LOCALES idiom as suggested in
http://rails-i18n.org/wiki/pages/i18n-available_locales then you can use
this in config\initializers\available_locales.rb


# See http://guides.rubyonrails.org/i18n.html

# Get loaded locales conveniently
# See http://rails-i18n.org/wiki/pages/i18n-available_locales
module I18n
  class << self
    def available_locales; backend.available_locales; end
  end

  module Backend
    class Simple
      def available_locales; translations.keys.collect { |l| l.to_s }.sort; end
    end
  end
end

# You need to "force-initialize" loaded locales
I18n.backend.send(:init_translations)

AVAILABLE_LOCALES = I18n.backend.available_locales
RAILS_DEFAULT_LOGGER.debug "* Loaded locales: #{AVAILABLE_LOCALES.inspect}"

# Shnelvar: Remove UTF-8 indicator bytes so that YAML::load works
AVAILABLE_LOCALES.each do |localization_name|
  # localization_name is, e.g., "de"
  localization_name_dot_yml = localization_name + '.yml'
  localization_file_name = File.join('lib/locale', localization_name_dot_yml)
  yaml_str = IO.read(localization_file_name)

  utf_8__3_byte_indicator = "\xEF\xBB\xBF"
  if yaml_str[0..2] == utf_8__3_byte_indicator
    yaml_str = yaml_str[3...yaml_str.size]
    File.open(localization_file_name, "w") { |f| f << yaml_str }
    puts localization_file_name + ' has had the UTF-8 indicator bytes removed'
  end
end
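
On Ruby 1.9 and later, strings carry an encoding, so indexing with [0..2] counts characters rather than bytes. A byte-oriented variant of the cleanup could look like the sketch below. It assumes Ruby 2.0 or later, the constant and method names are hypothetical, and it walks the files in lib/locale directly instead of going through AVAILABLE_LOCALES.

```ruby
UTF8_BOM = "\xEF\xBB\xBF".b

# Hypothetical helper: rewrite the file without its three leading bytes.
# Returns true if they were found and removed, false otherwise.
def strip_utf8_bom!(path)
  bytes = File.binread(path)
  return false unless bytes.start_with?(UTF8_BOM)
  # Keep everything after the first three bytes.
  File.binwrite(path, bytes.byteslice(3..-1))
  true
end

# Assumed layout: locale files live in lib/locale, as in the post.
Dir.glob(File.join("lib", "locale", "*.yml")).each do |file|
  puts "#{file}: removed UTF-8 indicator bytes" if strip_utf8_bom!(file)
end
```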


Suggestions and comments are welcome.

What I had done was load the ANSI (i.e. iso-8859-1) localization file
into Notepad, convert it to UTF-8, and save that file.

<…>

It turns out that Notepad places “\xEF\xBB\xBF” at the beginning of the
file to indicate that this is a YAML file.

This is not to indicate a YAML file (I doubt Notepad knows what YAML is
at all).
This is a byte order mark (BOM): http://en.wikipedia.org/wiki/Byte-order_mark

Gimme a break!

Not only does Notepad put in these indicator bytes … so does
TextMate.
<…>
How thoughtful that TextMate does what the article says it should not
do. If there is a way to turn off that behavior, I can’t find it.
Maybe there’s a TextMate bundle … who knows?

Really? I never saw TextMate do that. Are you sure you did not
just load a file saved elsewhere with a BOM?

Regards,
Rimantas

http://rimantas.com/

How thoughtful that TextMate does what the article says it should not
do. If there is a way to turn off that behavior, I can’t find it.
Maybe there’s a TextMate bundle … who knows?

Really? I never saw TextMate do that. Are you sure you did not
just load a file saved elsewhere with a BOM?

Yes … absolutely certain.

I use a hex editor to remove the BOM … resave.

I examine the file with another hex editor … the BOM is not there.

I go into TextMate … load the file … resave … and the BOM
reappears.

This only happens if TextMate detects UTF-8 characters in the file.
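
The hex-editor check can also be scripted, which makes it easy to compare a file before and after saving it in an editor. This is a small sketch with a hypothetical helper name, not part of the original procedure.

```ruby
# Hypothetical helper: return the first few bytes of a file as hex text.
def leading_bytes_hex(path, count = 4)
  # binread may return nil for an empty file, so fall back to "".
  bytes = File.binread(path, count) || ""
  bytes.unpack("C*").map { |b| format("%02X", b) }.join(" ")
end
```

A file that starts with the BOM followed by "d" would show up as "EF BB BF 64".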

Ralph S. wrote:

How thoughtful that TextMate does what the article says it should not
do. If there is a way to turn off that behavior, I can’t find it.
Maybe there’s a TextMate bundle … who knows?

Really? I never saw TextMate do that. Are you sure you did not
just load a file saved elsewhere with a BOM?

Yes … absolutely certain.

I use a hex editor to remove the BOM … resave.

I examine the file with another hex editor … the BOM is not there.

I go into TextMate … load the file … resave … and the BOM
reappears.

This only happens if TextMate detects UTF-8 characters in the file.

Is there a setting to save as “UTF-8 without BOM” or something?

Best,

Marnen Laibow-Koser
http://www.marnen.org

Marnen Laibow-Koser wrote:

Is there a setting to save as “UTF-8 without BOM” or something?

As I said earlier, if there is a setting, I can’t find it.

TextMate has things called "bundles." These are mini-applications that can
be integrated into TextMate. Someone, somewhere may have figured out
how to do it.

What UTF-8-compliant editor do you use, Marnen?

Marnen Laibow-Koser wrote:

Is there a setting to save as “UTF-8 without BOM” or something?

As I said earlier, if there is a setting, I can’t find it.

TextMate does not save a BOM for UTF-8 files.

Just choose save as, utf-8 and that’s it.

Regards,
Rimantas

http://rimantas.com/

Ralph S. wrote:

Marnen Laibow-Koser wrote:

Is there a setting to save as “UTF-8 without BOM” or something?

As I said earlier, if there is a setting, I can’t find it.

TextMate has things called "bundles." These are mini-applications that can
be integrated into TextMate. Someone, somewhere may have figured out
how to do it.

What UTF-8-compliant editor do you use, Marnen?

I mostly use KomodoEdit, for whatever it’s worth; also sometimes jEdit,
NetBeans, TextWrangler, Eclipse/Aptana…

Best,

Marnen Laibow-Koser
http://www.marnen.org

Rimantas L. wrote:

TextMate does not save a BOM for UTF-8 files.

Just choose save as, utf-8 and that’s it.

Oh, Geez, I feel like a complete idiot …

I am using “e” as the text editor … which the advertising says is
“textmate for windows.”

Sorry!

It is “e” that is saving the BOM.