YAML, UTF-8, TextMate, Notepad

This is not a question but a report on the difficulties I had and the
solution I found with respect to UTF-8, YAML::load, and Ruby/Rails.

Comments are appreciated.


I had been struggling for two days to get UTF-8 working in my Rails app.

I had/have a localization file, lib\locale\de.yml, that had iso-8859-1
encoding. I could not get that to display properly.

Marnen, quite correctly, suggested that I switch to UTF-8. Of course,
I had tried to do that, but I could not get the YAML localization file to
load.

What I had done was load the ANSI (i.e. iso-8859-1) localization file
into Notepad, convert it to UTF-8, and save that file.

Then all my German (de.yml) localizations failed.

It turns out that Notepad places “\xEF\xBB\xBF” at the beginning of the
file to indicate that this is a YAML file.

These three bytes appear to screw up YAML::load.

Gimme a break!

Not only does Notepad put in these indicator bytes … so does
TextMate.

In fact, TextMate will happily determine that your non-"\xEF\xBB\xBF"
file is a UTF-8 file and will automatically reinsert the indicator
bytes. I find this rather hysterical (not in a good way) since in
http://blog.macromates.com/2005/handling-encodings-utf-8/ one of the
authors of TextMate wrote “Property 3 turns out to be attractive because
it means we can heuristically recognize UTF-8 with a near 100% certainty
by checking if the file is valid. Some software think it’s a good idea
to embed a BOM (byte order mark) in the beginning of an UTF-8 file, but
it is not, because the file can already be recognized, and placing a BOM
in the beginning of a file means placing three bytes in the beginning of
the file which a program that use the file may not expect…”.

How thoughtful that TextMate does what the article says it should not
do. If there is a way to turn off that behavior, I can’t find it.
Maybe there’s a TextMate bundle … who knows?

In order to get YAML::load to load the localization, I have to remove
the three indicator bytes. Yuck!

Once I did that, YAML loads happily.
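
For anyone on Ruby 1.9 or later, a possible alternative to editing the
file is to let Ruby discard the marker bytes while reading. This is only
a minimal sketch: the helper name load_yaml_ignoring_bom is made up for
illustration, and it assumes the locale file really is UTF-8.

```ruby
require 'yaml'

# Hypothetical helper (not part of Rails or I18n): read a UTF-8 YAML file,
# letting Ruby drop a leading BOM if one is present, then parse the text.
def load_yaml_ignoring_bom(path)
  # The "BOM|UTF-8" encoding name asks Ruby to skip a UTF-8 BOM on read.
  text = File.open(path, 'r:BOM|UTF-8') { |f| f.read }
  YAML.load(text)
end
```

Something like load_yaml_ignoring_bom('lib/locale/de.yml') should then
return the same hash whether or not the editor added the three bytes.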


If you store your locales in lib/locale and you use the
AVAILABLE_LOCALES idiom as suggested in
http://rails-i18n.org/wiki/pages/i18n-available_locales then you can use
this in config\initializers\available_locales.rb


# See http://guides.rubyonrails.org/i18n.html

# Get loaded locales conveniently
# See http://rails-i18n.org/wiki/pages/i18n-available_locales
module I18n
  class << self
    def available_locales; backend.available_locales; end
  end

  module Backend
    class Simple
      def available_locales
        translations.keys.collect { |l| l.to_s }.sort
      end
    end
  end
end

# You need to "force-initialize" loaded locales
I18n.backend.send(:init_translations)

AVAILABLE_LOCALES = I18n.backend.available_locales
RAILS_DEFAULT_LOGGER.debug "* Loaded locales: #{AVAILABLE_LOCALES.inspect}"

# Shnelvar: Remove UTF-8 indicator bytes so that YAML::load works
AVAILABLE_LOCALES.each do |localization_name|
  # localization_name is, e.g. "de"
  localization_name_dot_yml = localization_name + '.yml'
  localization_file_name = File.join('lib/locale', localization_name_dot_yml)
  yaml_str = IO.read(localization_file_name)

  utf_8__3_byte_indicator = "\xEF\xBB\xBF"
  if yaml_str[0..2] == utf_8__3_byte_indicator
    # Drop the first three bytes and write the file back without them
    yaml_str = yaml_str[3...yaml_str.size]
    File.open(localization_file_name, "w") { |f| f << yaml_str }
    puts localization_file_name + ' has had the UTF-8 indicator bytes removed'
  end
end
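
One caveat about the loop above: it slices the string with [0..2], which
is byte-based on Ruby 1.8. On Ruby 1.9 and later strings carry an
encoding and that slice counts characters, so the comparison may not
match. Here is a minimal sketch that works on raw bytes instead. The
helper name strip_utf8_bom! is hypothetical, and it assumes Ruby 2.0 or
later (for String#b).

```ruby
# Hypothetical helper: remove a leading UTF-8 BOM from a file in place.
# Returns true if a BOM was found and removed, false otherwise.
def strip_utf8_bom!(path)
  bom = "\xEF\xBB\xBF".b          # the three marker bytes as a binary string
  bytes = File.binread(path)      # raw bytes, no encoding conversion
  return false unless bytes.start_with?(bom)

  # Write everything after the first three bytes back to the same file.
  File.binwrite(path, bytes.byteslice(3..-1))
  true
end
```

Inside the loop, calling strip_utf8_bom!(localization_file_name) could
stand in for the read, compare, and rewrite steps.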


Suggestions and comments are welcome.

What I had done was load the ANSI (i.e. iso-8859-1) localization file
into Notepad, convert it to UTF-8, and save that file.

<…>

It turns out that Notepad places “\xEF\xBB\xBF” at the beginning of the
file to indicate that this is a YAML file.

This is not to indicate a YAML file (I doubt Notepad knows what YAML is
at all).
It is a byte order mark (BOM): http://en.wikipedia.org/wiki/Byte-order_mark
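
To see where the three bytes come from: the byte order mark is the code
point U+FEFF, and encoding it as UTF-8 gives exactly those bytes. A small
sketch (the method name utf8_bom_hex is just for illustration):

```ruby
# Encode the BOM code point U+FEFF as UTF-8 and return its bytes in hex.
def utf8_bom_hex
  "\uFEFF".encode('UTF-8').bytes.map { |b| format('%02X', b) }.join(' ')
end

puts utf8_bom_hex  # prints "EF BB BF"
```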

Gimme a break!

Not only does Notepad put in these indicator bytes … so does
TextMate.
<…>
How thoughtful that TextMate does what the article says it should not
do. If there is a way to turn off that behavior, I can’t find it.
Maybe there’s a TextMate bundle … who knows?

Really? I have never seen TextMate do that. Are you sure you did not
just load a file that was saved elsewhere with a BOM?

Regards,
Rimantas

http://rimantas.com/

How thoughtful that TextMate does what the article says it should not
do. If there is a way to turn off that behavior, I can’t find it.
Maybe there’s a TextMate bundle … who knows?

Really? I have never seen TextMate do that. Are you sure you did not
just load a file that was saved elsewhere with a BOM?

Yes … absolutely certain.

I use a hex editor to remove the BOM … resave.

I examine the file with another hex editor … the BOM is not there.

I go into TextMate … load the file … resave … and the BOM
reappears.

This only happens if TextMate detects UTF-8 characters in the file.
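
The same check can be done without a hex editor. A rough sketch in Ruby,
where both helper names (leading_bytes_hex and utf8_bom?) are made up for
illustration:

```ruby
# Hypothetical helper: return the first few bytes of a file as a hex string.
def leading_bytes_hex(path, count = 3)
  # binread returns nil for an empty file, so fall back to an empty string.
  (File.binread(path, count) || '').bytes.map { |b| format('%02X', b) }.join(' ')
end

# Hypothetical helper: true if the file starts with the UTF-8 BOM bytes.
def utf8_bom?(path)
  leading_bytes_hex(path) == 'EF BB BF'
end
```

Running utf8_bom? on the file before and after saving it in the editor
should show whether the editor is the one adding the bytes.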

Ralph S. wrote:

How thoughtful that TextMate does what the article says it should not
do. If there is a way to turn off that behavior, I can’t find it.
Maybe there’s a TextMate bundle … who knows?

Really? I have never seen TextMate do that. Are you sure you did not
just load a file that was saved elsewhere with a BOM?

Yes … absolutely certain.

I use a hex editor to remove the BOM … resave.

I examine the file with another hex editor … the BOM is not there.

I go into TextMate … load the file … resave … and the BOM
reappears.

This only happens if TextMate detects UTF-8 characters in the file.

Is there a setting to save as “UTF-8 without BOM” or something?

Best,

Marnen Laibow-Koser
http://www.marnen.org

Marnen Laibow-Koser wrote:

Is there a setting to save as “UTF-8 without BOM” or something?

As I said earlier, if there is a setting, I can’t find it.

TextMate has things called “bundles.” These are mini-applications that can
be integrated into TextMate. Someone, somewhere may have figured out
how to do it.

What UTF-8-compliant editor do you use, Marnen?

Marnen Laibow-Koser wrote:

Is there a setting to save as “UTF-8 without BOM” or something?

As I said earlier, if there is a setting, I can’t find it.

TextMate does not save a BOM for UTF-8 files.

Just choose Save As, pick UTF-8, and that’s it.

Regards,
Rimantas

http://rimantas.com/

Ralph S. wrote:

Marnen Laibow-Koser wrote:

Is there a setting to save as “UTF-8 without BOM” or something?

As I said earlier, if there is a setting, I can’t find it.

TextMate has things called “bundles.” These are mini-applications that can
be integrated into TextMate. Someone, somewhere may have figured out
how to do it.

What UTF-8-compliant editor do you use, Marnen?

I mostly use KomodoEdit, for whatever it’s worth; also sometimes jEdit,
NetBeans, TextWrangler, Eclipse/Aptana…

Best,

Marnen Laibow-Koser
http://www.marnen.org

Rimantas L. wrote:

TextMate does not save a BOM for UTF-8 files.

Just choose Save As, pick UTF-8, and that’s it.

Oh, Geez, I feel like a complete idiot …

I am using “e” as the text editor … which the advertising says is
“textmate for windows.”

Sorry!

It is “e” that is saving the BOM.
