Unicode Source Files

Is the DLR going to be fixed so that it properly supports Unicode
source files or is this an issue with IronRuby? If you attempt to
create a new Code File with Visual Studio 2008 and call it test.rb and
then execute it with:

ScriptRuntime runtime = IronRuby.Ruby.CreateRuntime();
runtime.ExecuteFile( “test.rb” );

it blows up on the Unicode byte-order marker with:

Unhandled Exception: Microsoft.Scripting.SyntaxErrorException: Invalid
character ‘ï’ in expression
at Microsoft.Scripting.ErrorSink.Add(SourceUnit source, String
message, SourceSpan span, Int32 errorCode, Severity severity) in
C:\Users\ted\Desktop\IronRuby\src\Microsoft.Scripting\ErrorSink.cs:line
34
at Microsoft.Scripting.ErrorCounter.Add(SourceUnit source, String
message, SourceSpan span, Int32 errorCode, Severity severity) in
C:\Users\ted\Desktop\IronRuby\src\Microsoft.Scripting\ErrorSink.cs:line
92
at IronRuby.Compiler.Tokenizer.Report(String message, Int32
errorCode, SourceSpan location, Severity severity) in
C:\Users\ted\Desktop\IronRuby\src\ironruby\Compiler\Parser\Tokenizer.cs:line
430
at IronRuby.Compiler.Tokenizer.ReportError(ErrorInfo info, Object[]
args) in
C:\Users\ted\Desktop\IronRuby\src\ironruby\Compiler\Parser\Tokenizer.cs:line
442
at IronRuby.Compiler.Tokenizer.Tokenize(Boolean whitespaceSeen,
Boolean cmdState) in
C:\Users\ted\Desktop\IronRuby\src\ironruby\Compiler\Parser\Tokenizer.cs:line
966
at IronRuby.Compiler.Tokenizer.Tokenize() in
C:\Users\ted\Desktop\IronRuby\src\ironruby\Compiler\Parser\Tokenizer.cs:line
739
at IronRuby.Compiler.Tokenizer.GetNextToken() in
C:\Users\ted\Desktop\IronRuby\src\ironruby\Compiler\Parser\Tokenizer.cs:line
711
at IronRuby.Compiler.Parser.GetNextToken() in
C:\Users\ted\Desktop\IronRuby\src\ironruby\Compiler\Parser\Parser.cs:line
99
at IronRuby.Compiler.ShiftReduceParser`2.Parse() in
C:\Users\ted\Desktop\IronRuby\src\ironruby\Compiler\Parser\GPPG.cs:line
310
at IronRuby.Compiler.Parser.Parse(SourceUnit sourceUnit,
RubyCompilerOptions options, ErrorSink errorSink) in
C:\Users\ted\Desktop\IronRuby\src\ironruby\Compiler\Parser\Parser.cs:line
158
at IronRuby.Runtime.RubyContext.ParseSourceCode(SourceUnit
sourceUnit, RubyCompilerOptions options, ErrorSink errorSink) in
C:\Users\ted\Desktop\IronRuby\src\ironruby\Runtime\RubyContext.cs:line
203
at IronRuby.Runtime.RubyContext.CompileSourceCode(SourceUnit
sourceUnit, CompilerOptions options, ErrorSink errorSink) in
C:\Users\ted\Desktop\IronRuby\src\ironruby\Runtime\RubyContext.cs:line
179
at Microsoft.Scripting.SourceUnit.Compile(CompilerOptions options,
ErrorSink errorSink) in
C:\Users\ted\Desktop\IronRuby\src\Microsoft.Scripting\SourceUnit.cs:line
215
at Microsoft.Scripting.SourceUnit.Execute(Scope scope, ErrorSink
errorSink) in
C:\Users\ted\Desktop\IronRuby\src\Microsoft.Scripting\SourceUnit.cs:line
225
at Microsoft.Scripting.Hosting.ScriptSource.Execute(ScriptScope
scope) in
C:\Users\ted\Desktop\IronRuby\src\Microsoft.Scripting\Hosting\ScriptSource.cs:line
129
at Microsoft.Scripting.Hosting.ScriptEngine.ExecuteFile(String
path, ScriptScope scope) in
C:\Users\ted\Desktop\IronRuby\src\Microsoft.Scripting\Hosting\ScriptEngine.cs:line
159
at Microsoft.Scripting.Hosting.ScriptEngine.ExecuteFile(String
path) in
C:\Users\ted\Desktop\IronRuby\src\Microsoft.Scripting\Hosting\ScriptEngine.cs:line
148
at Microsoft.Scripting.Hosting.ScriptRuntime.ExecuteFile(String
path) in
C:\Users\ted\Desktop\IronRuby\src\Microsoft.Scripting\Hosting\ScriptRuntime.cs:line
257
at HostingDLRConsole.Program.Main(String[] args) in
C:\Users\ted\Documents\Visual Studio 2008\Projects\Books\IronRuby in
Action\HostingDLRConsole\HostingDLRConsole\Program.cs:line 14
Press any key to continue . . .

I know I can fix this by using the Advanced Save Options but the DLR
spec talks about Unicode support, so I assume this means that
ScriptRuntime.ExecuteFile() should also support Unicode source files.

We do this for compatibility with Ruby 1.8.6, though as you can see, we
don’t have the error message quite right:

PS F:> C:\ruby\bin\ruby.exe x.rb
x.rb:1: Invalid char \377' in expression x.rb:1: Invalid char\376’ in expression

:slight_smile:

I believe you’ll need to save as UTF-8 and then manually strip the BOM
in order to use Unicode source files – hopefully Tomas will tell me if
I’m wrong.

Source encoding for Ruby is extremely tricky, and (from what I can tell)
hasn’t even yet been finalized for 1.9.x. We will eventually support
whatever the Ruby standards are.

Why so rigorous? I understand the need to maintain compatibility but
this effectively eliminates Visual Studio as an editor for .rb files,
without some kind of clunky build mechanism. I guess I will just use
an extension method to get around the behavior for the time being.

From the things I have read about Ruby and UTF-8, it seems more like
it is just extremely broken, rather than extremely tricky. I still
cannot even get pure Ruby stuff in Windows to work properly with
UTF-8, like when using the Shoes toolkit for example.

Here is the extension method I am using if anyone else is interested:

public static object ExecuteUnicodeFile( this ScriptRuntime rt, string
filename )
{
string rbCode;

// OpenText will strip the BOM and keep the Unicode intact
using( var rdr = File.OpenText( filename ) )
{
    rbCode = rdr.ReadToEnd();
}

return IronRuby.Ruby.GetEngine( rt ).Execute( rbCode );

}

It works great for using Japanese in strings in Ruby with IronRuby and
WPF.

If you save in “Western European (Windows) - Codepage 1252” from within
Visual Studio, you’ll get the right result – as long as you’re not
using any characters with a codepoint greater than 127. And if you are,
you’re probably better off anyway expressing this code point as an
explicit set of UTF-8 compatible bytes because – as you’ve noticed –
Ruby’s currently a bit weird in its Unicode support.