Unicode Source Files


#1

Is the DLR going to be fixed so that it properly supports Unicode
source files or is this an issue with IronRuby? If you attempt to
create a new Code File with Visual Studio 2008 and call it test.rb and
then execute it with:

ScriptRuntime runtime = IronRuby.Ruby.CreateRuntime();
runtime.ExecuteFile( "test.rb" );

it blows up on the Unicode byte-order marker with:

Unhandled Exception: Microsoft.Scripting.SyntaxErrorException: Invalid character 'ï' in expression
   at Microsoft.Scripting.ErrorSink.Add(SourceUnit source, String message, SourceSpan span, Int32 errorCode, Severity severity) in C:\Users\ted\Desktop\IronRuby\src\Microsoft.Scripting\ErrorSink.cs:line 34
   at Microsoft.Scripting.ErrorCounter.Add(SourceUnit source, String message, SourceSpan span, Int32 errorCode, Severity severity) in C:\Users\ted\Desktop\IronRuby\src\Microsoft.Scripting\ErrorSink.cs:line 92
   at IronRuby.Compiler.Tokenizer.Report(String message, Int32 errorCode, SourceSpan location, Severity severity) in C:\Users\ted\Desktop\IronRuby\src\ironruby\Compiler\Parser\Tokenizer.cs:line 430
   at IronRuby.Compiler.Tokenizer.ReportError(ErrorInfo info, Object[] args) in C:\Users\ted\Desktop\IronRuby\src\ironruby\Compiler\Parser\Tokenizer.cs:line 442
   at IronRuby.Compiler.Tokenizer.Tokenize(Boolean whitespaceSeen, Boolean cmdState) in C:\Users\ted\Desktop\IronRuby\src\ironruby\Compiler\Parser\Tokenizer.cs:line 966
   at IronRuby.Compiler.Tokenizer.Tokenize() in C:\Users\ted\Desktop\IronRuby\src\ironruby\Compiler\Parser\Tokenizer.cs:line 739
   at IronRuby.Compiler.Tokenizer.GetNextToken() in C:\Users\ted\Desktop\IronRuby\src\ironruby\Compiler\Parser\Tokenizer.cs:line 711
   at IronRuby.Compiler.Parser.GetNextToken() in C:\Users\ted\Desktop\IronRuby\src\ironruby\Compiler\Parser\Parser.cs:line 99
   at IronRuby.Compiler.ShiftReduceParser`2.Parse() in C:\Users\ted\Desktop\IronRuby\src\ironruby\Compiler\Parser\GPPG.cs:line 310
   at IronRuby.Compiler.Parser.Parse(SourceUnit sourceUnit, RubyCompilerOptions options, ErrorSink errorSink) in C:\Users\ted\Desktop\IronRuby\src\ironruby\Compiler\Parser\Parser.cs:line 158
   at IronRuby.Runtime.RubyContext.ParseSourceCode(SourceUnit sourceUnit, RubyCompilerOptions options, ErrorSink errorSink) in C:\Users\ted\Desktop\IronRuby\src\ironruby\Runtime\RubyContext.cs:line 203
   at IronRuby.Runtime.RubyContext.CompileSourceCode(SourceUnit sourceUnit, CompilerOptions options, ErrorSink errorSink) in C:\Users\ted\Desktop\IronRuby\src\ironruby\Runtime\RubyContext.cs:line 179
   at Microsoft.Scripting.SourceUnit.Compile(CompilerOptions options, ErrorSink errorSink) in C:\Users\ted\Desktop\IronRuby\src\Microsoft.Scripting\SourceUnit.cs:line 215
   at Microsoft.Scripting.SourceUnit.Execute(Scope scope, ErrorSink errorSink) in C:\Users\ted\Desktop\IronRuby\src\Microsoft.Scripting\SourceUnit.cs:line 225
   at Microsoft.Scripting.Hosting.ScriptSource.Execute(ScriptScope scope) in C:\Users\ted\Desktop\IronRuby\src\Microsoft.Scripting\Hosting\ScriptSource.cs:line 129
   at Microsoft.Scripting.Hosting.ScriptEngine.ExecuteFile(String path, ScriptScope scope) in C:\Users\ted\Desktop\IronRuby\src\Microsoft.Scripting\Hosting\ScriptEngine.cs:line 159
   at Microsoft.Scripting.Hosting.ScriptEngine.ExecuteFile(String path) in C:\Users\ted\Desktop\IronRuby\src\Microsoft.Scripting\Hosting\ScriptEngine.cs:line 148
   at Microsoft.Scripting.Hosting.ScriptRuntime.ExecuteFile(String path) in C:\Users\ted\Desktop\IronRuby\src\Microsoft.Scripting\Hosting\ScriptRuntime.cs:line 257
   at HostingDLRConsole.Program.Main(String[] args) in C:\Users\ted\Documents\Visual Studio 2008\Projects\Books\IronRuby in Action\HostingDLRConsole\HostingDLRConsole\Program.cs:line 14
Press any key to continue . . .

I know I can fix this by using the Advanced Save Options but the DLR
spec talks about Unicode support, so I assume this means that
ScriptRuntime.ExecuteFile() should also support Unicode source files.
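
For anyone wondering where the 'ï' comes from: the UTF-8 byte-order mark is the three bytes EF BB BF, and byte EF is 'ï' when read as a single Latin-1 character. A minimal sketch of that, assuming the tokenizer looks at the file one byte at a time (the constant and method names here are just for illustration, and this is written for a current Ruby rather than 1.8.6):

```ruby
# The UTF-8 byte-order mark is the three bytes EF BB BF.
UTF8_BOM_BYTES = [0xEF, 0xBB, 0xBF].freeze

# Interpret each BOM byte as one Latin-1 character, roughly what a
# byte-oriented tokenizer would see at the very start of the file.
# (Illustrative helper, not part of IronRuby or the DLR.)
def bom_as_latin1
  UTF8_BOM_BYTES.pack("C*").force_encoding("ISO-8859-1").encode("UTF-8")
end

# The first character is U+00EF, the one named in the exception message.
puts bom_as_latin1[0]
```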


#2

We do this for compatibility with Ruby 1.8.6, though as you can see, we
don’t have the error message quite right:

PS F:\> C:\ruby\bin\ruby.exe x.rb
x.rb:1: Invalid char `\377' in expression
x.rb:1: Invalid char `\376' in expression

:)

I believe you’ll need to save as UTF-8 and then manually strip the BOM
in order to use Unicode source files – hopefully Tomas will tell me if
I’m wrong.
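
Stripping the BOM by hand could look something like the sketch below. It assumes the file is already saved as UTF-8, it is written for a current Ruby (String#b does not exist in 1.8.6), and both helper names are made up for illustration:

```ruby
# The three UTF-8 BOM bytes as a binary string.
UTF8_BOM = "\xEF\xBB\xBF".b

# Return the given bytes without a leading UTF-8 BOM, if one is present.
# (Hypothetical helper name.)
def strip_utf8_bom(bytes)
  data = bytes.b
  data.start_with?(UTF8_BOM) ? data.byteslice(UTF8_BOM.bytesize..-1) : data
end

# Hypothetical helper: rewrite a file in place without its BOM.
# Binary reads and writes keep every other byte untouched.
def strip_bom_from_file(path)
  File.binwrite(path, strip_utf8_bom(File.binread(path)))
end
```

Running strip_bom_from_file("test.rb") once after saving from Visual Studio should leave a file that starts with the first real character of the script.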

Source encoding for Ruby is extremely tricky, and (from what I can tell)
hasn’t even yet been finalized for 1.9.x. We will eventually support
whatever the Ruby standards are.


#3

Why so rigorous? I understand the need to maintain compatibility but
this effectively eliminates Visual Studio as an editor for .rb files,
without some kind of clunky build mechanism. I guess I will just use
an extension method to get around the behavior for the time being.

From the things I have read about Ruby and UTF-8, it seems more like
it is just extremely broken, rather than extremely tricky. I still
cannot even get pure Ruby stuff in Windows to work properly with
UTF-8, like when using the Shoes toolkit for example.


#4

Here is the extension method I am using if anyone else is interested:

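// Needs: using System.IO; using Microsoft.Scripting.Hosting;
// Extension methods must be declared inside a static class.
public static object ExecuteUnicodeFile( this ScriptRuntime rt, string filename )
{
    string rbCode;

    // OpenText detects the encoding from the BOM and drops the BOM,
    // so the Unicode text reaches the engine intact
    using( var rdr = File.OpenText( filename ) )
    {
        rbCode = rdr.ReadToEnd();
    }

    return IronRuby.Ruby.GetEngine( rt ).Execute( rbCode );
}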

It works great for using Japanese in strings in Ruby with IronRuby and
WPF.


#5

If you save as “Western European (Windows) - Codepage 1252” from within
Visual Studio, you’ll get the right result, as long as you’re not
using any characters with a code point greater than 127. If you are,
you’re probably better off writing those characters as explicit UTF-8
byte escapes anyway because, as you’ve noticed, Ruby is currently a bit
weird in its Unicode support.
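
As a rough illustration of the explicit-bytes approach, here is "日本" (U+65E5 U+672C) written so that the source file itself contains only ASCII. The constant and the helper are made-up names, and the helper assumes a current Ruby with encoding support, not 1.8.6:

```ruby
# "日本" as explicit UTF-8 byte escapes: E6 97 A5 is U+65E5 and
# E6 9C AC is U+672C.
NIHON = "\xE6\x97\xA5\xE6\x9C\xAC"

# Hypothetical helper: produce the \xNN escape text for any string, so
# the result can be pasted into an ASCII-only source file.
def utf8_escapes(str)
  str.encode("UTF-8").bytes.map { |b| format("\\x%02X", b) }.join
end
```

The trade-off is readability: the file survives any editor's save options, but the string literals are no longer legible without decoding them.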