Csv.rb, utf-8, 1.9 compatibility

dubstep · March 1, 2011, 4:36pm

I have a tiny CSV file (UTF-8),
and also a tiny script reading it with the help of csv.rb.

require 'csv'

# $ ~/.rvm/bin/ruby-1.9.2-p180       -w test-utf.03.rb
# $ ~/.rvm/bin/jruby-1.6.0.RC2 --1.9 -w test-utf.03.rb

f = open('g.UTF-8.csv', "r:UTF-8")

CSV.new(f,headers: true, row_sep: "\r\n").each do |csv_record|

  if true
    printf "%s=>{%05.5d},{%s}=>{%s}\n" \
      ,'$.',$. \
      ,'f1',csv_record['f1'] \
      ;
  end

end

It runs through just as expected with ruby-1.9.2-p180,
but it runs into an exception with jruby-1.6.0.RC2 :

$ ~/.rvm/bin/jruby-1.6.0.RC2 --ng --1.9 -w test-utf.03.rb
$.=>{00002},{f1}=>{L1}
$.=>{00003},{f1}=>{L2}
org/jruby/RubyString.java:2858:in `gsub19': invalid byte sequence in UTF-8 (ArgumentError)
        from org/jruby/RubyString.java:2838:in `gsub_bang19'
        from /home/jochen_hayek/.rvm/rubies/jruby-1.6.0.RC2/lib/ruby/1.9/csv.rb:1870:in `shift'
        from org/jruby/RubyArray.java:1676:in `each'
        from /home/jochen_hayek/.rvm/rubies/jruby-1.6.0.RC2/lib/ruby/1.9/csv.rb:1863:in `shift'
        from org/jruby/RubyKernel.java:1418:in `loop'
        from /home/jochen_hayek/.rvm/rubies/jruby-1.6.0.RC2/lib/ruby/1.9/csv.rb:1825:in `shift'
        from /home/jochen_hayek/.rvm/rubies/jruby-1.6.0.RC2/lib/ruby/1.9/csv.rb:1768:in `each'
        from test-utf.03.rb:10:in `(root)'

I talked to csv.rb's developer and maintainer;
he also thinks that JRuby (--1.9) needs fixing, not csv.rb.

I hesitated to post this case here,
but I just read the 1.6.0.RC2 note,
and especially the call "to help us round out our 1.9.2 compatibility before
the final 1.6.0 release".
("It would be especially helpful if users would test out 1.9 mode …")

J.

jochen_hayek · March 1, 2011, 6:20pm

I just tried this with master and we have fixed the malformed UTF-8
character error already. However, on Windows with both MRI 1.9.2
(RubyInstaller) and JRuby master (destined to be 1.6.0.RC3) I see the
following error:

…
CHARS: ISO-8859-1
CHARS: US-ASCII
CHARS: ISO-8859-1
CHARS: US-ASCII
CSV::MalformedCSVError: Unquoted fields do not allow \r or \n (line 1).
shift at C:/opt/jruby.cygwin/lib/ruby/1.9/csv.rb:1893
each at org/jruby/RubyArray.java:1676
shift at C:/opt/jruby.cygwin/lib/ruby/1.9/csv.rb:1863

(The error is the same error in MRI as well)

So I think the problem you reported is fixed and I don’t know about
the error I am seeing. Perhaps this is something strange only
happening on Windows? I assume this script runs to completion without
error on your machine.

-Tom

–
blog: http://blog.enebo.com twitter: tom_enebo
mail: [email protected]

jochen_hayek · March 2, 2011, 9:09am

Hi,

When I read CSV files in Ruby 1.9, I have to put this line into my source
code (it must be the first line!):

# encoding: UTF-8

If the line is not present, Ruby 1.9 thinks that the CSV file is in
US-ASCII, and when it encounters UTF-8 characters it raises "invalid byte
sequence in UTF-8 (ArgumentError)".

This is not necessary with Ruby 1.8
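
For comparison, the script at the top of this thread passes the encoding in the open mode string ("r:UTF-8") instead. Here is a minimal sketch of how to check which encoding Ruby 1.9 or later attaches to the data it reads. The sample content and the temporary file are made up for illustration, and "b" is added to the mode so that line endings are read unchanged on every platform:

```ruby
require 'tempfile'

# Hypothetical sample: a header row plus one record containing non-ASCII
# characters, with "\r\n" line endings.
sample = "f1\r\nGr\u00FC\u00DFe\r\n"

explicit = nil
raw = nil

Tempfile.create(['sample', '.csv']) do |tmp|
  tmp.binmode
  tmp.write(sample)
  tmp.flush

  # External encoding given in the mode string (the original script uses "r:UTF-8").
  explicit = File.open(tmp.path, 'rb:UTF-8') { |f| f.read }
  # Plain binary read: the same bytes, but no text encoding is attached.
  raw = File.open(tmp.path, 'rb') { |f| f.read }
end

# Re-tag a copy of the raw bytes as US-ASCII to see why that label cannot be right.
as_ascii = raw.dup.force_encoding('US-ASCII')

puts explicit.encoding         # UTF-8
puts explicit.valid_encoding?  # true
puts as_ascii.valid_encoding?  # false, the bytes above 0x7F are not ASCII
```

If `valid_encoding?` returns false for a string, regexp-based methods such as `gsub` are likely to raise an "invalid byte sequence" ArgumentError on it.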

Hope this helps

jochen_hayek · March 1, 2011, 6:31pm

Heh. If you are concerned about the CHARS output: that was a print I
put into my dev tree to fix another problem…

-Tom

jochen_hayek · March 2, 2011, 3:53pm

jruby-head (1.8.7 patchlevel 330) still reports the same exception stack
for me on linux-i386-java and darwin-x86_64-java.

And ruby-head (1.9.3dev / 2011-03-02 trunk 31005 – MRI) still works for
me, on i686-linux as well as on x86_64-darwin10.6.0,
no problems with csv.rb whatsoever.

The CSV file actually has "\r\n" line endings.
Maybe the row_sep: option is handled incorrectly on Windows.
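
To take the file and the platform's newline handling out of the picture, the same options can be tried on an in-memory string. This is only a sketch; the data below is made up to mirror the f1/L1/L2 values from the output above:

```ruby
require 'csv'

# Hypothetical in-memory data: header "f1" and two records,
# every row terminated by "\r\n".
data = "f1\r\nL1\r\nL2\r\n"

# Same options as in the script: first row is the header, rows end in "\r\n".
table = CSV.parse(data, headers: true, row_sep: "\r\n")
values = table.map { |row| row['f1'] }

p values  # ["L1", "L2"]
```

If this runs cleanly but reading the file does not, the difference probably lies in how the file is opened rather than in row_sep itself.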

jochen_hayek · March 2, 2011, 4:09pm

Ah…just ran on MacOS and I do see the error now. I think you are
right that on Windows there is something odd about the newline
handling before we actually get to the offending characters (and not
just odd on JRuby since MRI also had the same issue). Digging into
this.

-Tom

jochen_hayek · March 3, 2011, 3:07pm

Now it works well for me on MacOS and Linux, too.

Thanks!

J.

jochen_hayek · March 2, 2011, 9:03pm

Fixed on master in commit 9d82fe3. The problem was that we were not wiping
out a string's old coderange value before setting a new one (a bunch of
bit math). So we switched a code range from being marked as 7bit-only
text to one which was VALID, but the bits of 7bit | VALID == BROKEN!

I was able to run your test and it worked fine on MacOS.
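
The bit math can be sketched roughly like this. The constant values and method names are illustrative assumptions, not copied from the JRuby source; only the relationship 7bit | VALID == BROKEN comes from the description above:

```ruby
# Illustrative flag values (assumed): the code range lives in two bits.
CR_UNKNOWN = 0b00
CR_7BIT    = 0b01
CR_VALID   = 0b10
CR_BROKEN  = CR_7BIT | CR_VALID  # both bits set means "broken"
CR_MASK    = CR_7BIT | CR_VALID

# Buggy variant: ORs the new code range onto whatever bits were already set.
def set_coderange_buggy(flags, new_cr)
  flags | new_cr
end

# Fixed variant: clears the old code range bits first, then sets the new one.
def set_coderange_fixed(flags, new_cr)
  (flags & ~CR_MASK) | new_cr
end

flags = CR_7BIT  # string was previously scanned as pure 7-bit text

buggy = set_coderange_buggy(flags, CR_VALID)
fixed = set_coderange_fixed(flags, CR_VALID)

puts buggy == CR_BROKEN  # true: a valid string ends up flagged as broken
puts fixed == CR_VALID   # true
```

With the stale bit left in place, a perfectly valid UTF-8 string is reported as broken, which would explain the "invalid byte sequence" error from gsub.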

-Tom
