Forum: JRuby String#encode('utf-8') throws IllegalArgumentException on long strings (1505 MByte)

Peter V. (peter_v)
on 2013-06-28 23:51
(Received via mailing list)
Hi,

I can reproduce this exception on Mac and Linux with jruby-1.7.4
and jruby-head. I have not seen it on shorter strings. I have not
tested exactly at which size the problem starts to occur, but this
particular test case fails reproducibly on both platforms.

HTH,

@peter_v

+++++++++++++++++++++++++++++++++++++++++++++++++++

On Mac OS X (16 GB RAM):

/Users/peter_v/dbd $ rvm current
jruby-1.7.4@dbd

/Users/peter_v/dbd $ ruby -v
jruby 1.7.4 (1.9.3p392) 2013-05-16 2390d3b on Java HotSpot(TM) 64-Bit Server VM 1.6.0_45-b06-451-11M4406 [darwin-x86_64]

/Users/peter_v/dbd $ java -version
java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b06-451-11M4406)
Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01-451, mixed mode)

/Users/peter_v/dbd $ cat bin/test_3.rb
# encoding=us-ascii

#row = "A" * 300 # NEVER fails with this value of `row`
row = "A" * 301 # ALWAYS fails with this value of `row`
count = 5_000_000

csv_string = row * count
encoded_string = csv_string.encode("utf-8")

/Users/peter_v/dbd $ time jruby -J-Xmx18000m bin/test_3.rb
CharBuffer.java:311:in `allocate': java.lang.IllegalArgumentException
    from CharsetDecoder.java:775:in `decode'
    from CharsetTranscoder.java:81:in `transcode'
    from CharsetTranscoder.java:64:in `transcode'
    from CharsetTranscoder.java:110:in `transcode'
    from RubyString.java:7649:in `transcode'
    from RubyString.java:7590:in `encode'
    from RubyString$INVOKER$i$encode.gen:-1:in `call'
    from CachingCallSite.java:326:in `cacheAndCall'
    from CachingCallSite.java:170:in `call'
    from bin/test_3.rb:8:in `__file__'
    from bin/test_3.rb:-1:in `load'
    from Ruby.java:807:in `runScript'
    from Ruby.java:800:in `runScript'
    from Ruby.java:669:in `runNormally'
    from Ruby.java:518:in `runFromMain'
    from Main.java:390:in `doRunFromMain'
    from Main.java:279:in `internalRun'
    from Main.java:221:in `run'
    from Main.java:201:in `main'

real    0m5.942s
user    0m5.839s
sys    0m1.705s
/Users/peter_v/dbd $

++++++++++++++++++++++++++++++++++++++++++++++++++++

On Ubuntu 12.04 LTS 64 bit (20 GB RAM) with Oracle Java 7

peter_v@peter64:~/p/dbd$ rvm current
jruby-1.7.4@dbd

peter_v@peter64:~/p/dbd$ jruby -v
jruby 1.7.4 (1.9.3p392) 2013-05-16 2390d3b on Java HotSpot(TM) 64-Bit Server VM 1.7.0_25-b15 [linux-amd64]

peter_v@peter64:~/p/dbd$ java -version
java version "1.7.0_25"
Java(TM) SE Runtime Environment (build 1.7.0_25-b15)
Java HotSpot(TM) 64-Bit Server VM (build 23.25-b01, mixed mode)

peter_v@peter64:~/p/dbd$ uname -a
Linux peter64 3.5.0-34-generic #55~precise1-Ubuntu SMP Fri Jun 7 16:25:50 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

peter_v@peter64:~/p/dbd$ cat bin/test_3.rb
# encoding=us-ascii

#row = "A" * 300 # NEVER fails with this value of `row`
row = "A" * 301 # ALWAYS fails with this value of `row`
count = 5_000_000

csv_string = row * count
encoded_string = csv_string.encode("utf-8")

peter_v@peter64:~/p/dbd$ time jruby -J-Xmx18000m bin/test_3.rb
CharBuffer.java:330:in `allocate': java.lang.IllegalArgumentException
    from CharsetDecoder.java:792:in `decode'
    from CharsetTranscoder.java:81:in `transcode'
    from CharsetTranscoder.java:64:in `transcode'
    from CharsetTranscoder.java:110:in `transcode'
    from RubyString.java:7649:in `transcode'
    from RubyString.java:7590:in `encode'
    from RubyString$INVOKER$i$encode.gen:-1:in `call'
    from CachingCallSite.java:326:in `cacheAndCall'
    from CachingCallSite.java:170:in `call'
    from bin/test_3.rb:8:in `__file__'
    from bin/test_3.rb:-1:in `load'
    from Ruby.java:807:in `runScript'
    from Ruby.java:800:in `runScript'
    from Ruby.java:669:in `runNormally'
    from Ruby.java:518:in `runFromMain'
    from Main.java:390:in `doRunFromMain'
    from Main.java:279:in `internalRun'
    from Main.java:221:in `run'
    from Main.java:201:in `main'

real    0m4.459s
user    0m4.348s
sys    0m1.144s
peter_v@peter64:~/p/dbd$

++++++++++++++++++++++++++++++++++++++++++++++++++++

JRuby-head on Linux 64 bit (Java 1.7)

peter_v@peter64:~/p/dbd$ rvm use jruby-head@dbd
Using /home/peter_v/.rvm/gems/jruby-head with gemset dbd

peter_v@peter64:~/p/dbd$ rvm current
jruby-head@dbd

peter_v@peter64:~/p/dbd$ jruby -v
jruby 1.7.5.dev (1.9.3p392) 2013-06-23 fffffff on Java HotSpot(TM) 64-Bit Server VM 1.7.0_25-b15 [linux-amd64]

peter_v@peter64:~/p/dbd$ java -version
java version "1.7.0_25"
Java(TM) SE Runtime Environment (build 1.7.0_25-b15)
Java HotSpot(TM) 64-Bit Server VM (build 23.25-b01, mixed mode)

peter_v@peter64:~/p/dbd$ cat bin/test_3.rb
# encoding=us-ascii

#row = "A" * 300 # NEVER fails with this value of `row`
row = "A" * 301 # ALWAYS fails with this value of `row`
count = 5_000_000

csv_string = row * count
encoded_string = csv_string.encode("utf-8")

peter_v@peter64:~/p/dbd$ time jruby -J-Xmx18000m bin/test_3.rb
CharBuffer.java:330:in `allocate': java.lang.IllegalArgumentException
    from CharsetDecoder.java:792:in `decode'
    from CharsetTranscoder.java:81:in `transcode'
    from CharsetTranscoder.java:64:in `transcode'
    from CharsetTranscoder.java:110:in `transcode'
    from RubyString.java:7702:in `transcode'
    from RubyString.java:7643:in `encode'
    from RubyString$INVOKER$i$encode.gen:-1:in `call'
    from CachingCallSite.java:326:in `cacheAndCall'
    from CachingCallSite.java:170:in `call'
    from bin/test_3.rb:8:in `__file__'
    from bin/test_3.rb:-1:in `load'
    from Ruby.java:810:in `runScript'
    from Ruby.java:803:in `runScript'
    from Ruby.java:672:in `runNormally'
    from Ruby.java:521:in `runFromMain'
    from Main.java:381:in `doRunFromMain'
    from Main.java:278:in `internalRun'
    from Main.java:220:in `run'
    from Main.java:200:in `main'

real    0m4.823s
user    0m4.436s
sys    0m1.488s
peter_v@peter64:~/p/dbd$

+++++++++++++++++++++++++++++++++++++++++
Charles Nutter (headius)
on 2013-07-02 01:44
(Received via mailing list)
Interesting one... can you file as an issue on github so we can track
it?

- Charlie

Charles Nutter (headius)
on 2013-07-02 01:55
(Received via mailing list)
This appears to be a JDK bug. The following code in CharsetDecoder
grows the CharBuffer it is decoding into as it goes. As the size of the
incoming ByteBuffer gets close to the signed 32-bit max, the doubled
size overflows to a negative value and CharBuffer.allocate throws
IllegalArgumentException.

    if (cr.isOverflow()) {
        n = 2*n + 1;    // Ensure progress; n might be 0!
        CharBuffer o = CharBuffer.allocate(n);
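
A minimal sketch of the arithmetic only, not the JDK source. `grow` repeats the `2*n + 1` step from the excerpt in 32-bit `int` arithmetic. `initialCapacity` is an assumption: it supposes the decoder sizes its first buffer as `(int)(remaining * averageCharsPerByte)` in single-precision float, with `averageCharsPerByte` equal to 1.0 for US-ASCII. Under that assumption 1,505,000,000 is not exactly representable as a float, so the first buffer comes out slightly too small, the overflow branch runs, and the doubled size wraps negative. 1,500,000,000 is exactly representable, so no growth would be needed. That would be consistent with the 300 vs 301 behaviour in the test script, but treat it as a possible explanation rather than a confirmed one. The class and method names are made up for illustration.

```java
// Illustration only: hypothetical names, not code from the JDK or JRuby.
final class GrowthOverflowDemo {

    // Same growth rule as the excerpt: n = 2*n + 1, in 32-bit int arithmetic.
    static int grow(int n) {
        return 2 * n + 1;
    }

    // ASSUMPTION: the initial capacity is computed in float and truncated to int.
    static int initialCapacity(int remaining, float averageCharsPerByte) {
        return (int) (remaining * averageCharsPerByte);
    }

    public static void main(String[] args) {
        int ok = 300 * 5_000_000;   // 1,500,000,000 bytes
        int bad = 301 * 5_000_000;  // 1,505,000,000 bytes

        System.out.println(initialCapacity(ok, 1.0f));        // 1500000000
        System.out.println(initialCapacity(bad, 1.0f));       // 1504999936
        System.out.println(grow(initialCapacity(bad, 1.0f))); // -1284967423
    }
}
```

A negative capacity is what `CharBuffer.allocate` rejects with IllegalArgumentException.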

The only workaround I can offer is to not transcode such a large string.

The JDK should probably be fixed so that it does not overflow the
integer max here and grows the CharBuffer more conservatively when
approaching 2GB.
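
One way such a conservative growth rule could look, as a sketch under stated assumptions rather than an actual or proposed JDK patch: do the doubling in `long` and clamp the result to a cap just below `Integer.MAX_VALUE`. The class name, the method name, and the cap value are hypothetical.

```java
// Illustration only: hypothetical names and cap, not code from the JDK.
final class ClampedGrowthSketch {

    // ASSUMPTION: a cap slightly below Integer.MAX_VALUE, since some JVMs
    // refuse to allocate arrays right at the limit.
    static final int MAX_CAPACITY = Integer.MAX_VALUE - 8;

    // Returns the next buffer capacity without ever wrapping negative.
    static int nextCapacity(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("negative capacity: " + n);
        }
        if (n >= MAX_CAPACITY) {
            throw new OutOfMemoryError("cannot grow buffer past " + MAX_CAPACITY);
        }
        long doubled = 2L * n + 1;  // long arithmetic, so no 32-bit overflow
        return (int) Math.min(doubled, MAX_CAPACITY);
    }

    public static void main(String[] args) {
        System.out.println(nextCapacity(0));
        System.out.println(nextCapacity(1_504_999_936));
    }
}
```

The `+ 1` is kept so that a capacity of 0 still makes progress, as in the original excerpt.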

We can fix this by expanding the use of our own decode loop, which
tries to avoid over-allocating buffers. We could probably also fix it
by getting the jcodings transcoding logic working. But working with
String data close to the signed 32-bit max is likely to run into other
issues, since the JVM can only index arrays (e.g. the byte[] in a
String) with a signed 32-bit int.
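
For illustration, a decode loop that never grows its buffer could look roughly like the sketch below. This is not JRuby's decode loop; it only shows the general idea using the standard `CharsetDecoder.decode(ByteBuffer, CharBuffer, boolean)` and `flush` calls with a fixed-size CharBuffer that is drained and reused on each overflow. The class name, method names, and the StringBuilder sink are assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Illustration only: hypothetical names, not JRuby's implementation.
final class BoundedDecodeSketch {

    // Decodes `in` through a fixed-size buffer, appending each chunk to `sink`.
    static void decode(ByteBuffer in, Charset cs, int chunkChars, StringBuilder sink)
            throws CharacterCodingException {
        if (chunkChars < 2) {
            // Room for one surrogate pair, so the loop always makes progress.
            throw new IllegalArgumentException("chunkChars must be at least 2");
        }
        CharsetDecoder decoder = cs.newDecoder();
        CharBuffer out = CharBuffer.allocate(chunkChars);

        while (true) {
            CoderResult cr = decoder.decode(in, out, true);
            if (cr.isError()) {
                cr.throwException();  // malformed or unmappable input
            }
            drain(out, sink);         // on overflow: empty the buffer, do not grow it
            if (cr.isUnderflow()) {
                break;                // all input consumed
            }
        }
        while (true) {
            CoderResult cr = decoder.flush(out);
            drain(out, sink);
            if (cr.isUnderflow()) {
                break;                // decoder has no more pending output
            }
        }
    }

    // Moves the decoded chars into the sink and resets the buffer for reuse.
    private static void drain(CharBuffer out, StringBuilder sink) {
        out.flip();
        sink.append(out);
        out.clear();
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] bytes = "AAAAAAA".getBytes(StandardCharsets.US_ASCII);
        StringBuilder sb = new StringBuilder();
        decode(ByteBuffer.wrap(bytes), StandardCharsets.US_ASCII, 2, sb);
        System.out.println(sb);
    }
}
```

The StringBuilder is only there to keep the example short and is itself limited to int-sized lengths. A real transcoder would more likely hand each chunk straight to an encoder or an output stream.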

- Charlie

Hirotsugu Asari (Guest)
on 2013-07-02 19:25
(Received via mailing list)
On Jun 30, 2013, at 6:59 PM, Charles Oliver Nutter <headius@headius.com>
wrote:

> Interesting one... can you file as an issue on github so we can track it?
>
> - Charlie

For reference: https://github.com/jruby/jruby/issues/845