String#encode('utf-8') throws IllegalArgumentException on long strings (1505 MByte)

Hi,

I can reproduce this exception on Mac and Linux, for jruby-1.7.4
and jruby-head. I have not seen it on shorter strings. I have not
tested exactly at which size the problem starts to occur, but it seems
very reproducible for this particular test case on Mac and Linux.

HTH,

@peter_v

+++++++++++++++++++++++++++++++++++++++++++++++++++

On Mac OS X (16 GB RAM):

/Users/peter_v/dbd $ rvm current
jruby-1.7.4@dbd

/Users/peter_v/dbd $ ruby -v
jruby 1.7.4 (1.9.3p392) 2013-05-16 2390d3b on Java HotSpot™ 64-Bit Server VM 1.6.0_45-b06-451-11M4406 [darwin-x86_64]

/Users/peter_v/dbd $ java -version
java version "1.6.0_45"
Java™ SE Runtime Environment (build 1.6.0_45-b06-451-11M4406)
Java HotSpot™ 64-Bit Server VM (build 20.45-b01-451, mixed mode)

/Users/peter_v/dbd $ cat bin/test_3.rb

# encoding=us-ascii

#row = "A" * 300 # NEVER fails with this value of row
row = "A" * 301 # ALWAYS fails with this value of row
count = 5_000_000

csv_string = row * count
encoded_string = csv_string.encode("utf-8")
/Users/peter_v/dbd $ time jruby -J-Xmx18000m bin/test_3.rb
CharBuffer.java:311:in `allocate': java.lang.IllegalArgumentException
    from CharsetDecoder.java:775:in `decode'
    from CharsetTranscoder.java:81:in `transcode'
    from CharsetTranscoder.java:64:in `transcode'
    from CharsetTranscoder.java:110:in `transcode'
    from RubyString.java:7649:in `transcode'
    from RubyString.java:7590:in `encode'
    from RubyString$INVOKER$i$encode.gen:-1:in `call'
    from CachingCallSite.java:326:in `cacheAndCall'
    from CachingCallSite.java:170:in `call'
    from bin/test_3.rb:8:in `__file__'
    from bin/test_3.rb:-1:in `load'
    from Ruby.java:807:in `runScript'
    from Ruby.java:800:in `runScript'
    from Ruby.java:669:in `runNormally'
    from Ruby.java:518:in `runFromMain'
    from Main.java:390:in `doRunFromMain'
    from Main.java:279:in `internalRun'
    from Main.java:221:in `run'
    from Main.java:201:in `main'

real 0m5.942s
user 0m5.839s
sys 0m1.705s
/Users/peter_v/dbd $

++++++++++++++++++++++++++++++++++++++++++++++++++++

On Ubuntu 12.04 LTS 64 bit (20 GB RAM) with Oracle Java 7

peter_v@peter64:~/p/dbd$ rvm current
jruby-1.7.4@dbd

peter_v@peter64:~/p/dbd$ jruby -v
jruby 1.7.4 (1.9.3p392) 2013-05-16 2390d3b on Java HotSpot™ 64-Bit Server VM 1.7.0_25-b15 [linux-amd64]

peter_v@peter64:~/p/dbd$ java -version
java version "1.7.0_25"
Java™ SE Runtime Environment (build 1.7.0_25-b15)
Java HotSpot™ 64-Bit Server VM (build 23.25-b01, mixed mode)

peter_v@peter64:~/p/dbd$ uname -a
Linux peter64 3.5.0-34-generic #55~precise1-Ubuntu SMP Fri Jun 7 16:25:50 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

peter_v@peter64:~/p/dbd$ cat bin/test_3.rb

# encoding=us-ascii

#row = "A" * 300 # NEVER fails with this value of row
row = "A" * 301 # ALWAYS fails with this value of row
count = 5_000_000

csv_string = row * count
encoded_string = csv_string.encode("utf-8")

peter_v@peter64:~/p/dbd$ time jruby -J-Xmx18000m bin/test_3.rb
CharBuffer.java:330:in `allocate': java.lang.IllegalArgumentException
    from CharsetDecoder.java:792:in `decode'
    from CharsetTranscoder.java:81:in `transcode'
    from CharsetTranscoder.java:64:in `transcode'
    from CharsetTranscoder.java:110:in `transcode'
    from RubyString.java:7649:in `transcode'
    from RubyString.java:7590:in `encode'
    from RubyString$INVOKER$i$encode.gen:-1:in `call'
    from CachingCallSite.java:326:in `cacheAndCall'
    from CachingCallSite.java:170:in `call'
    from bin/test_3.rb:8:in `__file__'
    from bin/test_3.rb:-1:in `load'
    from Ruby.java:807:in `runScript'
    from Ruby.java:800:in `runScript'
    from Ruby.java:669:in `runNormally'
    from Ruby.java:518:in `runFromMain'
    from Main.java:390:in `doRunFromMain'
    from Main.java:279:in `internalRun'
    from Main.java:221:in `run'
    from Main.java:201:in `main'

real 0m4.459s
user 0m4.348s
sys 0m1.144s
peter_v@peter64:~/p/dbd$

++++++++++++++++++++++++++++++++++++++++++++++++++++

JRuby-head on Linux 64 bit (Java 1.7)

peter_v@peter64:~/p/dbd$ rvm use jruby-head@dbd
Using /home/peter_v/.rvm/gems/jruby-head with gemset dbd

peter_v@peter64:~/p/dbd$ rvm current
jruby-head@dbd

peter_v@peter64:~/p/dbd$ jruby -v
jruby 1.7.5.dev (1.9.3p392) 2013-06-23 fffffff on Java HotSpot™ 64-Bit Server VM 1.7.0_25-b15 [linux-amd64]

peter_v@peter64:~/p/dbd$ java -version
java version "1.7.0_25"
Java™ SE Runtime Environment (build 1.7.0_25-b15)
Java HotSpot™ 64-Bit Server VM (build 23.25-b01, mixed mode)

peter_v@peter64:~/p/dbd$ cat bin/test_3.rb

# encoding=us-ascii

#row = "A" * 300 # NEVER fails with this value of row
row = "A" * 301 # ALWAYS fails with this value of row
count = 5_000_000

csv_string = row * count
encoded_string = csv_string.encode("utf-8")

peter_v@peter64:~/p/dbd$ time jruby -J-Xmx18000m bin/test_3.rb
CharBuffer.java:330:in `allocate': java.lang.IllegalArgumentException
    from CharsetDecoder.java:792:in `decode'
    from CharsetTranscoder.java:81:in `transcode'
    from CharsetTranscoder.java:64:in `transcode'
    from CharsetTranscoder.java:110:in `transcode'
    from RubyString.java:7702:in `transcode'
    from RubyString.java:7643:in `encode'
    from RubyString$INVOKER$i$encode.gen:-1:in `call'
    from CachingCallSite.java:326:in `cacheAndCall'
    from CachingCallSite.java:170:in `call'
    from bin/test_3.rb:8:in `__file__'
    from bin/test_3.rb:-1:in `load'
    from Ruby.java:810:in `runScript'
    from Ruby.java:803:in `runScript'
    from Ruby.java:672:in `runNormally'
    from Ruby.java:521:in `runFromMain'
    from Main.java:381:in `doRunFromMain'
    from Main.java:278:in `internalRun'
    from Main.java:220:in `run'
    from Main.java:200:in `main'

real 0m4.823s
user 0m4.436s
sys 0m1.488s
peter_v@peter64:~/p/dbd$

+++++++++++++++++++++++++++++++++++++++++

Interesting one… can you file as an issue on github so we can track
it?

- Charlie

On Fri, Jun 28, 2013 at 5:50 PM, Peter V.

This appears to be a JDK bug. The following code in CharsetDecoder
grows the CharBuffer it is decoding into as it goes. As the size of
the incoming ByteBuffer approaches the signed 32-bit maximum, the new
capacity n overflows to a negative value, and CharBuffer.allocate
throws IllegalArgumentException.

        if (cr.isOverflow()) {
            n = 2*n + 1;    // Ensure progress; n might be 0!
            CharBuffer o = CharBuffer.allocate(n);

The only workaround I can offer is to not transcode such a large string.

The JDK should probably be fixed so that this computation does not
overflow the integer maximum, and so that it grows the CharBuffer more
conservatively when approaching 2GB.
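
To make the overflow concrete, here is a minimal Java sketch of that growth step. The class and method names are made up for illustration, the starting capacity is simply the byte length of the failing test string (301 * 5_000_000), and the clamped variant is only one assumed way of being more conservative, not the JDK's actual code or fix.

```java
import java.nio.CharBuffer;

class GrowthOverflowDemo {
    // Same arithmetic as the quoted growth step: double the capacity, add one.
    // int arithmetic wraps silently once the result exceeds Integer.MAX_VALUE.
    static int grow(int n) {
        return 2 * n + 1;
    }

    // Hypothetical conservative variant: compute in long, then clamp to int range.
    static int growClamped(int n) {
        long next = 2L * n + 1;
        return (int) Math.min(next, (long) Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        int n = 301 * 5_000_000;  // assumed starting capacity, taken from the test script
        int grown = grow(n);      // wraps around to a negative value
        System.out.println("grown capacity:   " + grown);
        System.out.println("clamped capacity: " + growClamped(n));
        try {
            CharBuffer.allocate(grown);  // a negative capacity is rejected
        } catch (IllegalArgumentException e) {
            System.out.println("CharBuffer.allocate threw " + e.getClass().getName());
        }
    }
}
```

Clamping only keeps the capacity non-negative; an allocation that close to 2GB can still fail for lack of heap or because of array size limits.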

We can fix this by expanding the use of our own decode loop, which
tries to avoid over-allocating buffers. We could probably also fix it
by getting the jcodings transcoding logic working. But working with
String data close to the signed 32-bit max is likely to run into other
issues, since the JVM can only index arrays (e.g. the byte[] in a
String) with a signed 32-bit int.
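
The idea of a decode loop that avoids over-allocating can be sketched in Java roughly as follows. This is not JRuby's actual loop: BoundedDecodeSketch, decode, drain and the chunkChars parameter are hypothetical names, and the StringBuilder sink just stands in for wherever the decoded characters would go. It relies only on the standard CharsetDecoder.decode(ByteBuffer, CharBuffer, boolean) and flush(CharBuffer) calls.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

class BoundedDecodeSketch {
    // Decodes input using a fixed-size working buffer that is drained and
    // reused on overflow instead of being reallocated at a larger size.
    static String decode(byte[] input, Charset charset, int chunkChars) {
        if (chunkChars < 2) {
            // Room for one surrogate pair is needed to guarantee progress.
            throw new IllegalArgumentException("chunkChars must be at least 2");
        }
        CharsetDecoder decoder = charset.newDecoder();
        ByteBuffer in = ByteBuffer.wrap(input);
        CharBuffer out = CharBuffer.allocate(chunkChars);
        StringBuilder sink = new StringBuilder();

        while (true) {
            // All input is already in the buffer, so endOfInput is always true.
            CoderResult cr = decoder.decode(in, out, true);
            drain(out, sink);
            if (cr.isUnderflow()) {
                break;  // every input byte has been consumed
            }
            if (cr.isError()) {
                throw new IllegalArgumentException("malformed or unmappable input: " + cr);
            }
            // Otherwise the result is overflow: loop again with the emptied buffer.
        }
        while (true) {
            CoderResult cr = decoder.flush(out);  // emit any pending decoder state
            drain(out, sink);
            if (cr.isUnderflow()) {
                break;
            }
        }
        return sink.toString();
    }

    // Moves whatever was decoded into the sink and resets the working buffer.
    private static void drain(CharBuffer out, StringBuilder sink) {
        out.flip();
        sink.append(out);
        out.clear();
    }
}
```

Because the working buffer is drained and reused rather than regrown, no 2*n + 1 step is needed. The result is still subject to the array indexing limit mentioned above.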

- Charlie

On Sun, Jun 30, 2013 at 6:59 PM, Charles Oliver N.

On Jun 30, 2013, at 6:59 PM, Charles Oliver N. wrote:

> Interesting one… can you file as an issue on github so we can track it?
>
> - Charlie

For reference: filed as jruby/jruby issue #845 on GitHub, "String#encode('utf-8') throws IllegalArgumentException on long strings (1505 MByte)".