Forum: Ruby-core [ruby-trunk - Bug #8129][Open] String#index has drastically different performance when a single unic

Posted by Zach Moazeni (zmoazeni)
on 2013-03-20 00:25
(Received via mailing list)
Issue #8129 has been reported by zmoazeni (Zach Moazeni).

----------------------------------------
Bug #8129: String#index has drastically different performance when a 
single unicode character is included
https://bugs.ruby-lang.org/issues/8129

Author: zmoazeni (Zach Moazeni)
Status: Open
Priority: Normal
Assignee:
Category:
Target version:
ruby -v: 2.0.0-p0


I created a simple ruby script:

```
#! /usr/bin/env ruby

raise "need a file name" unless ARGV[0]
contents = File.read(ARGV[0])

326_000.times do |i|
  contents[(i + 23) % contents.size]
end
```

And I uploaded two files below. One is all ASCII characters and the 
other has a single Unicode character in the first line (an "em dash").

String#index has dramatically different performance for the two strings. 
Locally, I'm seeing ~1.5 seconds with all_ascii.css and ~30 seconds with 
one_unicode.css on 1.9.3-p385. It gets worse with Ruby 2.0: 
all_ascii.css still takes ~1 sec, but one_unicode.css takes ~2.5 
minutes!

Any idea why the performance is so dramatically different between the 
two?
Posted by charliesome (Charlie Somerville) (Guest)
on 2013-03-20 00:41
(Received via mailing list)
Issue #8129 has been updated by charliesome (Charlie Somerville).

Status changed from Open to Rejected

When all the characters in a string are ASCII characters (single bytes), 
the byte index for any given character can be calculated in constant 
time.

When the string contains multibyte characters, finding the byte index 
given a character index becomes O(n).

If you need fast character indexing, try splitting the string into an 
array of characters.
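
A minimal sketch of that workaround, assuming the string is valid UTF-8. The sample string below is made up for illustration and stands in for the file contents from the original script. The idea is to pay the O(n) split once and then index the resulting array:

```ruby
# Sketch: split once, then index the array instead of the string.
# The sample string is hypothetical; the em dash (U+2014) is its only
# multibyte character.
text = "a \u2014 b" + ("x" * 1_000)

# false here, so text[i] cannot map a character index straight to a byte
puts text.ascii_only?

# One O(n) pass; to_a keeps this working where String#chars
# returns an Enumerator rather than an Array.
chars = text.chars.to_a

index = 500
puts chars[index] == text[index]   # same character either way
```

The trade-off is memory: the array holds one small String object per character, so this is only worth it when the same string is indexed many times.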
Posted by Nobuyoshi Nakada (nobu)
on 2013-03-20 00:45
(Received via mailing list)
Issue #8129 has been updated by nobu (Nobuyoshi Nakada).

Description updated


Posted by Nobuyoshi Nakada (nobu)
on 2013-03-20 00:53
(Received via mailing list)
Issue #8129 has been updated by nobu (Nobuyoshi Nakada).


You may want to:
* use a regexp, e.g. String#scan.
* convert to a fixed-width wide-character encoding, i.e., UTF-32LE or UTF-32BE.
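
Both suggestions can be sketched roughly as follows. The sample string and variable names are made up for illustration, and the input is assumed to be valid UTF-8:

```ruby
# Sketch of both suggestions on a hypothetical sample string.
text = "a \u2014 b" + ("x" * 1_000)

# 1. Regexp: String#scan walks the string once and collects each
#    character (/m lets "." match newlines too).
chars = text.scan(/./m)

# 2. Fixed-width encoding: every character is 4 bytes in UTF-32LE,
#    so character i starts at byte offset 4 * i.
wide = text.encode("UTF-32LE")
i = 2
char = wide.byteslice(4 * i, 4).encode("UTF-8")

puts chars[i] == char   # both approaches give the same character
```

The UTF-32 copy uses four bytes per character, so for mostly-ASCII input it takes about four times the memory of the UTF-8 original.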
Posted by Zach Moazeni (zmoazeni)
on 2013-03-20 01:00
(Received via mailing list)
Issue #8129 has been updated by zmoazeni (Zach Moazeni).


Thanks for the feedback, guys. This came up in 
https://github.com/kschiess/parslet/issues/73. Parslet makes heavy use of 
String#index 
(http://www.ruby-doc.org/core-2.0/String.html#method-i-index), passing 
a position to search from as the source content is consumed.
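
For readers unfamiliar with that access pattern, here is a rough sketch of it. This is not Parslet's actual code; the source text and the ";" delimiter are invented for illustration. The second argument to String#index is a character offset, so on a string with multibyte characters each call has to locate that offset first, which is the O(n) step described earlier in the thread:

```ruby
# Sketch: search forward from a position that advances as input is
# consumed. The source text and delimiter are hypothetical.
source = "a \u2014 b; c; d;"

pos = 0
found = []
while (hit = source.index(";", pos))
  found << hit     # character offset of the match
  pos = hit + 1    # resume just past the match
end

p found
```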

