Forum: Ruby-core [ruby-trunk - Bug #7442][Open] StringScanner#charpos vs StringScanner#pos

Posted by zenspider (Ryan Davis) (Guest)
on 2012-11-27 03:05
(Received via mailing list)
Issue #7442 has been reported by zenspider (Ryan Davis).

----------------------------------------
Bug #7442: StringScanner#charpos vs StringScanner#pos
https://bugs.ruby-lang.org/issues/7442

Author: zenspider (Ryan Davis)
Status: Open
Priority: Normal
Assignee:
Category: ext
Target version:
ruby -v: 1.9.x


=begin
I talked to Matz at rubyconf and he agreed this was a bug I should file. 
Sorry I took so long to do so.

As mentioned in #3482, StringScanner#pos is byte-oriented even when 
scanning multibyte strings. The reasoning was that IO#pos is 
byte-oriented so this is to spec and functioning correctly. The problem 
is that StringScanner isn't _just_ an IO as it also represents a String 
and the progress scanning through it. Strings in 1.9+ must respect their 
encodings and with a few exceptions don't even support the idea of naked 
bytes. I think StringScanner must be able to respect that.

Given that `ss` is a StringScanner instance on a string with a valid 
encoding, getting the substring of the current progress via 
`ss.string[0..ss.pos]` can result in a String with _invalid_ encoding. 
I propose that we add `#charpos` to make it possible to pull out a valid 
substring. This would also be useful towards being able to report proper 
offset or column information in the case of an error when you're using 
StringScanner as your lexer.

This is the code that I needed to get proper char-offsets (and 
substrings--I needed both for my purposes):

    def string_to_pos
      string.byteslice(0, pos)
    end

    def charpos
      string_to_pos.length
    end

=end
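
To illustrate the difference described in the report, here is a minimal sketch that reopens StringScanner with the two helpers quoted above and compares byte and character offsets on a UTF-8 string. The character-offset helper is renamed `charpos_from_bytes` (a hypothetical name, so it does not shadow a built-in `#charpos` on Ruby versions that have one), and the sample string and local variable names are assumptions chosen for illustration.

```ruby
# encoding: utf-8
require 'strscan'

class StringScanner
  # Text consumed so far, sliced by bytes so the result stays validly encoded.
  def string_to_pos
    string.byteslice(0, pos)
  end

  # Character offset derived from the byte offset.
  # Hypothetical name, chosen to avoid clashing with a built-in #charpos.
  def charpos_from_bytes
    string_to_pos.length
  end
end

ss = StringScanner.new("日本語 text")
ss.scan(/\p{Han}+/)   # consumes "日本語": 3 characters, 9 bytes in UTF-8

byte_offset = ss.pos                 # byte-oriented position
char_offset = ss.charpos_from_bytes  # character-oriented position
consumed    = ss.string_to_pos       # the prefix scanned so far

puts "pos=#{byte_offset} charpos=#{char_offset} consumed=#{consumed}"
```

With a lexer, the character offset is the value you would report as a column in an error message, while the byte offset stays useful for `byteslice`.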
Posted by mame (Yusuke Endoh) (Guest)
on 2012-11-27 04:25
(Received via mailing list)
Issue #7442 has been updated by mame (Yusuke Endoh).

Status changed from Open to Feedback
Target version set to Next Major

Sorry, it is too late to fix such a spec-level bug.  Setting the target 
to Next Major.
If you create and commit a patch by preview2 (1 Dec.), and if it does 
not lead to any problem (and any discussion) at all, we might include it 
in 2.0.0.

--
Yusuke Endoh <mame@tsg.ne.jp>
----------------------------------------
Bug #7442: StringScanner#charpos vs StringScanner#pos
https://bugs.ruby-lang.org/issues/7442#change-34004

Author: zenspider (Ryan Davis)
Status: Feedback
Priority: Normal
Assignee:
Category: ext
Target version: Next Major
ruby -v: 1.9.x


Posted by zenspider (Ryan Davis) (Guest)
on 2012-11-28 01:39
(Received via mailing list)
Issue #7442 has been updated by zenspider (Ryan Davis).


Committed revision 37916.

Please beat up on it.
----------------------------------------
Bug #7442: StringScanner#charpos vs StringScanner#pos
https://bugs.ruby-lang.org/issues/7442#change-34059

Author: zenspider (Ryan Davis)
Status: Feedback
Priority: Normal
Assignee:
Category: ext
Target version: Next Major
ruby -v: 1.9.x


Posted by zenspider (Ryan Davis) (Guest)
on 2012-11-29 01:41
(Received via mailing list)
Issue #7442 has been updated by zenspider (Ryan Davis).


No objections (yet)... can this be merged to 2.0 branch for next preview 
release?
----------------------------------------
Bug #7442: StringScanner#charpos vs StringScanner#pos
https://bugs.ruby-lang.org/issues/7442#change-34104

Author: zenspider (Ryan Davis)
Status: Feedback
Priority: Normal
Assignee:
Category: ext
Target version: Next Major
ruby -v: 1.9.x


Posted by mame (Yusuke Endoh) (Guest)
on 2012-11-29 04:26
(Received via mailing list)
Issue #7442 has been updated by mame (Yusuke Endoh).

Status changed from Feedback to Closed

I think so.  We can keep it unless any serious problem is reported after 
preview2.  Thanks for your quick action!

I'm slightly worried about its very inefficient implementation, but I 
don't know whether it matters because I understand the use case. 
Anyway, we can refine it after 2.0.0.

--
Yusuke Endoh <mame@tsg.ne.jp>
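
On the efficiency concern: the thread does not show the committed implementation, but if it mirrors the Ruby helper from the report, each call re-measures the whole prefix, which is linear in the current position. One possible alternative, sketched here under that assumption, is to keep a running character count as tokens are matched. `CountingScanner` is a hypothetical wrapper, not part of strscan.

```ruby
# encoding: utf-8
require 'strscan'

# Hypothetical wrapper (not part of strscan): keeps a running character
# count so a position query does not have to re-measure the prefix.
class CountingScanner
  attr_reader :charpos

  def initialize(source)
    @ss = StringScanner.new(source)
    @charpos = 0
  end

  # Delegates to StringScanner#scan and adds the matched text's
  # character length to the running total.
  def scan(pattern)
    matched = @ss.scan(pattern)
    @charpos += matched.length if matched
    matched
  end

  # Byte-oriented position, straight from the underlying scanner.
  def pos
    @ss.pos
  end
end

source = "日本語 text"
cs = CountingScanner.new(source)
cs.scan(/\p{Han}+/)   # 3 characters, 9 bytes
cs.scan(/\s+/)        # 1 character, 1 byte

# Cross-check against the prefix-measuring approach from the report.
prefix_length = source.byteslice(0, cs.pos).length
puts "pos=#{cs.pos} charpos=#{cs.charpos} prefix_length=#{prefix_length}"
```

The running count is only correct while every advance goes through the wrapper's `scan`; assigning the position directly on the underlying scanner would require recomputing it.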
----------------------------------------
Bug #7442: StringScanner#charpos vs StringScanner#pos
https://bugs.ruby-lang.org/issues/7442#change-34115

Author: zenspider (Ryan Davis)
Status: Closed
Priority: Normal
Assignee:
Category: ext
Target version: Next Major
ruby -v: 1.9.x

