Control Characters

I am attempting a text editor using wxRuby. I’m having character issues.

Strings are not binary-safe.

Some characters are not allowed.

  • newline / line feed (\n) and tab (\t) are displayed
  • carriage return (\r) is stripped
  • Other control characters and high-ASCII bytes cause the value of the
    control to become empty.

Affected controls include: Wx::TextCtrl, Wx::StaticText, Wx::Clipboard,
et al.

Most text editors allow editing of recognized characters (in whatever
format is specified) without disrupting unknown characters.

Known control/special characters are hidden unless usually printable
(CR, LF, HT). When “highlight special characters” is enabled, they are
displayed as their abbreviation or icon: “[CR]”, “[LF]”, “[HT]”, “[NUL]”.

For example, editors Scite, ConTEXT, et al.
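
A minimal sketch of that “highlight special characters” display, in plain Ruby and independent of wxRuby. The helper name `highlight_specials` and the `C0_NAMES` table are made up for illustration; the abbreviations are the standard ASCII control-code names.

```ruby
# Standard abbreviations for the 32 ASCII C0 control codes, indexed by byte value.
C0_NAMES = %w[NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
              DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US].freeze

# Hypothetical helper: replace every ASCII control character with "[NAME]".
def highlight_specials(text)
  text.gsub(/[\x00-\x1F\x7F]/) do |ch|
    code = ch.ord
    name = code == 0x7F ? "DEL" : C0_NAMES[code]
    "[#{name}]"
  end
end

puts highlight_specials("a\r\n\tb\x00")
```

This only produces a display string; mapping edits in the display back to the real bytes is a separate problem.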

The only solution I can come up with is to implement folding using
XML or similar. The real document is sanitized before it is displayed in
the control. When the control is edited, a compare is performed and the
real document is modified accordingly.

The Wx::Clipboard is even worse. If I copy text from a non-UTF-8 source,
the original encoding/binary is lost. Ug.

Thoughts?

Thanks in advance.

-AH

Hi

Alexander Hawley wrote:

I am attempting a text editor using wxRuby. I’m having character issues.

Strings are not binary-safe.

Some characters are not allowed.

Can you explain what you mean - when they are being set from Ruby to the
control (eg TextCtrl#value=) or retrieving user input from the GUI?

Ideally a short test case or example bit of code which shows what’s
going on.

How it’s expected to work: all strings passed into wxRuby should be
UTF-8; all strings returned from wxRuby will be UTF-8. If you have data
(eg read from a file) in another encoding, you need to use Iconv or
similar to convert it.
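
Iconv was the usual tool when this thread was written; on Ruby 1.9 and later, `String#encode` covers the same ground. A minimal sketch, assuming the bytes are known to be ISO-8859-1 (the source encoding is an assumption that has to come from somewhere else, and `to_utf8` is a hypothetical helper, not part of wxRuby):

```ruby
# Hypothetical helper: label raw bytes with their real encoding, then transcode.
def to_utf8(raw_bytes, source_encoding)
  raw_bytes.dup.force_encoding(source_encoding).encode("UTF-8")
end

latin1 = "caf\xE9".b                    # "café" as ISO-8859-1 bytes
utf8   = to_utf8(latin1, "ISO-8859-1")

puts utf8.encoding
puts utf8.valid_encoding?
puts utf8.bytes.map { |b| "%02X" % b }.join(" ")
```

`force_encoding` only changes the label on the bytes; `encode` is the step that actually rewrites them as UTF-8.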

Known control/special characters are hidden unless usually printable
(CR, LF, HT). When “highlight special characters” is enabled, they are
displayed as their abbreviation or icon: “[CR]”, “[LF]”, “[HT]”, “[NUL]”.

For example, editors Scite, ConTEXT, et al.

If you want a code-oriented text editor, you probably want to use
Wx::StyledTextCtrl as the base for your text control. It’s the same
editing/highlighting component used by Scite.

The only solution I can come up with is to implement folding using
XML or similar. The real document is sanitized before it is displayed in
the control. When the control is edited, a compare is performed and the
real document is modified accordingly.

The Wx::Clipboard is even worse. If I copy text from non UTF-8, the
original encoding/binary is lost. Ug.

A short test case would really help here: what DataObject, what
platform, what version etc.

a

Thanks for your quick response!

Strings are not binary-safe.
Some characters are not allowed.
Can you explain what you mean - when they are being set from Ruby to the control (eg TextCtrl#value=) or retrieving user input from the GUI?
Ideally a short test case or example bit of code which shows what’s going on.

This script shows the different behavior for starting Wx::TextCtrl
values.

require "wx"

class TheApp < Wx::App
  def on_init
    frame = Wx::Frame.new(nil, -1, "TheApp")
    sizer = Wx::FlexGridSizer.new(2, 4)

    # Uncomment one alternative at a time to reproduce each result below.
    string = "fooboo"
    # string = "foo\xC2\xA5boo"
    # string = "foo\xE2\x90\x80boo"
    # string = "foo\x1Fboo"
    # string = "foo\x00boo"
    # string = "foo\x95boo"

    # Print the length, the inspected string and a hex dump of its bytes.
    puts "ruby string:\t#{string.length} #{string.inspect} #{string.unpack('H2' * string.length).join(" ").upcase}"

    @text = Wx::TextCtrl.new(frame, -1, string, :style => Wx::TE_MULTILINE)
    sizer.add(@text, 0, Wx::GROW|Wx::ALL, 4)
    value = @text.get_value
    puts "starting value:\t#{value.length} #{value.inspect} #{value.unpack('H2' * value.length).join(" ").upcase}"

    saveButton = Wx::Button.new(frame, -1, 'Save')
    saveButton.evt_button(saveButton.get_id) { | e | on_do_save }
    sizer.add(saveButton, 0, Wx::ALL, 4)

    frame.set_sizer(sizer)
    sizer.set_size_hints(frame)
    sizer.fit(frame)
    frame.show
  end

  # Dump whatever the control currently holds.
  def on_do_save
    value = @text.get_value
    puts "saved value:\t#{value.length} #{value.inspect} #{value.unpack('H2' * value.length).join(" ").upcase}"
  end
end

TheApp.new.main_loop

C:\>ruby script.rb
(different strings uncommented)

ruby string: 6 "fooboo" 66 6F 6F 62 6F 6F
starting value: 6 "fooboo" 66 6F 6F 62 6F 6F
saved value: 6 "fooboo" 66 6F 6F 62 6F 6F

ruby string: 8 "foo\302\245boo" 66 6F 6F C2 A5 62 6F 6F
starting value: 8 "foo\302\245boo" 66 6F 6F C2 A5 62 6F 6F
saved value: 8 "foo\302\245boo" 66 6F 6F C2 A5 62 6F 6F

ruby string: 9 "foo\342\220\200boo" 66 6F 6F E2 90 80 62 6F 6F
starting value: 9 "foo\342\220\200boo" 66 6F 6F E2 90 80 62 6F 6F
saved value: 9 "foo\342\220\200boo" 66 6F 6F E2 90 80 62 6F 6F

ruby string: 7 "foo\037boo" 66 6F 6F 1F 62 6F 6F
starting value: 7 "foo\037boo" 66 6F 6F 1F 62 6F 6F
saved value: 7 "foo\037boo" 66 6F 6F 1F 62 6F 6F

ruby string: 7 "foo\000boo" 66 6F 6F 00 62 6F 6F
starting value: 3 "foo" 66 6F 6F
saved value: 3 "foo" 66 6F 6F

ruby string: 7 "foo\225boo" 66 6F 6F 95 62 6F 6F
starting value: 0 ""
saved value: 0 ""

Some control characters work fine (e.g. \x1F).

Other control characters (e.g., \x00) cause the value to be truncated
before the character.

Still other control characters (e.g., \x95) cause the value to be
altogether empty.

Is this behavior on purpose? Is there a list of which control characters
do what?

If you want a code-oriented text editor, you probably want to use Wx::StyledTextCtrl as the base for your text control. It’s the same editing/highlighting component used by Scite.

I guess my noob side showed through. Thanks for pointing me to that
control.

I guess I need to read up before I open my mouth. Let the character
testing begin.

Wx::Clipboard
what DataObject, what platform, what version

Wx::DF_TEXT
Windows API: CF_OEMTEXT, CF_TEXT, CF_UNICODETEXT
Windows XP

I was just testing someone else’s script. This object is complex! I
suspect it’s a combination of how Windows implements clipboard data
formats and the Wx UTF-8 requirement.

From tests of Windows native clipboard API calls, it seems they get
themselves confused about text display versus binary value.

It seems this is a hot issue for general wxWidgets as well.

Thanks.

-AH

Alexander Hawley wrote:

This script shows the different behavior for starting Wx::TextCtrl
values.

Thanks for the sample code. It seems to me there are a couple of
different issues here:

string = "foo\x00boo"

Embedded NUL characters. At the moment the wxRuby Ruby->C++ conversion
for String relies on C conventions - ie that the NUL character
terminates a string. So although both Ruby and wxWidgets permit Strings
with embedded NUL, they get truncated in conversion.
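
The effect can be imitated in plain Ruby. This is only an illustration of C-style NUL termination, not the actual wxRuby wrapper code: the `Z*` unpack directive reads a string up to the first NUL byte, much as a C function taking a `char *` would.

```ruby
ruby_string = "foo\x00boo"

# "Z*" stops at the first NUL, like a C routine scanning for the terminator.
c_view = ruby_string.unpack("Z*").first

puts ruby_string.bytesize   # all seven bytes are present on the Ruby side
puts c_view.inspect         # what survives a NUL-terminated hand-off
puts c_view.bytesize
```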

This is a bug, and I think it’s fairly easy to fix in the wrapping - but
it’s also quite far-reaching, so we need to check that the byte/character
counts are right so it doesn’t cause regressions elsewhere.

string = "foo\x95boo"

This just isn’t a valid UTF-8 string. Presumably it makes sense in
some 8-bit encoding (eg ISO-8859-1) so you need to use Iconv or similar
to convert it before feeding it to wxRuby.
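
Both cases can be spotted from plain Ruby (1.9 or later) before a string ever reaches a control. A rough sketch, using the six test strings from the script; `utf8_problems` is a hypothetical helper, and the two checks simply mirror the two issues described above rather than anything wxRuby itself exposes:

```ruby
# Hypothetical helper: list the reasons a string may not survive the trip into wxRuby.
def utf8_problems(text)
  problems = []
  problems << "invalid UTF-8" unless text.dup.force_encoding("UTF-8").valid_encoding?
  problems << "embedded NUL" if text.b.include?("\x00".b)
  problems
end

samples = [
  "fooboo",              # plain ASCII
  "foo\xC2\xA5boo",      # U+00A5, two-byte UTF-8 sequence
  "foo\xE2\x90\x80boo",  # U+2400, three-byte UTF-8 sequence
  "foo\x1Fboo",          # ASCII control character
  "foo\x00boo",          # embedded NUL
  "foo\x95boo"           # lone high byte, not a UTF-8 sequence
]

samples.each do |sample|
  problems = utf8_problems(sample)
  puts "#{sample.b.inspect}: #{problems.empty? ? 'ok' : problems.join(', ')}"
end
```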

  @text = Wx::TextCtrl.new(frame, -1, string, :style => Wx::TE_MULTILINE)

The Wx::TextCtrl documentation states (although not very pointedly) that
the only control character permitted is a newline. I think TextCtrl is
aimed only at natural language text, so it has to be StyledTextCtrl
(Scintilla) here.

I saw your email re STC, and have been trying something similar here.
I’m not sure, but I think STC (wxWidgets’ wrapping of Scintilla) is
making assumptions about a NUL character terminating a string; even if I
pass it the right stuff from Ruby, it’s still truncating it.

I suspect it’s a combination of how Windows implements clipboard data
formats and the Wx UTF-8 requirement.

From tests of Windows native clipboard API calls, it seems they get
themselves confused about text display versus binary value.

It seems this is a hot issue for general wxWidgets as well.

Yes, getting the clipboard to work across platforms is a messy business,
because each platform uses different native encodings and a different
scheme to denote data types. What I find on OS X is that the raw data is
UTF16, but for higher-level calls, wxRuby & wxWidgets will do the
conversion. Even if I place DF_TEXT on the clipboard, I can only
retrieve DF_UNICODETEXT.

This isn’t resolved in other Ruby GUI libraries either - of the other
two most popular libraries, GNOME2 doesn’t support Windows clipboard at
all, and Shoes doesn’t even try to offer that GUI convention.

Thanks for bringing this up. I’ll see what we can fix for the next
release, but I have a hunch it may not be possible to get it 100%
perfect b/c of all the other components involved. Some test cases may
really help, if possible, and I may focus on getting things most correct
for Ruby 1.9.

An example test for Clipboard is here:
http://wxruby.rubyforge.org/svn/trunk/wxruby2/tests/test_clipboard.rb

alex

It seems Wx::StyledTextCtrl is no better on the NUL character problem.

This is really weird, because Scite can definitely handle NULs.

This script shows the different behavior for starting Wx::StyledTextCtrl
values.

require "wx"

class TheApp < Wx::App
  def on_init
    frame = Wx::Frame.new(nil, -1, "TheApp")
    sizer = Wx::FlexGridSizer.new(2, 4)

    # Variant 1: set the text from a Ruby string.
    string = "foo\x00boo"
    puts "ruby string:\t#{string.length} #{string.inspect} #{string.unpack('H2' * string.length).join(" ").upcase}"

    # Variant 2: load the same bytes from a file (uncomment these lines
    # and load_file below, and comment out variant 1 and set_text).
    # file = 'null.txt'
    # fileContents = nil
    # File.open(file, 'rb') { |m_file|
    #   fileContents = m_file.read
    # }
    # puts "ruby file contents:\t#{fileContents.length} #{fileContents.inspect} #{fileContents.unpack('H2' * fileContents.length).join(" ").upcase}"

    @text = Wx::StyledTextCtrl.new(frame)
    @text.set_text(string)
    # @text.load_file(file)

    sizer.add(@text, 0, Wx::ALL, 4)
    value = @text.get_text
    puts "starting value:\t#{value.length} #{value.inspect} #{value.unpack('H2' * value.length).join(" ").upcase}"

    saveButton = Wx::Button.new(frame, -1, 'Save')
    saveButton.evt_button(saveButton.get_id) { | e | on_do_save }
    sizer.add(saveButton, 0, Wx::ALL, 4)

    frame.set_sizer(sizer)
    sizer.set_size_hints(frame)
    sizer.fit(frame)
    frame.show
  end

  # Dump whatever the control currently holds.
  def on_do_save
    value = @text.get_text
    puts "saved value:\t#{value.length} #{value.inspect} #{value.unpack('H2' * value.length).join(" ").upcase}"
  end
end

TheApp.new.main_loop

C:\>ruby script.rb
(string versus file uncommented)

ruby string: 7 "foo\000boo" 66 6F 6F 00 62 6F 6F
starting value: 3 "foo" 66 6F 6F
saved value: 3 "foo" 66 6F 6F

ruby file contents: 7 "foo\000boo" 66 6F 6F 00 62 6F 6F
starting value: 3 "foo" 66 6F 6F
saved value: 3 "foo" 66 6F 6F

NUL characters (e.g., \x00) cause the value to be truncated before the
character.

Both Wx::StyledTextCtrl#set_text and Wx::StyledTextCtrl#load_file
exhibit the same problem.

Thanks

-AH

Alex F. wrote:

Alexander Hawley wrote:

string = "foo\x00boo"

Embedded NUL characters. At the moment the wxRuby Ruby->C++ conversion
for String relies on C conventions - ie that the NUL character
terminates a string. So although both Ruby and wxWidgets permit
Strings with embedded NUL, they get truncated in conversion.

I had a closer look at this. We can tweak the wrappings so that embedded
NUL characters are preserved as they are passed between Ruby and
wxWidgets.

However, the wxWidgets wrapping around Scintilla makes the assumption
that strings are terminated by NUL - so even if a NUL character is
entered, it can’t be retrieved. See

http://lists.wxwidgets.org/pipermail/wxpython-users/2004-September/031993.html

I had a look at adding special methods to bypass this problem, but no
joy yet.

What I suggest as a workaround is gsub-bing the string as it goes in and
out, and replacing NUL with the unicode character symbol-for-null.

string.gsub(/\x00/, "\xE2\x90\x80")
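
Spelled out for both directions, the workaround might look like the sketch below. The helper names `escape_nul` and `unescape_nul` are made up for illustration; U+2400 (SYMBOL FOR NULL) is the character whose UTF-8 bytes are E2 90 80.

```ruby
NUL_CHAR   = "\u0000"
NUL_SYMBOL = "\u2400"   # SYMBOL FOR NULL; UTF-8 bytes E2 90 80

# Going in: swap each NUL for a visible stand-in before handing text to the control.
def escape_nul(text)
  text.gsub(NUL_CHAR, NUL_SYMBOL)
end

# Coming out: turn the stand-in back into a real NUL.
def unescape_nul(text)
  text.gsub(NUL_SYMBOL, NUL_CHAR)
end

original = "foo\u0000boo"
shown    = escape_nul(original)
restored = unescape_nul(shown)

puts shown.bytes.map { |b| "%02X" % b }.join(" ")
puts restored == original
```

One caveat with this scheme: a document that already contains a literal U+2400 will have it turned into a NUL on the way out, so a real editor would need an extra escape for that case.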

By the way, other control characters are displayed as you describe in
Scite.

I’ve filed a bug to track this:

http://rubyforge.org/tracker/index.php?func=detail&aid=23814&group_id=35&atid=218

alex

However, the wxWidgets wrapping around Scintilla makes the assumption that strings are terminated by NUL - so even if a NUL character is entered, it can’t be retrieved.

A dependency/assumption in how wxWidgets uses C, I suspect.

Scite must not have the same C dependencies/assumptions as wxWidgets. It
can display most control characters okay.

What I suggest as a workaround is gsub-bing the string as it goes in and out, and replacing NUL with the unicode character symbol-for-null.

Already there. I’m already doing a string search for existing characters
that are not safe for UTF-8.

By the way, other control characters are displayed as you describe in Scite.

Yup. I got the EOL display, whitespace display, control display methods
working nicely.

Thanks for all the work looking into it. And the formal bug.

-AH