Editing Text (#145)

bbazzarrakk · October 26, 2007, 3:04pm

The three rules of Ruby Q.:

Please do not post any solutions or spoiler discussion for this quiz
until
48 hours have passed from the time on this message.
Support Ruby Q. by submitting ideas as often as you can:

http://www.rubyquiz.com/

Enjoy!

Suggestion: A [QUIZ] in the subject of emails about the problem helps
everyone
on Ruby T. follow the discussion. Please reply to the original quiz
message,
if you can.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

by Eric M.

Have you ever wondered how a text buffer might be represented in a text
editor
or word processor? A simple string to represent text buffer isn’t
efficient
enough because inserting (i.e. typing) and deleting (backspace) in the
middle
would result in moving all of the text to the end for each operation. A
data
structure that can efficiently insert and delete in the middle is
needed.

The task is to implement data structures for efficiently editing text.
One
common data structure is a gap buffer:

Gap buffer - Wikipedia

Other options to consider include: ropes (see quiz #137), linked lists,
simple
strings, or a multi-level combination of data structures (i.e. for lines
vs.
characters in a line). There are many data structures that may work
efficiently
with simple editing operations, but not all of those data structures
will work
well for more complex functionality.

All of the basic operations occur around a text cursor. The minimal
operations
on/at the cursor should be:

Insert a character before or after the cursor.
Delete a character before or after the cursor and return the
deleted character.
Move the cursor left or right (crossing line boundaries if
necessary) and return the character or nil at the beginning
or end of the buffer.
Move the cursor up or down a line and return nil/false only if a
line boundary could not be crossed. The cursor may be placed in
the most natural column for the data structure.

Additional useful operations that you might find in a typical text
editor can
also be added:

Get current line and column numbers
Copy some amount of text before or after the cursor and return
this buffer.
Display some context around the cursor.
Cut some amount of text before or after the cursor and return
this buffer.
Paste a copy/cut buffer before or after the cursor.
Insert a file.
Write to a file.
Goto a specific line or column.
Goto the begin/end of the line/buffer.
Copy or cut to a specific line/column.
Filter some text through a ruby block.
Search (and possibly replace) using a regular expression.
Undo/redo.

Major bonus points for the following where gap buffers probably won’t
work:

Only store changes to a file.
Handle multiple efficient cursors in a text buffer.

Although the focus is on data structures and making the ruby DSL
equivalent to
unix “ed” or DOS “edlin”, a GUI could be added to make a
full-screen/window text
editor.

Here is a benchmark for testing that needs the minimal implementation
(#insert_before, #insert_after, #delete_before, #delete_after, #left,
#right,
#up, #down):

edit_test.rb

usage: ruby -r klass.rb edit_test.rb \

[ [ ] …]

require ‘benchmark’
require ‘test/unit/assertions’
include Test::Unit::Assertions

char = byte pre 1.9, each_char already defined in 1.9

unless “”.respond_to?(:each_char)
class String;alias_method(:each_char, :each_byte);end
end

iterations = ARGV.shift.to_i

while cursor = ARGV.shift
nlines = (ARGV.shift || 10000).to_i
ncolumns = (ARGV.shift || 100).to_i
n = nlines*ncolumns
chars = (?a…?z).to_a
line = (0…ncolumns).inject(“”) { |line, i|
line << chars[i%chars.length]
}
line[-1] = ?\n

iterations.times { eval(cursor).instance_eval {
  Benchmark.benchmark(
    "#{cursor}: #{nlines}x#{ncolumns}\n",16,nil,"total"
  ) { |b|

    total = b.report("insert_before") {
      nlines.times { line.each_char { |ch| insert_before(ch) } }
    }
    i = 0
    total += b.report("left") { i += 1 while left }
    assert_equal(n, i)
    i = 0
    total += b.report("right") { i += 1 while right }
    assert_equal(n, i)
    i = 0
    total += b.report("up") { i += 1 while up }
    assert_equal(nlines, i)
    i = 0
    total += b.report("down") { i += 1 while down }
    assert_equal(nlines, i)
    total += b.report("insert_after") {
      nlines.times { line.each_char { |ch| insert_after(ch) } }
    }
    i = 0
    total += b.report("delete_before") {
      i += 1 while delete_before
    }
    assert_equal(n, i)
    i = 0
    total += b.report("delete_after") {
      i += 1 while delete_after
    }
    assert_equal(n, i)

    [total]

  }
} }

end

bbazzarrakk · October 26, 2007, 3:15pm

James G. wrote:

All of the basic operations occur around a text cursor. The minimal
operations
on/at the cursor should be:

James, thanks for putting all these challenges together, and thanks to
the contributors too.

Just a terminology thing - the proper term is “caret”, not a “text
cursor”.

Todd

bbazzarrakk · October 26, 2007, 3:27pm

On Oct 26, 2007, at 8:15 AM, Todd B. wrote:

James G. wrote:

All of the basic operations occur around a text cursor. The minimal
operations on/at the cursor should be:

Just a terminology thing - the proper term is “caret”, not a “text
cursor”.

While I probably would have called it a caret as well, the term
doesn’t seem completely out of line:

Cursor - Wikipedia

Cursor (user interface) - Wikipedia

James Edward G. II

bbazzarrakk · October 26, 2007, 3:48pm

On Oct 26, 2007, at 8:43 AM, Todd B. wrote:

sure
no one got confused from your description.

Just to be clear, I didn’t write that quiz. Eric M. did.

James Edward G. II

bbazzarrakk · October 26, 2007, 3:43pm

James G. wrote:

On Oct 26, 2007, at 8:15 AM, Todd B. wrote:

While I probably would have called it a caret as well, the term
doesn’t seem completely out of line:

Not out of line in the least. Without resorting to cheesey emoticons
(wink wink, smiley, grin, et al), mine was a passing comment. I am sure
no one got confused from your description.

Again, thanks.

bbazzarrakk · October 26, 2007, 7:46pm

Todd B. wrote:

cursor".

Todd

When the Macintosh first came out, the little blinking vertical bar in
text editors was called the insertion point or caret, but, because of
its shape, it was also called the “stick”.

It doesn’t matter whether you see it as a caret or a stick, as long as
it provides sufficient motivation for the Quiz.

bbazzarrakk · October 26, 2007, 4:27pm

On 10/26/07, James Edward G. II [email protected] wrote:

doesn’t seem completely out of line:

Cursor - Wikipedia

Cursor (user interface) - Wikipedia

Actually, I’d say that caret is a UI/View related term (its the name
of the ^ character, traditionally used in blue-pencil editing to mark
a place to insert test.

Cursor is kind of both a view and a model term, and I think we’re
talking about a model in this quiz.

That said, in general, when I’ve looked at similar code, the analogous
concept usually used is a selection, which represents a contiguous run
of zero or more characters within the buffer. The quiz seems to be
using a degenerate version of this where the length is constrained to
be zero.

–
Rick DeNatale

My blog on Ruby
http://talklikeaduck.denhaven2.com/

bbazzarrakk · October 27, 2007, 11:14pm

On 10/26/07, Rick DeNatale [email protected] wrote:

While I probably would have called it a caret as well, the term
Cursor is kind of both a view and a model term, and I think we’re
talking about a model in this quiz.

Correct. I’ve also seen the term cursor being used with databases
(pointer to what row is being operated on). It has also been used to
mean the same as what rubyists call an “external iterator” (ruby
mainly uses “internal iterators” or what others might call visitors).

Before mice came on the scene the symbol representing the text
insertion point was also called a “cursor”. I didn’t realize that
“cursor” is now mainly used for the mouse now and the text cursor has
been demoted to “caret”. I usually just say “mouse pointer” and “text
cursor”. According to how the term “cursor” is used in data
structures, I think the text cursor/caret deserves to use this term
more than a mouse pointer/cursor does.

That said, in general, when I’ve looked at similar code, the analogous
concept usually used is a selection, which represents a contiguous run
of zero or more characters within the buffer. The quiz seems to be
using a degenerate version of this where the length is constrained to
be zero.

Don’t let the test in this quiz stop you. I just provided a benchmark
for the simple stuff. Feel free to implement other text editor
operations in addition to the simple operations that the benchmark
uses. And don’t let the benchmark restrict your API. If the model
that the benchmark assumes doesn’t match your model (i.e. a selection
in your case?), change the benchmark test to what you need. Also,
achieving the best absolute performance on this benchmark isn’t
necessarily the most important, but you should get reasonable big-O
performance. I’d like to see what data structures some of you come up
with and how far you can take them in terms of functionality. I think
there are lots of solutions with a variety of data structures.

Have fun!

Eric

bbazzarrakk · October 29, 2007, 2:38pm

On 10/26/07, Ruby Q. [email protected] wrote:

A simple string to represent text buffer isn’t efficient
enough because inserting (i.e. typing) and deleting (backspace) in the middle
would result in moving all of the text to the end for each operation.

Here is an inefficient string-based solution:

class StringCaret
# cursor/caret is between char i-1 and i
def initialize(data=“”, i=0)
@data = data
@i = i
end
def insert_before(ch)
@data[@i,0] = (“” << ch)
@i += 1
end
def insert_after(ch)
@data[@i,0] = (“” << ch)
end
def delete_before
@i.nonzero? and @data.slice!(@i-=1)
end
def delete_after
@data.slice!(@i)
end
def left
@i.nonzero? and @i-=1
end
def right
@i<@data.length and @i+=1
end
def up
while @i.nonzero?
@i -= 1
break(true) if @data[@i]==?\n
end
end
def down
while @i<@data.length
break(@i+=1) if @data[@i]==?\n
@i += 1
end
end
end

The benchmark results look like this on my machine:

ruby -r string.rb test.rb 2 StringCaret.new 1000 100 StringCaret.new 2000 100
StringCaret.new: 1000x100
…
total 7.800000 0.030000 7.830000 ( 7.851117)
StringCaret.new: 2000x100
…
total 26.810000 0.100000 26.910000 ( 27.015445)

The run-time almost quadrupled when the data-size doubled, so this is
O(n**2).

An O(n) solution with this minimal functionality is doable in about
the same number of lines as above. But, you have to use another data
structure…

Eric

bbazzarrakk · October 29, 2007, 7:50pm

Now that you’ve submitted a reference implementation, it looks much
more interesting to try to optimize. I bet you’ll get a better
response now.

bbazzarrakk · October 29, 2007, 8:20pm

On 10/29/07, Philip G. [email protected] wrote:

Now that you’ve submitted a reference implementation, it looks much
more interesting to try to optimize. I bet you’ll get a better
response now.

Yes, I probably should have put something like this in the quiz to begin
with.

bbazzarrakk · October 30, 2007, 5:15am

On 10/26/07, Ruby Q. [email protected] wrote:

The task is to implement data structures for efficiently editing text. One
common data structure is a gap buffer:
    http://en.wikipedia.org/wiki/Gap_buffer
Other options to consider include: ropes (see quiz #137), linked lists, simple
strings, or a multi-level combination of data structures (i.e. for lines vs.
characters in a line). There are many data structures that may work efficiently
with simple editing operations, but not all of those data structures will work
well for more complex functionality.

Since I haven’t seen any solutions yet, I’d thought I’d elaborate on
some possible solutions I was thinking of:

conventional gap buffer implemented with a string. The string would
have 3 parts (a couple indices might be used for the markers): before
the cursor, the “gap” (containing garbage), and after the cursor.
When you have no more space in the gap, copy the data to a new string
using a larger gap. Think about how the gap should be sized. I
believe most text editors use a gap buffer.
gap buffer implemented with two strings. Represent the text before
the cursor with one string and the text after the cursor with another
string that is reversed. All operations around the cursor correspond
to operations at the ends of these strings - O(1). Unlike the
conventional gap buffer, you don’t have to deal with reallocation (gap
filing up). Instead let ruby do its string reallocation when a
string’s capacity fills up. The memory efficiency might not be as
good as the conventional gap buffer, though.
gap buffer with deferred gap movement. The “gap” is really only
useful when making edits (insertions/deletions). For simple cursor
movements, you don’t have to “move” the gap immediately. The cursor
position could be changed independently of the gap. But, you’ll have
to synchronize the gap to the cursor position once an edit is needed.
Random access cursor movement can also be done in O(1) time, but
you’ll still have to pay for it if you immediately make an edit at
that new position.
use a rope implementation (quiz #137). You could start with the
inefficient string solution I gave earlier and operate on a rope
instead (making new ones if using an immutable implementation). It
may be difficult to make the solution as fast as the gap buffers for
simple stuff (likely slower by O(log(n))), but ropes will probably be
more powerful for complex functionality. I believe ropes were
invented for text editing (something called Coral I believe).
linked lists. The text could be represented as a linked list of
characters. The cursor position might correspond to one of the linked
list nodes or links. All operations around the cursor should be O(1).
An advantage over gap buffers is that you can efficiently implement
multiple cursors (possibly from different editing panes or even
different users). The downside is the memory usage is higher and
random access might be higher.
use a two level data structure. One data structure could be used
for lines and another for the characters within each line. All of the
above data structures apply to both. For managing lines, you’d need
to use arrays instead of strings in the data structure since the
element is a line instead of a character. Using a two-level data
structure will give better performance for line operations (i.e.
up/down).

Enjoy!

Eric

bbazzarrakk · October 30, 2007, 7:00am

From: “Eric M.” [email protected]

Since I haven’t seen any solutions yet, I’d thought I’d elaborate on
some possible solutions I was thinking of:
[…]

linked lists. The text could be represented as a linked list of
characters. The cursor position might correspond to one of the linked
list nodes or links. All operations around the cursor should be O(1).
An advantage over gap buffers is that you can efficiently implement
multiple cursors (possibly from different editing panes or even
different users). The downside is the memory usage is higher and
random access might be higher.

I was hoping to use this quiz to play around with mmap, but I don’t
think I’ll have time this week.

For fun, I was going to read the file-to-be-edited into a memory
mapped file, breaking the data up into VM-page sized blocks (4K, on
my system I think.)

The blocks would be linked together, and a given block would not
be required to be full. Thus, similar to the gap buffer technique,
if an insert occurred in a full block, it could be split into two
or more partially full blocks. (Adjacent partially full blocks
could be coalesced in such a way as to prevent the accumulation of
lots of partially full blocks, wasting space. (Haven’t fully
thought this through, but it doesn’t seem like it would be
problematic, unless there’s a snag of some sort I’m overlooking.))

Then, I wanted to play with a B-tree to support fast random access
into the block list. I didn’t quite figure this out, but it seems
there should be a way to construct the B-tree such that one could
index into it using the desired file offset, and wind up at the
appropriate block. Seems like I would need to store the byte
count of how much data is held by a block in the B-tree leaf nodes,
and as inserts & deletes occurred these sizes would change, and
would propagate up to the parent, to the root of the B-tree. In
other words, each B-tree node would know the byte count of the
amount of data held by its children.

With a 4K page size, each B-tree node could have lots of children,
and thus the tree would tend to be very shallow. However because
I think I’d want to store the child node sizes in a relative way,
it would preclude doing a binary search at any given node level to
find a desired offset; one would have to sum the sizes at a given
node level to locate the child node one was interested in. . . .
Still the trade-offs seem probably reasonable because most
relative cursor movement could occur without searching the tree
(the blocks are linked together). So the B-tree would be used
just for random-access jumps through the file.

Alternatively, it might be neat to instead construct the B-tree to
organize the file by line offsets. Random access by line number
would probably make more sense in a text editor than random access
by byte offset anyway.

Oh well - anyway, thanks for the quiz. Even though I’ll not
likely find time to try implementing the above, it’s been fun to
think about!

Regards,

Bill

bbazzarrakk · October 30, 2007, 1:56pm

On Oct 30, 2007, at 5:47 AM, Robert D. wrote:

list nodes or links. All operations around the cursor should be O
I wanted to finish what I began with ropes, and I have a long WE ahead
I’m willing to extend this quiz a week if that’s what people want.
In my experience that doesn’t generally fetch more solutions, but I
too am strapped for time and would like to try it. All in favor?

James Edward G. II

bbazzarrakk · October 30, 2007, 11:48am

On 10/30/07, Bill K. [email protected] wrote:

multiple cursors (possibly from different editing panes or even
different users). The downside is the memory usage is higher and
random access might be higher.

I was hoping to use this quiz to play around with mmap, but I don’t
think I’ll have time this week.
Exactly the same here, maybe this is one of the occasions to
prolongate for a week?
I wanted to finish what I began with ropes, and I have a long WE ahead
;).

Robert

bbazzarrakk · October 30, 2007, 2:11pm

On 10/30/07, James Edward G. II [email protected] wrote:

On Oct 30, 2007, at 5:47 AM, Robert D. wrote:

Exactly the same here, maybe this is one of the occasions to
prolongate for a week?
I wanted to finish what I began with ropes, and I have a long WE ahead

I’m willing to extend this quiz a week if that’s what people want.
In my experience that doesn’t generally fetch more solutions, but I
too am strapped for time and would like to try it. All in favor?

I’m in favor. Another quiz during rubyconf may not be a good idea
anyways. I was hoping to at least see some simple solutions right
now. A minimal solution can be done in about 35 lines of code (most
of those lines are class/def/end).

Eric

bbazzarrakk · October 30, 2007, 4:24pm

On 10/30/07, James Edward G. II [email protected] wrote:

In my experience that doesn’t generally fetch more solutions, but I
too am strapped for time and would like to try it. All in favor?

I’m in favor.

Three votes is good enough for me. We’ll add a week for this quiz.

Robert, you are now honor bound to solve it.
Not much to lose;) but I here the message
R.

bbazzarrakk · October 30, 2007, 3:28pm

On Oct 30, 2007, at 8:09 AM, Eric M. wrote:

I’m in favor.

Three votes is good enough for me. We’ll add a week for this quiz.

Robert, you are now honor bound to solve it.

James Edward G. II

bbazzarrakk · October 30, 2007, 4:26pm

s/here/hear/

bbazzarrakk · October 30, 2007, 8:34pm

One Quiz I would love to participate in… If onyl I had the time.

How Text Editors work has always been an interest of mine but sadly one
I’ve
never really have been able to pursue.

Considering this was a long time ago (before I learned Ruby). The idea
that I
had (yes C was my primary language at the time), was just to test the
concept.

Was to create a two-way linked list the length of every character in the
file.
Where each node of the list contained a simple set of boolean values
indicating that the current character is the start/end of a
paragraph/line/word, if it is a space or \n e.t.c. And eventually just
for fun
work on a way to swap parts of the file into and out of memory and a
temporary
file.

I figured that it would be grossly inefficient if every character in the
text
was a segment like that in memory but a little less in depth to code a
‘Hmm I
wonder if this would work’ scratch project then mucking through a BST

I never really had time to play with it through, considering if I wanted
to
use line editor I would just use /bin/ed instead of /usr/bin/vi hehe.
Although
I do admit if I ever did have time… I would probably still do it in C
rather
then Ruby because I’m more comfortable with C’s I/O.

TerryP.