Editing Text (#145)

First attempt

  • plain gap buffer implementation without much optimization
  • only basic operations
  • O(n)

Regards
Holger

Solution to ruby quiz #145

Ruby Quiz - Editing Text (#145)

by Holger

GapBuffer works similar to StringCaret

Some remarks:

- if gap size is zero before insert it will be set to 64

- data buffer will never shrink → gap might be very large

- lazy up and down methods, but cursor remains in same column (if fits

in
line)

class GapBuffer

data is in @data

gap starts @gap_start

gap ends @gap_start + @gap_len

def initialize(data=“”, i=0)
@data = data
@gap_start = i
@gap_len = 0
@GAP = " "*64
end

def insert_before(ch)
if @gap_len.zero?
@data[@gap_start, 0] = @GAP
@gap_len = @GAP.length
end
@data[@gap_start] = ch
@gap_start += 1
@gap_len -= 1
end

def insert_after(ch)
if @gap_len.zero?
@data[@gap_start, 0] = @GAP
@gap_len = @GAP.length
end
@data[@gap_start+@gap_len-1] = ch
@gap_len -= 1
end

def delete_before
return if @gap_start.zero?
@gap_start -= 1
@gap_len += 1
@data[@gap_start]
end

def delete_after
return if @gap_start+@gap_len >= @data.length
@gap_len += 1
@data[@gap_start+@gap_len-1]
end

def left
return if @gap_start.zero?
@data[@gap_start+@gap_len-1] = @data[@gap_start-1]
@gap_start -= 1
@data[@gap_start]
end

def right
return if @gap_start+@gap_len>[email protected]
@data[@gap_start] = @data[@gap_start + @gap_len]
@gap_start += 1
@data[@gap_start - 1]
end

def up
col = column
cursor = @gap_start-col
return if cursor.zero?
cursor_line = @data.rindex(?\n, cursor-2)
cursor_line = 0 if cursor_line.nil?
cursor_line += col+1
if cursor_line > cursor-1
cursor_line = cursor-1
end
left while @gap_start > cursor_line
true
end

def down
col = column
cursor = @data.index(?\n, @gap_start + @gap_len)
return if cursor.nil?
cursor_line = cursor+1+col
cursor = @data.index(?\n, cursor+1)
cursor = @data.length if cursor.nil?
cursor_line = cursor if cursor_line > cursor
right while @gap_start + @gap_len < cursor_line
true
end

def position
@gap_start
end

def column
lbreak = @data.rindex(?\n, @gap_start-1)
lbreak.nil? ? @gap_start : (@gap_start-(lbreak+1))
end

def to_s
return @data[0, @gap_start]+@data[@gap_start+@gap_len,
@data.length-@gap
_start-@gap_len]
end
end

GapBuffer.new: 1000x100
insert_before 0.266000 0.000000 0.266000 ( 0.266000)
left 0.328000 0.000000 0.328000 ( 0.328000)
right 0.406000 0.000000 0.406000 ( 0.407000)
up 0.375000 0.000000 0.375000 ( 0.375000)
down 0.422000 0.000000 0.422000 ( 0.421000)
insert_after 0.391000 0.000000 0.391000 ( 0.391000)
delete_before 0.234000 0.000000 0.234000 ( 0.234000)
delete_after 0.313000 0.000000 0.313000 ( 0.313000)
total 2.735000 0.000000 2.735000 ( 2.735000)
GapBuffer.new: 2000x100
insert_before 0.547000 0.000000 0.547000 ( 0.547000)
left 0.672000 0.000000 0.672000 ( 0.671000)
right 0.796000 0.000000 0.796000 ( 0.797000)
up 0.735000 0.000000 0.735000 ( 0.735000)
down 0.875000 0.000000 0.875000 ( 0.875000)
insert_after 0.797000 0.000000 0.797000 ( 0.796000)
delete_before 0.468000 0.000000 0.468000 ( 0.469000)
delete_after 0.610000 0.000000 0.610000 ( 0.609000)
total 5.500000 0.000000 5.500000 ( 5.499000)
GapBuffer.new: 3000x100
insert_before 0.890000 0.000000 0.890000 ( 0.891000)
left 1.016000 0.000000 1.016000 ( 1.015000)
right 1.188000 0.000000 1.188000 ( 1.188000)
up 1.140000 0.000000 1.140000 ( 1.140000)
down 1.360000 0.000000 1.360000 ( 1.360000)
insert_after 1.156000 0.016000 1.172000 ( 1.171000)
delete_before 0.797000 0.000000 0.797000 ( 0.813000)
delete_after 0.953000 0.000000 0.953000 ( 0.953000)
total 8.500000 0.016000 8.516000 ( 8.531000)

On 10/30/07, James Edward G. II [email protected] wrote:

I’m willing to extend this quiz a week if that’s what people want.
In my experience that doesn’t generally fetch more solutions, but I
too am strapped for time and would like to try it. All in favor?

I’m sorry, I won’t have time this week anyway (my second baby was born
on Monday, so not much time to use the computer :-). This was an
interesting quiz, I would have liked to have time to give it a try.

Jesus.

On Nov 1, 2007, at 7:24 AM, Jesús Gabriel y Galán wrote:

On 10/30/07, James Edward G. II [email protected] wrote:

I’m willing to extend this quiz a week if that’s what people want.
In my experience that doesn’t generally fetch more solutions, but I
too am strapped for time and would like to try it. All in favor?

I’m sorry, I won’t have time this week anyway (my second baby was born
on Monday, so not much time to use the computer :-).

Congratulations on the new child!

James Edward G. II

On 10/26/07, Ruby Q. [email protected] wrote:

    http://en.wikipedia.org/wiki/Gap_buffer

Here is a strange solution of mine.

I start by partially implementing a string class that works equally
well on the front and the back (double ended queue string). My
array.c patch from a couple years ago did something like this, but the
data structure below is a bit more memory efficient since it can share
a single “gap” and recirculates unused space better. Basically, the
implementation below is a circular buffer that reallocates when it is
full. It would be great if the standard String and Array were
symmetrical like this. It would increase the ways to use simple
strings and arrays.

Next I use this double ended queue string to implement something like
a gap buffer. You can think of the gap as being at the ends of of
this double-ended string. Text before the cursor is at the end of the
string and text before the cursor is at the beginning.

The performance is O(n), but not nearly as fast as some a simple
solution (5X slower). If String was symmetical, I would use that and
it would be a different story.

If I have time, I may try this DQString in a linked-list
data-structure…

dqstring.rb:

characters can be in either of these ranges:

gapped : @begin…length, 0…@end

contiguous : @begin@end

inheriting String only for memory efficiency

inherited methods won’t necessarily make sense

class DQString < String
def initialize(data=“”)
@begin = 0
@end = data.length
super(data)
end
def inspect
“#<DQString: begin=#{@begin}, end=#{@end}, #{super}>”
end
def length
if (len = @end-@begin)>=0
len
else
super+len
end
end
def << (ch)
if @end==size
if @begin>1
# use the front, making a gap in the middle
self[0] = ch
@end = 1
self
else
@end += 1
# let ruby realloc when needed
super(ch)
end
else
self[@end] = ch
@end += 1
if @end==@begin
# double the size when the gap becomes empty
@begin += size
self.replace(self+self)
else
self
end
end
end
def >> (ch)
if @begin.zero?
@begin = size-1
if @end<@begin
# use the back, making a gap in the middle
self
else
# double the size to make room at the front
@end += size
self.replace(self+self)
end
else
@begin -= 1
if @begin==@end
# double the size when the gap becomes empty
@begin += size
self.replace(self+self)
else
self
end
end
ensure
self[@begin] = ch
end
def pop
if @end==@begin
nil
else
@end -= 1
ch = self[@end]
if (len = @end-@begin)>=0
# remove excess trailing space if too much
self.slice!(-len,len) if size-@end>(len<<1)
elsif @end.zero?
len = (@end=size)-@begin
if @begin>(len<<1)
# remove excess leading space
self.slice!(0, len)
@begin -= len
@end -= len
end
end
ch
end
end
def shift
if @begin==@end
nil
else
ch = self[@begin]
@begin += 1
if (len = @end-@begin)>=0
if @begin>(len<<1)
# remove excess leading space
self.slice!(0, len)
@begin -= len
@end -= len
end
elsif @begin==size
# remove excess trailing space if too much
self.slice!(-@end,@end) if (@end<<1)+len<0
@begin = 0
end
ch
end
end
end

dqcursor.rb:

require ‘dqstring’

class DQCursor
def initialize
@data = DQString.new
@nafter = 0
end
def insert_before(ch)
@data << ch
end
def insert_after(ch)
@nafter += 1
@data >> ch
end
def delete_before
@data.length>@nafter and @data.pop
end
def delete_after
@nafter.nonzero? and (@nafter-=1; @data.shift)
end
def left
@data.length>@nafter and (@nafter+=1; @data >> @data.pop)
end
def right
@nafter.nonzero? and (@nafter-=1; @data << @data.shift)
end
def up
nbefore = @data.length-@nafter
while nbefore.nonzero?
nbefore -= 1
@data >> ([email protected])
return(true) if ch==?\n
end
ensure
@[email protected]
end
def down
while @nafter.nonzero?
@nafter -= 1
@data << ([email protected])
return(true) if ch==?\n
end
end
end

On 10/30/07, Holger [email protected] wrote:

line)
Hi Holger,

Thanks for the solution. One comment though. This solution is not
O(n). Keep in mind that every time you increase the gap (insert in
the middle of a string), it is an O(n) operation (written in C
though). For each 64 characters that you insert, you’ll hit this
case. For insert, you have one component that is O(n) and another
that is O(n*n/64)~O(n2). For large n, the n2 component will
dominate, so it is O(n2). For the sizes you are looking at, the
n
2 component isn’t dominating because its coefficient is small (1/64

  • insert written in C).

With a small change, you can make this order n. Hint: don’t increase
the gap by a constant amount.

Eric

James Edward G. II wrote:

Congratulations on the new child!

James Edward G. II

Aye congratulations !

TerryP.

James Edward G. II wrote:

Congratulations on the new child!

James Edward G. II

Aye congratulations !

TerryP.

Hi Eric,

you’re right, I missed this point with my first solution. One simple way
to
improve this – as you mentioned in your hint – use a fraction A of the
complete gap length. This way we will have O(n).

Nevertheless, I wanted to send in a second solution, which will take
care of
all the other cases (better multi-lines, …) this time with two strings
as
data buffer. This solution is still quite handy, but unfortunately not
yet
finished. I think, with two buffers (first before the gap, second after)
the
O(n) is always valid.

Unfortunately, I think I don’t have enough time to complete the full
solution, maybe the first part with two strings.

Holger

2007/11/2, Eric M. [email protected]:

My solution is a deque… The text before the cursor is kept
left-to-right, while the part after the cursor is reversed. This makes
for some very simple code for the basic operations, using mostly
push/pop. It might be worthwhile to not reverse the data, and make the
code slightly more assymetric by combining the push/pop with
unshift/shift. (Didn’t try that, though…)

Initially, the “left” method was @post.push(@prev.pop), with the
“right” method similar. This proved to be terribly slow, so I kept a
“temporary” cursor that would collapse multiple left/right operations
into a single “sift” (i.e. moving n chars from one to the other). The
“sync” calls ensure that we do that sift before other operations.

The most complex methods (and slowest) here are “up” and “down”.
Breaking the internals down into multiple lines rather than just the
@prev/@post pair, or somehow keeping track of the newlines would help
with the speed of up/down, but I got lazy. =)

class DequeBuffer
def initialize(data = “”, i = 0)
@prev, @post = data[0, i].unpack(“c*”),
data[i…-1].unpack(“c*”).reverse
@curs = 0
end

def insert_before(ch)
sync
@prev.push(ch)
end

def insert_after(ch)
sync
@post.push(ch)
end

def delete_before
sync
@prev.pop
end

def delete_after
sync
@post.pop
end

def left
@curs -= 1 if @curs > [email protected]
end

def right
@curs += 1 if @curs < @post.length
end

def up
sync
c = @prev.rindex(?\n)
if c
c -= @prev.length
sift© # move to end of prev line
d = (@prev.rindex(?\n) || -1) - @prev.length - c
sift(d) if d < 0 # move to column
true
end
end

def down
sync
c = @post.rindex(?\n)
if c
c -= @post.length
sift(-c) # move to start of next line
d = (@post.rindex(?\n) || -1) - @post.length - c
sift(-d) if d < 0 # move to column
true
end
end

def to_s
(@prev + @post.reverse).pack(“c*”)
end

private

def sift(n)
if n < 0
@post.concat(@prev.slice!( n, -n).reverse)
elsif n > 0
@prev.concat(@post.slice!(-n, n).reverse)
end
end

def sync
sift(@curs)
@curs = 0
end
end

Running the test using simple string solution as comparison:

ruby -r deque.rb edit_test.rb 1 Simple.new 2000 100 DequeBuffer.new 2000
100
Simple.new: 2000x100
insert_before 0.890000 0.000000 0.890000 ( 0.889501)
left 0.170000 0.000000 0.170000 ( 0.173849)
right 0.190000 0.000000 0.190000 ( 0.184943)
up 0.200000 0.000000 0.200000 ( 0.204166)
down 0.210000 0.000000 0.210000 ( 0.205926)
insert_after 4.790000 0.010000 4.800000 ( 4.802009)
delete_before 3.480000 0.000000 3.480000 ( 3.484238)
delete_after 1.760000 0.010000 1.770000 ( 1.764394)
total 11.690000 0.020000 11.710000 ( 11.709026)
DequeBuffer.new: 2000x100
insert_before 0.280000 0.000000 0.280000 ( 0.285044)
left 0.180000 0.000000 0.180000 ( 0.182072)
right 0.160000 0.000000 0.160000 ( 0.160996)
up 1.530000 1.850000 3.380000 ( 3.373222)
down 0.780000 0.920000 1.700000 ( 1.714187)
insert_after 0.290000 0.000000 0.290000 ( 0.285273)
delete_before 0.250000 0.000000 0.250000 ( 0.251946)
delete_after 0.250000 0.000000 0.250000 ( 0.257460)
total 3.720000 2.770000 6.490000 ( 6.510201)

On 10/26/07, Ruby Q. [email protected] wrote:

The task is to implement data structures for efficiently editing text. One
common data structure is a gap buffer:

    http://en.wikipedia.org/wiki/Gap_buffer

Other options to consider include: ropes (see quiz #137), linked lists, simple
strings, or a multi-level combination of data structures (i.e. for lines vs.
characters in a line). There are many data structures that may work efficiently
with simple editing operations, but not all of those data structures will work
well for more complex functionality.

Here is another solution :

http://rubyquiz.com/hosted_solutions/145/eric/linked.rb

http://rubyquiz.com/hosted_solutions/145/eric/edit_test2.rb

In addition to the minimal functionality, I implemented a method that
inserts a file in the text (#read). But, it doesn’t read the entire
file into memory (it actually only needs to hold one character of the
file at a time). Only edits to the file are held in memory.

I implemented this by using a linked list of text buffers. Each
buffer is a gap buffer or a slice of an IO (begin/end positions).
Although the gap buffer I implemented (using two strings) is simple
and fast in ruby, it isn’t as memory efficient as a normal gap buffer
since each string effectively has a gap (where a normal gap buffer has
one shared gap between before/after text).

I changed the API for this implementation. All of the methods also
take a lambda/proc that is called when switching from one buffer to
another (updating the current cursor object). Another possibility
which keeps the original API is to add a wrapper class to do this, but
it won’t be nearly as efficient because of the extra layer (the
performance cost of the linked list is almost zero using the new API).
I also added some testing of file "read"s and
moving/inserting/deleting through files. SplitGapCursor can actually
be used with the old or new benchmarks (just don’t link it to another
buffer). Here are some results (total measures the same as before and
ftotal measures operations on file text) :

SplitGapCursor: 4000x100
insert_before 0.360000 0.000000 0.360000 ( 0.359155)
left 0.630000 0.000000 0.630000 ( 0.632066)
right 0.620000 0.000000 0.620000 ( 0.623372)
up 0.480000 0.000000 0.480000 ( 0.470801)
down 0.480000 0.000000 0.480000 ( 0.478180)
insert_after 0.340000 0.000000 0.340000 ( 0.343666)
delete_before 0.500000 0.000000 0.500000 ( 0.496911)
delete_after 0.500000 0.000000 0.500000 ( 0.501599)
read 0.000000 0.000000 0.000000 ( 0.000031)
up 0.480000 0.000000 0.480000 ( 0.475890)
read 0.000000 0.000000 0.000000 ( 0.000025)
delete_before 0.970000 0.000000 0.970000 ( 0.971401)
right 0.510000 0.000000 0.510000 ( 0.502776)
left 0.560000 0.000000 0.560000 ( 0.559028)
down 0.550000 0.000000 0.550000 ( 0.552482)
up 0.520000 0.000000 0.520000 ( 0.516446)
delete_after 0.920000 0.000000 0.920000 ( 0.920615)
total 3.910000 0.000000 3.910000 ( 3.905750)
ftotal 4.030000 0.000000 4.030000 ( 4.022773)

Notice the read time (0). But, I used StringIO instead of a real file
for testing.

On 11/5/07, Matthew M. [email protected] wrote:

My solution is a deque… The text before the cursor is kept
left-to-right, while the part after the cursor is reversed. This makes
for some very simple code for the basic operations, using mostly
push/pop.

I’m glad to see someone try an Arrays instead of a Strings. You’d
think the run-time would be similar between the two since the number
of method calls should be the same (#<< and #slice(-1) instead of
#push and #pop). But, the memory should be about 4X-8X though since
each array slot takes 32/64-bit object pointer/handle instead of an
8-bit character.

It might be worthwhile to not reverse the data, and make the
code slightly more assymetric by combining the push/pop with
unshift/shift. (Didn’t try that, though…)

Don’t expect very much performance since unshift is usually O(n).

Initially, the “left” method was @post.push(@prev.pop), with the
“right” method similar. This proved to be terribly slow, so I kept a
“temporary” cursor that would collapse multiple left/right operations
into a single “sift” (i.e. moving n chars from one to the other). The
“sync” calls ensure that we do that sift before other operations.

Odd. I tried both ways with String and it was faster to do @post <<
@prev.slice(-1). Since the operations I used were O(1), I think it
boiled down to minimizing the number of method calls. Using deferred
“sifting” resulted in more complexity and method calls. I wouldn’t be
surprised if there is some performance bug with @post.push(@prev.pop
or return).

The most complex methods (and slowest) here are “up” and “down”.
Breaking the internals down into multiple lines rather than just the
@prev/@post pair, or somehow keeping track of the newlines would help
with the speed of up/down, but I got lazy. =)

These are way too slow. It looks O(n**2). You should look to see
what happens when you double the size. I think you might have hit an
Array COW (copy-on-write) performance/memory bug(/leak?). I’ve seen
plenty of them.

Why not use the deferred “sifting” like you do for left/right instead?
This would avoid the problem.

Eric

Why not use the deferred “sifting” like you do for left/right instead?
This would avoid the problem.

Ten minutes after I posted my solution, I asked myself the same thing.
=)

I’ll try a rewrite if I get a chance, and look into your other comments
as well.

Sort of late, but here’s an approach I call a “patch buffer”.
Basically, the idea is to accumulate edits as cheaply as possible using
a sort of “patch,” then perform them all when the cursor moves.
One-character movement could be implemented by editing the patch, but
that’s more work… There are almost certainly some off-by-one bugs
lurking in it, but luckily the benchmark test doesn’t mix operations
very much.

On Thursday 08 November 2007 12:12 pm, Sean S. wrote:

Sort of late, but here’s an approach I call a “patch buffer”.
Basically, the idea is to accumulate edits as cheaply as possible using
a sort of “patch,” then perform them all when the cursor moves.

Nice idea!

(I didn’t look at the code yet, but I can imagine what you might have
done.)

Randy K.

On 11/8/07, Sean S. [email protected] wrote:

Sort of late, but here’s an approach I call a “patch buffer”.
Basically, the idea is to accumulate edits as cheaply as possible using
a sort of “patch,” then perform them all when the cursor moves.
One-character movement could be implemented by editing the patch, but
that’s more work… There are almost certainly some off-by-one bugs
lurking in it, but luckily the benchmark test doesn’t mix operations
very much.

If you start mixing edits and movements, I don’t think the performance
will
be very good because a fixup is O(n) after some edits. Consider a
sequence
of n (insert_before;left). Each left will do a full fixup making this
sequence O(n**2).

Eric

I wrote:

(try attached).

X-(

Can you hear me now?

Eric M. wrote:

If you start mixing edits and movements, I don’t think the performance
will
be very good because a fixup is O(n) after some edits. Consider a
sequence
of n (insert_before;left). Each left will do a full fixup making this
sequence O(n**2).

It’s about 5x slower than a gap buffer in this case for the 1000x100
size, but breaks even by the time you do 20 insert_before()s between
left()s. On the other hand, I think any array-based approach has an
O(n^2) worst case – there’s always some locality assumption. For
example, a gap buffer will be quadratic for a sequence of n
(insert_before; up; insert_after; down) on a string with longish lines,
since it has to shuffle half the buffer back and forth (try attached).
The patch approach seems reasonable given the limitations of working in
a high-level interpreted language with fast strings, though it wouldn’t
make sense in C.