Another Couch Potato question: dealing with classic concurrency conflicts

Hi All,

So now (with some starter help I got here), I played around with Couch
Potato enough to see that I really like it. I have one big problem
though and this is really critical for what I am developing.

Apache boasts that CouchDB does not lock documents the way SQL locks
tables, and that users’ requests are always granted via a queue.
But what about the classic concurrency conflicts that SQL uses locks for
in the first place? For example, suppose there is a document
that each client reads, increases a field in by some value,
and then saves. As you probably know, if two clients read
the document and then both save it, only the increment made by one of
them actually takes effect, instead of the sum of both.

So I wonder: what is the best practice for handling this kind of
situation with Couch Potato, or with CouchDB in general? I really don’t
want to have to fall back on SQL just because of that…

Many thanks,

Oren

On Fri, May 20, 2011 at 12:50 PM, Oren S. [email protected]
wrote:

Apache boasts that CouchDB does not lock documents the way SQL locks
tables, and that users’ requests are always granted via a queue.

First of all, SQL does not lock tables. SQL is a declarative language
which describes relations; there are various RDBMSs around which use
variants of it. SQL itself has no notion of locking: locking is
something that happens inside an RDBMS to ensure certain transactional
properties, and it also depends on the isolation level chosen at the
time.

For example, at the default isolation level, PostgreSQL and Oracle do
not block readers while an update is under way (thanks to MVCC).
Readers simply see an older version of the record - which seems to be
similar to what CouchDB does:
http://couchdb.apache.org/docs/overview.html

But what about the classic concurrency conflicts that SQL uses locks for
in the first place? For example, suppose I have a field in some document
that each client should read, increase by some value and then save. As
you probably know, if two clients read the field and then both save it,
only the addition made by one of them will actually be saved.

Not sure about that. It could be that two versions are saved shortly
after each other, which means that practically nobody sees the first
of the two. See the link above.

So I wonder: what is the best practice for handling this kind of
situation with Couch Potato, or with CouchDB in general? I really don’t
want to have to fall back on SQL just because of that…

Generally this is best discussed in a CouchDB forum, I’d say. :slight_smile: In
the worst case you need to lock manually. I doubt, though, that CouchDB
is intended for such a use case: it may be the wrong tool for the job
if you need to ensure exclusive access to a resource such as a shared
counter.

Cheers

robert

Robert K. wrote in post #999847:

On Fri, May 20, 2011 at 12:50 PM, Oren S. [email protected]
wrote:

Apache boasts that CouchDB does not lock documents the way SQL locks
tables, and that users’ requests are always granted via a queue.

First of all, SQL does not lock tables. SQL is a declarative language
which describes relations; there are various RDBMSs around which use
variants of it. SQL itself has no notion of locking: locking is
something that happens inside an RDBMS to ensure certain transactional
properties, and it also depends on the isolation level chosen at the
time.

For example, at the default isolation level, PostgreSQL and Oracle do
not block readers while an update is under way (thanks to MVCC).
Readers simply see an older version of the record - which seems to be
similar to what CouchDB does:
http://couchdb.apache.org/docs/overview.html

I know that… well… you understand what I meant :slight_smile:

But what about the classic concurrency conflicts that SQL uses locks for
in the first place? For example, suppose I have a field in some document
that each client should read, increase by some value and then save. As
you probably know, if two clients read the field and then both save it,
only the addition made by one of them will actually be saved.

Not sure about that. It could be that two versions are saved shortly
after each other, which means that practically nobody sees the first
of the two. See the link above.

So I wonder: what is the best practice for handling this kind of
situation with Couch Potato, or with CouchDB in general? I really don’t
want to have to fall back on SQL just because of that…

Generally this is best discussed in a CouchDB forum, I’d say. :slight_smile: In
the worst case you need to lock manually. I doubt, though, that CouchDB
is intended for such a use case: it may be the wrong tool for the job
if you need to ensure exclusive access to a resource such as a shared
counter.

Cheers

robert

Apache has a CouchDB mailing list, not a forum. I hate mailing lists,
but I will post a question there too (one must be willing to suffer for
a noble cause). Still, I thought maybe someone here could give me some
more Ruby-oriented advice :slight_smile:

Thanks

Oren

On Fri, May 20, 2011 at 3:11 PM, Oren S. [email protected]
wrote:

properties, and it also depends on the isolation level chosen at the
time.

For example, at the default isolation level, PostgreSQL and Oracle do
not block readers while an update is under way (thanks to MVCC).
Readers simply see an older version of the record - which seems to be
similar to what CouchDB does:
http://couchdb.apache.org/docs/overview.html

I know that… well… you understand what I meant :slight_smile:

Yes, but others might not. I thought the clarification would help. :slight_smile:

So I wonder: what is the best practice for handling this kind of
situation with Couch Potato, or with CouchDB in general? I really don’t
want to have to fall back on SQL just because of that…

Generally this is best discussed in a CouchDB forum, I’d say. :slight_smile: In
the worst case you need to lock manually. I doubt, though, that CouchDB
is intended for such a use case: it may be the wrong tool for the job
if you need to ensure exclusive access to a resource such as a shared
counter.

Apache has a CouchDB mailing list, not a forum. I hate mailing lists,
but I will post a question there too (one must be willing to suffer for
a noble cause).

:slight_smile:

Still, I thought maybe someone here could give me some more
Ruby-oriented advice :slight_smile:

Well, if all your clients are Ruby programs you could of course
implement the locking in Ruby, e.g. by having a lock-server process
which is talked to via DRb. All the tools you need for that are
there: DRb for communication, Mutex and Monitor for locking. Whether
that’s a good solution is a different question altogether. :slight_smile: It’s
always a bit difficult to discuss these things at such a general
level…
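To make the suggestion concrete, here is a minimal sketch of such a DRb lock server. Everything in it (the LockServer class, the token queue, the port choice) is made up for the illustration, and the server and clients share one process purely for brevity; in practice the server would run standalone. A Queue holding a single token is used instead of a raw Mutex so the lock can be released from a different DRb service thread than the one that acquired it.

```ruby
require 'drb/drb'

# A lock server all Ruby clients can talk to over DRb. The single-token
# queue acts as an ownerless mutex: pop blocks until the token is free.
class LockServer
  def initialize
    @tokens = Queue.new
    @tokens.push(:token)   # one token = one lock holder at a time
  end

  def acquire
    @tokens.pop            # blocks until the token is available
  end

  def release
    @tokens.push(:token)
  end
end

# Port 0 lets DRb pick a free port; DRb.uri reports the actual address.
DRb.start_service('druby://localhost:0', LockServer.new)
lock = DRbObject.new_with_uri(DRb.uri)

counter = 0
threads = 4.times.map do
  Thread.new do
    25.times do
      lock.acquire
      begin
        counter += 1       # critical section, serialized by the server
      ensure
        lock.release
      end
    end
  end
end
threads.each(&:join)
puts counter  # 100
```

The same acquire/ensure-release pattern would then wrap each CouchDB read-modify-write cycle.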

Kind regards

robert

Oren S. wrote in post #999838:

Hi All,

So now (with some starter help I got here), I played around with Couch
Potato enough to see that I really like it. I have one big problem
though and this is really critical for what I am developing.

Apache boasts that CouchDB does not lock documents the way SQL locks
tables, and that users’ requests are always granted via a queue.
But what about the classic concurrency conflicts that SQL uses locks for
in the first place? For example, suppose there is a document
that each client reads, increases a field in by some value,
and then saves. As you probably know, if two clients read
the document and then both save it, only the increment made by one of
them actually takes effect, instead of the sum of both.

So I wonder: what is the best practice for handling this kind of
situation with Couch Potato, or with CouchDB in general? I really don’t
want to have to fall back on SQL just because of that…

Many thanks,

Oren

Okay, so here is how it is done. But before I tell you, I have to
apologise to Apache for using the word “boast” above. They do not just
boast - this actually works and is very cool.

So this is what happens with CouchDB when there is a conflict (I am
using Couch Potato, but it works the same with the bare-bones CouchDB
Ruby API):

  1. Client 1 reads revision 1, changes the document content and saves -
    revision 2 is created.
  2. Client 2 reads revision 1 in parallel to client 1, changes the
    content and tries to save - Couch Potato raises a conflict exception.

So now all client 2 has to do is read revision 2, resolve the
conflict between what client 1 wrote (in revision 2) and what it was
attempting to write, and save (revision 3 will be created).

So this is how the Ruby code looks:

  def save_and_if_conflict
    loop do
      begin
        plain_save               # try to save our version
        break
      rescue RestClient::Conflict
        # keep what we tried to write, load the fresh revision,
        # and let the caller's block merge the two
        mine = @kobj
        read
        @kobj = yield(mine, @kobj)
      end
    end
  end

For the example I gave in my question, this will be used like this:

orig = kapa.kobj.counter
kapa.store(orig + myinc)

kapa.save_and_if_conflict { |my, his|
  his.counter = his.counter + my.counter - orig
  his
}

Robert K. wrote in post #999871:

Well, if all your clients are Ruby programs you could of course
implement the locking in Ruby, e.g. by having a lock-server process
which is talked to via DRb. All the tools you need for that are
there: DRb for communication, Mutex and Monitor for locking. Whether
that’s a good solution is a different question altogether. :slight_smile:

Hmmm… well, I thought about that and kind of hoped not to have to use
a lock server, but if I have no other choice then using DRb is a good
idea - thanks!

It’s always a bit difficult to discuss these things at such a general
level…

This is true. Hopefully, I will soon be able to share with the forum
what I am doing and then I will be able to be more specific :slight_smile:

Kind regards

robert

Oren S. wrote in post #1003069:

For the example I gave in my question, this will be used like this:

orig = kapa.kobj.counter
kapa.store(orig + myinc)

kapa.save_and_if_conflict { |my, his|
  his.counter = his.counter + my.counter - orig
  his
}

Given that the second update may also cause a conflict, and this could
happen many times, maybe your API should make the “original” update and
the “when conflicting” update the same block, to avoid duplication.
Then you could call it something like this:

kapa.update_and_save do |doc|
  doc.counter += myinc
end

Oren S. wrote in post #999838:

But what about the classic concurrency conflicts that SQL uses locks for
in the first place? For example, suppose there is a document
that each client reads, increases a field in by some value,
and then saves. As you probably know, if two clients read
the document and then both save it, only the increment made by one of
them actually takes effect, instead of the sum of both.

So I wonder: what is the best practice for handling this kind of
situation with Couch Potato, or with CouchDB in general? I really don’t
want to have to fall back on SQL just because of that…

I don’t use Couch Potato (I wrote my own Ruby API called couchtiny), but
I can give you some generic CouchDB advice.

There are basically two distinct cases to handle in CouchDB, and this
IMO is probably CouchDB’s main design flaw.

Firstly: if the two updates go to the same server, then you will get a
409 HTTP error on the second update, i.e. the second update will be
rejected. It is therefore up to the application to re-fetch the
(fresher) document, re-apply the update, and re-PUT it. In Couch Potato
I expect you would have to rescue this 409 as some sort of exception.

Secondly: if the two updates go to different CouchDB server instances
(presumably replicating to each other periodically), then both
updates will be accepted silently. When the next replication takes
place, both servers will have the document in a “conflicting” state,
meaning that both databases hold both versions of the document.

By default, CouchDB hides this conflicting state. That is, when you do a
GET on the document, you will see just one of the versions (the same
choice will be made on both servers). If there’s a possibility that the
document could be conflicting, you have to ask explicitly for the
conflicting state and rev ids, and then you can fetch the individual
revs.
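As a concrete illustration of what “asking explicitly” means at the HTTP level: a GET with `?conflicts=true` makes CouchDB include a `_conflicts` array of losing rev ids in the document body. The sketch below just parses a hypothetical response of that shape; the doc id, revs and values are all made up.

```ruby
require 'json'

# Hypothetical response body for GET /db/counter?conflicts=true:
# the winning version plus the rev ids of the losing versions.
body = '{"_id":"counter","_rev":"3-a9d1","value":7,"_conflicts":["3-b4e2"]}'

doc    = JSON.parse(body)
winner = doc['_rev']
losers = doc.fetch('_conflicts', [])  # empty when the doc is not conflicted

# Each losing version can then be fetched individually, e.g.
# GET /db/counter?rev=3-b4e2, before merging and deleting it.
puts winner          # 3-a9d1
puts losers.inspect  # ["3-b4e2"]
```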

You can resolve the conflict by fetching those versions, performing some
application-specific logic to make a new merged version of the document,
then writing this back and deleting the conflicting versions.
Unfortunately, just having the two versions available is often not
enough information to resolve the conflict. You could merge more easily
if you had the parent version of the document available; often it is,
but CouchDB doesn’t guarantee that it will be (in particular, if the
user has run a ‘compact’ operation on the database, all previous
non-conflicting revisions will be lost). This means that you may need to
add extra data fields to your document to support merging, e.g. some
sort of history member.

Now: having to maintain two code paths, one for 409-rejected updates and
one for conflicting replicated updates, is an utter pain. One way to
solve this is to use conflicting updates in all situations. You can
force CouchDB to insert a conflicting update, rather than return a 409
response, by using the POST bulk-update mechanism (instead of PUT) and
specifying “all_or_nothing”:true. But this still means that whenever you
fetch a document you need to fetch all the currently live revisions, and
CouchDB’s GET API is not convenient for that (e.g. there’s no
operation which says “fetch me all current versions of this document”,
or there wasn’t when I last looked at this).
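A sketch of what that bulk-update workaround looks like from Ruby - just building the request body, with a made-up doc id and rev; the actual POST is left as a comment:

```ruby
require 'json'

# The document we want to write, including the (possibly stale) _rev.
doc = { '_id' => 'counter', '_rev' => '1-abc', 'value' => 8 }

# Wrapping it in a _bulk_docs payload with all_or_nothing:true asks
# CouchDB to store a conflicting revision instead of answering 409.
payload = { 'all_or_nothing' => true, 'docs' => [doc] }.to_json

# POST this string to http://localhost:5984/mydb/_bulk_docs with
# Content-Type: application/json.
puts payload
```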

I wrote up some more detail on this whole area here:
http://wiki.apache.org/couchdb/Replication_and_conflicts

HTH,

Brian.

Brian C. wrote in post #1003232:

Given that the second update may also cause a conflict, and this could
happen many times, maybe your API should make the “original” update and
the “when conflicting” update the same block, to avoid duplication.
Then you could call it something like this:

kapa.update_and_save do |doc|
  doc.counter += myinc
end

Brian, this is true. In fact, after I sent my “solution” above I
realized that to resolve the conflict, the client should simply repeat
exactly what it was doing to update the document in the first place. I
will post some code based on this observation later.

As to what you said about multiple (replicated) databases: this is
indeed a problem, since in this case replication happens independently
of the clients’ reads and writes. I had not considered this possibility
yet, but it seems to me that if I read all the conflicting revisions, I
should be able to merge them by some repetition of the client’s update
operation.

Finally, I must say that this brings to mind some interesting thoughts
about conflict resolution in the broader sense…

Oren

Oren S. wrote in post #1003233:

Brian, this is true. In fact, after I sent my “solution” above I
realized that to resolve the conflict, the client should simply repeat
exactly what it was doing to update the document in the first place. I
will post some code based on this observation later.

The above is implemented with the following method:

  def update_and_save
    loop do
      begin
        @kobj = yield(@kobj)    # apply the caller's update block
        plain_save
        break
      rescue RestClient::Conflict
        read                    # someone saved first: re-read and retry
      end
    end
  end

So the increment example can look like this:

kapa.update_and_save { |obj| obj.counter += myinc ; obj }
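For readers who want to see the retry pattern in isolation, here is a self-contained simulation of the loop above: an in-memory FakeDB stands in for CouchDB and rejects writes carrying a stale revision, the way CouchDB answers 409. The Conflict class, FakeDB and the revision numbers are all made up for the demonstration.

```ruby
# Stand-in for RestClient::Conflict / CouchDB's 409 response.
class Conflict < StandardError; end

# A tiny in-memory "database" with CouchDB-style optimistic locking:
# a write is accepted only if the caller saw the latest revision.
class FakeDB
  def initialize
    @mutex = Mutex.new
    @rev   = 0
    @value = 0
  end

  attr_reader :value

  def read
    @mutex.synchronize { [@rev, @value] }
  end

  def write(rev, value)
    @mutex.synchronize do
      raise Conflict unless rev == @rev  # stale revision: reject
      @rev += 1
      @value = value
    end
  end
end

# The retry loop: re-read and re-apply the caller's block on conflict.
def update_and_save(db)
  loop do
    rev, value = db.read
    begin
      db.write(rev, yield(value))
      return
    rescue Conflict
      # someone else saved first: loop re-reads and retries the block
    end
  end
end

db = FakeDB.new
10.times.map { Thread.new { update_and_save(db) { |v| v + 1 } } }.each(&:join)
puts db.value  # 10: no increment is lost, however the threads interleave
```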

Sometimes, though, the conflict-resolution actions may be different from
the normal update actions, and then you should still use
save_and_if_conflict.

BTW there is another CouchDB feature you could look at: the _update
handler.

Note, though, that the _update handler only gets one (arbitrary) version
of the document, rather than all of them. It is therefore unable to
perform any conflict resolution, even if it wanted to.

Oren S. wrote in post #1003858:

Sometimes, though, the conflict-resolution actions may be different from
the normal update actions, and then you should still use
save_and_if_conflict.

Logically, I don’t see any good reason why the actions should be
different, at least if you are only updating one document.

Consider the two cases:

(A) No conflict

---->time
Client 1:  *read  *update  *write
Client 2:                            *read  *update  *write

(B) Conflict

---->time
Client 1:  *read  *update  *write
Client 2:      *read  *update              *write

The only difference is that in (B), client 2 reads slightly earlier
(just before client 1 has written). If it had happened to read a little
later, there would have been no conflict, and it would have done the
update just fine.

Therefore, this is a simple race, and you should code so that you get
the same outcome regardless of the race winner. If you didn’t, then your
application would behave non-deterministically.

However, if you are updating multiple documents together, it’s a
different matter. CouchDB provides no concept of a “transaction”; the
best it offers is a POST to _bulk_docs with “all_or_nothing”:true, which
guarantees that either all of the updates are written to the database or
none of them. However, this mode also doesn’t perform any conflict
detection, i.e. you will get conflicting updates just the same as if
they had been replicated in.

I should say at this point that I think CouchDB is an excellent piece of
software, and its incremental map-reduce and incremental replication are
truly awesome. I just think they made a mistake in attempting to hide,
rather than embrace, conflicting updates. The model described by the
Amazon Dynamo paper takes the opposite view: “writes always succeed”.
Whenever you read a document, you see all the current conflicting
versions, which forces the reader to resolve conflicts; and whenever you
write a document, you list all the parent(s) you are superseding.

So I think CouchDB’s API should have embraced this too.

  1. PUT should always succeed (even if it generates a conflict).
  2. GET on a document should always return an array of all live versions
    of the document, not a single arbitrary version. Ditto for bulk fetch
    and views.
  3. PUT on a document should specify a list of the revs it is
    superseding, not a single rev.

Emulating this behaviour is hard. Well, it’s not too tricky to get (1)
if you use POST to _bulk_docs with “all_or_nothing”:true, and you can
hide this in the client-side API.

However, (2) and (3) are a real pain, and inefficient: you have to
explicitly issue multiple GETs to fetch the conflicting revs and the
conflicting versions (especially when processing a view); and when you
resolve a conflict, you have to replace one rev and then explicitly
delete the other revs.

The current HTTP API strongly encourages you to ignore conflicts and
cross your fingers, which of course means applications are built that
way, which means they won’t scale to multi-master environments or
replication with off-line updates. What users will see is that changes
made on one side or the other are simply “lost” - they actually exist in
the database, but the frontend doesn’t show them.

BTW there is another CouchDB feature you could look at: the _update
handler. Here you can write some JavaScript to perform an update of an
existing document on the server, rather than the client having to POST
the complete object back again. This gives you something of a “model”
layer you can use, although it’s limited to updating a single document
at a time.
http://wiki.apache.org/couchdb/Document_Update_Handlers
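For example, here is a sketch of what installing such a handler from Ruby might look like - the handler itself is plain JavaScript stored as a string inside a design document. The design-doc name, handler name and counter field are invented for the example.

```ruby
require 'json'

# A design document carrying an _update handler. The JavaScript
# function(doc, req) receives the stored doc and the request, and
# returns [newDoc, responseBody].
design = {
  '_id' => '_design/counters',
  'updates' => {
    'increment' => <<~JS
      function(doc, req) {
        doc.counter += parseInt(req.query.by, 10);
        return [doc, "ok"];
      }
    JS
  }
}

# PUT design.to_json to http://localhost:5984/mydb/_design/counters,
# then increment server-side with:
#   POST /mydb/_design/counters/_update/increment/DOCID?by=5
puts design.to_json
```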

Regards,

Brian.

Brian C. wrote in post #1003887:

Therefore, this is a simple race, and you should code so that you get
the same outcome regardless of the race winner. If you didn’t, then your
application would behave non-deterministically.

You are right in theory, but for pragmatic reasons you may want to do
something a little different. Take Wikipedia, for example. If there is
an editing conflict then, in theory, the editor who makes the later save
should just be notified that there is a newer revision than the one he
or she was editing, and that the new revision should be used as the
basis. Most likely this editor will repeat the same “Wikipedian” action:
read the new revision, decide what (if anything) still needs to be
changed, then edit and save. Still, to achieve all this in practice, the
MediaWiki software displays a special editing-conflict page - so in
practice it does do something different from a normal update.

I should say at this point that I think CouchDB is an excellent piece of
software, and its incremental map-reduce and incremental replication are
truly awesome. I just think they made a mistake in attempting to hide,
rather than embrace, conflicting updates.

I have to disagree. What they are doing is good enough to let you resort
to some well-known means of ensuring data integrity over replicated
DBs, because all you have to do is make sure that conflicts are always
reported back to, and managed by, the client software. You may, for
example, direct all updates to a single server while the other replicas
are standby backups. If the currently active server fails, you direct
users to one of the backup servers. In this way, even if the new server
has an outdated revision, the clients’ conflict-resolution mechanism can
still be used to do the damage control.