HOWTO: "catching" a segfault in a ruby/dl C library

For my research, I’ve written bindings for the link-grammar[1] library
in ruby, and I am using them to parse all of the sentences in a corpus
of text and insert the parses into a database. We begin with code
which works roughly as follows:

require 'dbi'

$dbh=DBI.connect(...)

#for simplicity, it doesn't matter what the Something is.
sentences=Something.new($dbh)

#Dictionary is an object wrapping the C library. it is a global so
#that the parse method below can see it
$d=Dictionary.new

def putindatabase link
   #for our purposes, it doesn't really matter what this does
   #except to say it uses $dbh which we opened before
end

def parse sentencetext
   sentence=$d.parse(sentencetext) #sentence also wraps the C library
   sentence.linkage[0].links.each do |link|
      putindatabase link
   end
end

#sentences.each fetches every sentence from the database and yields
#each one to the block. it keeps an open connection between yields
sentences.each do |sentence|
   parse sentence
end

The link-grammar library that I’m using has a bug: it segfaults[2]
while freeing resources, under conditions that I haven’t understood
well enough to fix in the C library itself. Arguably it would be nice
to catch the segfault as though it were an exception, so that we could
move on to the next sentence and get on with our lives. But ruby/dl
doesn’t let us do that, and even if it did, the link-grammar library
could be left in an inconsistent state. So we’ll do the next best
thing: fork a subprocess to handle each sentence. This solves a few
problems:

  • If a sentence fails, we’ll be able to move on to the next one.
  • A sentence won’t fail before being put in the database, since the
    problem occurs when freeing the resources used to parse the sentence.
  • We probably won’t segfault at all, because the crash only seems to
    occur under complicated circumstances that involve parsing more than
    one sentence with the same dictionary.
  • We don’t have to clean up properly at all, since the termination of
    the child process after each sentence automatically takes care of
    that for us. (The link-grammar library doesn’t allocate any
    resources that the OS doesn’t know how to dispose of.) This may make
    subprocess termination faster.

So we would like to change our code to say:

sentences.each do |sentence|
   Process.waitpid(fork {parse sentence})
end

(For that last bullet point, we’d also need to edit the link-grammar
bindings, but I won’t worry you with the details of that. I haven’t
actually implemented it yet.)
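
The loop above throws away the child's exit status, so a sentence that
crashed looks the same as one that finished. Here is a minimal sketch of
how the parent could tell the two apart. risky_parse and run_isolated
are hypothetical names, not part of the bindings, and SIGKILL merely
stands in for the crash (Ruby installs its own SIGSEGV handler, so a
real segfault may show up as a different signal).

```ruby
# Hypothetical stand-in for the real parse method. It kills its own process
# on some inputs, the way the buggy C library takes the interpreter down.
def risky_parse(text)
  Process.kill("KILL", Process.pid) if text.include?("crash")
  # ...the real parsing and database inserts would go here...
end

# Runs each item in its own child process and records how the child ended.
def run_isolated(items)
  items.map do |item|
    pid = fork do
      risky_parse(item)
      exit!(0) # leave at once, skipping at_exit handlers in the child
    end
    Process.waitpid(pid)
    status = $? # Process::Status of the child we just waited for
    outcome =
      if status.signaled? then "killed by signal #{status.termsig}"
      elsif status.success? then "ok"
      else "exit status #{status.exitstatus}"
      end
    [item, outcome]
  end
end

run_isolated(["a fine sentence", "please crash here"]).each do |item, outcome|
  puts "#{item}: #{outcome}"
end
```

With something like this, the parent could log which sentences died and
revisit them later, instead of silently skipping them.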

That’s all, right?
Oy, vey! Testing this, we quickly see that DBI can’t put anything in
the database, except for the first sentence. Why? Because the child
process inherits the parent’s database connection, and when the child
exits, it closes that connection, which affects the parent too. (I.e.,
DBI isn’t fork-safe.)

But it turns out DRb is fork-safe, so I create a separate “middleware”
process and make it responsible for the database connection:

require 'drb'
require 'drb/acl'

serverpid = fork do
   dbh=DBI.connect(...)
   Signal.trap("INT"){exit}
   acl=ACL.new(%w{deny all allow 127.0.0.1})
   DRb.install_acl(acl)
   DRb.start_service('druby://localhost:9001',dbh)
   DRb.thread.join
end
sleep 1 #wait for the server to be set up before continuing
DRb.start_service
$dbh=DRbObject.new(nil,'druby://localhost:9001')

## all of the previous code like before
## and then at the end, we kill the DRb server process:

Process.kill("INT",serverpid)

Now we have a painless way to make DBI fork-safe. (Note that DBI
still isn’t thread-safe, and this only works because I’m keeping all
of the child processes serially ordered, but it’s not a bad
technique for working around this kind of error.)
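
The sleep 1 in the server setup is a guess at how long startup takes. A
sketch of one alternative, under stated assumptions: the server writes
its URI down a pipe once it is listening, and the parent blocks on that
pipe instead of sleeping. FakeStore and start_store_server are
hypothetical names; FakeStore stands in for the DBI handle so the
example runs without a database, and the ACL is left out for brevity.

```ruby
require 'drb'

# Hypothetical stand-in for the DBI handle: it only collects rows in memory.
class FakeStore
  def initialize
    @rows = []
  end

  def insert(row)
    @rows << row
    nil
  end

  def count
    @rows.size
  end
end

# Forks a DRb server around front and returns [pid, uri] once it is listening.
def start_store_server(front)
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    Signal.trap("INT") { exit!(0) }
    DRb.start_service('druby://127.0.0.1:0', front) # port 0: let the OS pick
    writer.puts DRb.uri # doubles as the "ready" signal
    writer.close
    DRb.thread.join
  end
  writer.close
  uri = reader.gets.chomp # blocks until the server has written its URI
  reader.close
  [pid, uri]
end

serverpid, uri = start_store_server(FakeStore.new)
DRb.start_service
store = DRbObject.new(nil, uri)

# Each child talks to the server over its own connection, one at a time.
%w[one two three].each do |text|
  Process.waitpid(fork { store.insert(text); exit!(0) })
end

puts store.count
Process.kill("INT", serverpid)
Process.waitpid(serverpid)
```

Letting the OS pick the port also avoids a clash if something else
already holds 9001, at the cost of having to pass the URI back to the
parent.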

Footnotes:
[1] Link Grammar
http://www.abisource.com/downloads/link-grammar
[2] Bug 10391 - make sentence_delete and linkage_delete not depend on each other