Request for feedback on my new data storage library

Detlef_R · May 11, 2015, 11:17pm

I’ve been working on a data storage library for Ruby on and off for a
while now. The idea was to create a new interface for storing data on
persistent devices that met the following requirements:

1. It must be transparent. In the majority of use cases, classes that
   implement the interface should behave like any other Ruby object.
2. It must be easy to use and implement. It should be as simple as
   possible to just make something that works, but powerful enough to do
   more advanced things (such as implementing database-like functionality).
3. It must be efficient. While some trade-offs have to be made for the
   sake of keeping things simple, it needs to be fast enough to actually be
   useful.

A key data store was the obvious starting point, because there are
plenty of key data stores available for Ruby that satisfy the second and
third requirements. However, key data stores are generally very simple,
and leave you on your own to implement more advanced functionality. So
what I felt was needed was something to bridge the gap between a key
data store and the more advanced capability needed for a program with
complex data storage requirements.

The interface I was designing evolved over time, and eventually settled
into a driver-based structure (similar to Moneta). The driver structure
is a common interface for key data stores that allows them to be used
interchangeably as storage devices. But as I continued to work on
things, it quickly became apparent that most key data storage engines
for Ruby had shortcomings that would limit their usefulness. In
particular, I needed two features that the majority of key data stores
did not have: built-in transactions, and the ability to iterate through
keys from arbitrary locations (both forward and reverse).
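
To make the driver idea concrete, here is a minimal sketch of what a
common driver interface could look like. The names SketchDriver,
MemoryDriver, FileDriver and store_greeting are hypothetical and are not
taken from the library; the point is only that calling code does not
need to know which engine sits behind the interface.

```ruby
require "tmpdir"

# Hypothetical base class: every driver answers the same three calls.
# None of these names come from the library described in this post.
class SketchDriver
  def get(key)
    raise NotImplementedError
  end

  def put(key, value)
    raise NotImplementedError
  end

  def delete(key)
    raise NotImplementedError
  end
end

# Driver backed by a plain Hash (nothing is persisted).
class MemoryDriver < SketchDriver
  def initialize
    @data = {}
  end

  def get(key)
    @data[key]
  end

  def put(key, value)
    @data[key] = value
  end

  def delete(key)
    @data.delete(key)
  end
end

# Driver backed by one file per key inside a directory.
class FileDriver < SketchDriver
  def initialize(dir)
    @dir = dir
  end

  def get(key)
    path = path_for(key)
    File.exist?(path) ? File.binread(path) : nil
  end

  def put(key, value)
    File.binwrite(path_for(key), value)
    value
  end

  def delete(key)
    path = path_for(key)
    File.delete(path) if File.exist?(path)
  end

  private

  # Hex-encode the key so that any string is a safe file name.
  def path_for(key)
    File.join(@dir, key.unpack1("H*"))
  end
end

# Calling code only relies on the shared interface.
def store_greeting(driver)
  driver.put("greeting", "hello")
  driver.get("greeting")
end
```

Either driver can be passed to store_greeting without changing it, which
is the interchangeability the driver structure is meant to provide.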

Transactions would not have been impossible to implement, but doing so
in a way that was efficient in pure Ruby was problematic. Even adding a
small amount of code to read/write calls to the drivers in Ruby could
have a severe impact on the performance of the engine. Implementing a
way to iterate through keys was even more problematic. The only way to
do it efficiently was to keep an index of keys separate from the main
key data store, which would have created a huge overhead cost in terms
of data storage. Eventually I settled on three engines that could
satisfy the requirements: KyotoCabinet, LevelDB, and LMDB.
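
As a rough illustration of the separate key index mentioned above, the
sketch below keeps a sorted array of keys next to a Hash so that
iteration can start from an arbitrary key in either direction. The
class IndexedStore and its methods are hypothetical; this is an
in-memory stand-in for the idea, not how any of the engines do it. Note
that every key ends up stored twice, which is the overhead in question.

```ruby
# Hypothetical key/data store with a separate sorted key index.
class IndexedStore
  def initialize
    @data = {}
    @keys = [] # sorted copy of every key: this is the extra storage cost
  end

  def []=(key, value)
    unless @data.key?(key)
      # Binary search for the insertion point that keeps @keys sorted.
      pos = @keys.bsearch_index { |k| k >= key } || @keys.size
      @keys.insert(pos, key)
    end
    @data[key] = value
  end

  def [](key)
    @data[key]
  end

  # Yield key/value pairs starting at `start`, which does not have to be
  # an existing key. Forward iteration begins at the first key >= start,
  # reverse iteration at the last key <= start.
  def each_from(start, reverse: false)
    if reverse
      pos = @keys.bsearch_index { |k| k > start } || @keys.size
      (pos - 1).downto(0) { |i| yield @keys[i], @data[@keys[i]] }
    else
      pos = @keys.bsearch_index { |k| k >= start } || @keys.size
      pos.upto(@keys.size - 1) { |i| yield @keys[i], @data[@keys[i]] }
    end
  end
end

store = IndexedStore.new
%w[d b a c].each { |k| store[k] = k.upcase }
store.each_from("b") { |key, value| puts "#{key} => #{value}" }
```

Engines such as LMDB keep their keys ordered internally, so they can
offer this kind of cursor without a second copy of the keys.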

The first one I tried creating a driver for was KyotoCabinet, because
its design seemed to be the most closely aligned with what I was trying
to achieve. But I ended up abandoning work on that driver because bugs
in the currently unmaintained library were preventing me from making
progress. Next I tried LevelDB, but its limited transaction system
proved too difficult to work around in an efficient way, and I
eventually abandoned that driver as well. Finally, I tried writing a
driver for LMDB, and while it wasn’t perfect, it did satisfy all the
requirements in a reasonably efficient way. In retrospect, the driver
structure may have been a mistake, since only one engine that currently
exists ended up meeting the needs of the project, and even then there
was more overhead than there really needed to be. If this project were
to be deemed useful by the community, I think a custom storage engine
would be the next logical step.

In addition to work on the storage drivers, I spent a lot of time
working out a structure for the interface that would be both easy to
implement, and flexible enough to be able to work for a variety of
different purposes. Here is an overview of the classes I created:

Storage::Node – The basic storage class that all other storage classes
should inherit from. It provides an interface for storing a variety of
different data types, as well as an interface for storing other storage
nodes. It’s similar to the file/folder structure most operating systems
use, except that it is unified, allowing both data and child nodes to be
stored in the same object.

Storage::Device – The basic driver class that all storage drivers should
inherit from. It provides a common interface for various data storage
engines to be used interchangeably as storage devices, similar to the
way a disk partition would be used on an operating system.

Storage::Transaction – A class that provides a common interface for
executing transactions on a storage node, or a group of storage nodes.

Storage::Index – A class that defines an interface for storing storage
nodes of a specific type or types in a way that allows them to be easily
created and indexed. This class is a little sparse, and I’m not really
sure how useful it is in practice, because the other index classes don’t
even inherit from it.

Storage::Data – A class used to store data directly by key. It can be
used the same way as a regular key data store would be. It cannot
contain any child nodes, but it may be the child of another node.

Storage::ArrayIndex – A class used to store storage nodes indexed by
integer, with array-like functionality.

Storage::HashIndex – A class used to store storage nodes indexed by the
values of the data they contain. It can be used in a way similar to a
database, and supports simple queries.
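
To give a feel for two of the ideas in this overview (a unified node
that holds both data and children, and an index keyed on the values
nodes contain), here is a small in-memory sketch. SketchNode and
SketchHashIndex are hypothetical names, and the where method is an
assumption made for illustration; the real classes in the repository
may look quite different.

```ruby
# Hypothetical unified node: holds its own data and its child nodes.
class SketchNode
  attr_reader :data, :children

  def initialize
    @data = {}
    @children = {}
  end

  def [](name)
    @children[name]
  end

  def []=(name, node)
    @children[name] = node
  end
end

# Hypothetical index over nodes, keyed by the values of chosen fields.
class SketchHashIndex
  def initialize(*fields)
    @fields = fields
    # field name => { field value => [nodes with that value] }
    @by_value = Hash.new do |h, field|
      h[field] = Hash.new { |g, value| g[value] = [] }
    end
  end

  # Values are read once, at insertion time; later changes to node.data
  # are not reflected in this simplified index.
  def add(node)
    @fields.each { |field| @by_value[field][node.data[field]] << node }
    node
  end

  # Simple equality query: nodes that match every field => value pair.
  def where(conditions)
    matches = conditions.map do |field, value|
      @by_value[field].fetch(value, [])
    end
    matches.reduce(:&) || []
  end
end

root = SketchNode.new
first = SketchNode.new
first.data["player"] = "x"
first.data["game"] = "1"
second = SketchNode.new
second.data["player"] = "x"
second.data["game"] = "2"
root["first"] = first # a node can carry data and children at once

index = SketchHashIndex.new("player", "game")
index.add(first)
index.add(second)
index.where("player" => "x", "game" => "2") # matches only `second`
```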

Here is an example that demonstrates how to create a device, and use it
to store data:

require "Node/Storage"
require "Node/Storage/Transaction"
require "Node/Storage/Device/LMDB"

# create the root device. the root device is assigned by sending it to
# Device.root, or by assigning it to Device[:root]. all nodes are created
# on the root device by default. if no root device is assigned, the memory
# device (Device[:memory]) is used instead.
device = Storage::Device.root = Storage::Device::LMDB::Node.new("data",
  {:mapsize => 33554432, :nosubdir => true})

# get the root storage node of the device. if no device is given, the
# root node of the root device will be returned. all devices have a root
# node, which is of the class Storage::Node. it should be thought of like
# the root folder in a traditional file system.
root = Storage.root(device)

# start a transaction on the root node. while it is possible to write
# data to a storage node outside of a transaction, it is usually not a
# good idea. it is almost always slower (depending on the device
# implementation), and it can potentially leave a storage node in a
# corrupted state. note that some details of how transactions are handled
# are device specific. for example, while it is only guaranteed that nodes
# passed as arguments to Transaction.new will be included in the
# transaction, in the case of the LMDB driver, all transactions are global
# for the device, meaning that all access to the device is protected while
# a transaction is occurring.
Storage::Transaction.new(Storage.root) do

  # after opening a device, it is usually a good idea to clean it, which
  # deletes dead nodes that might not have been removed from the device
  # before it was closed. this should probably become automatic eventually,
  # but the current implementation is very slow on devices containing a
  # large number of nodes because it requires the entire device to be
  # scanned, so for now it is left as optional
  device.clean

  # create a basic storage node. although usually you would want to
  # implement your own nodes by inheriting from the Storage::Node class, a
  # basic storage node can still be used if you desire to do so. note that
  # while the device parameter is being used in this example for the sake of
  # demonstration, it is actually optional, because we already set it as the
  # root device earlier
  node = Storage::Node.new(device)

  # make the new node the child of the root node. all nodes must exist in
  # a hierarchy that starts with the root node in order to be stored
  # persistently. while you can create a node and assign data to it without
  # making it a child of another node, unless it exists in the hierarchy
  # that starts with the root node, it will be deleted when the node object
  # is garbage collected
  root["test"] = node

  # write data to the node. normally you would build an interface that
  # would allow you to do this with accessor methods, but it can also be
  # done directly, as long as the data can be represented as a string
  node.data["foo"] = "bar"

  # changes are committed when the transaction block is closed, unless an
  # exception is raised from within the block, in which case all changes are
  # reverted.
end
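
The commit-or-revert behaviour of the transaction block can be
illustrated with a small stand-alone sketch. SketchTransaction is a
hypothetical name, and plain Hashes stand in for storage nodes; a real
device would rely on the engine's own transactions rather than copying
data like this.

```ruby
# Hypothetical block transaction over plain Hashes (not the library's
# Storage::Transaction). Changes stay if the block finishes normally and
# are rolled back if it raises.
class SketchTransaction
  def self.run(*stores)
    # Shallow snapshot of every store taking part in the transaction.
    # Values mutated in place (for example with <<) are not restored.
    snapshots = stores.map(&:dup)
    yield
  rescue StandardError
    stores.zip(snapshots) { |store, snapshot| store.replace(snapshot) }
    raise
  end
end

node_a = { "foo" => "bar" }
node_b = {}

# The block completes, so both writes are kept.
SketchTransaction.run(node_a, node_b) do
  node_a["foo"] = "baz"
  node_b["count"] = "1"
end

# The block raises, so both stores return to their previous contents.
begin
  SketchTransaction.run(node_a, node_b) do
    node_a["foo"] = "oops"
    node_b.clear
    raise "something went wrong"
  end
rescue RuntimeError
  # node_a and node_b have been restored at this point
end
```

Only the stores passed to run are protected here, which mirrors the
guarantee described in the comments above: nodes passed as arguments
are included, and anything beyond that depends on the device.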

For a long time I had debated publishing this project, because I wasn’t
sure how useful others would find it, and its development has been very
tumultuous (it’s been rewritten from scratch at least three times now).
But I do feel like it has gotten to the point where it can be considered
usable, although the current code base is in severe need of cleaning up
(there are several files of unfinished and broken code not related to
the storage engine in the repository, and two of the three included
storage drivers are broken as well). I did however successfully use the
code to write a simple web application for the game Cordial Minuet that
tracks profit data, and it has so far performed very well for that task.
In general, I think it would be most useful for applications that have
more complex data storage needs than a simple key/data store could
provide, but where using a database would add unnecessary overhead and
complexity. With additional development, I even think the library could
reach the point where it was preferable to a database in most use cases.
But I’m only one person, and I really have no idea if the need/demand
for something like this is there. So I decided to post the project, in
its current somewhat messy and unfinished state, and see what people
think of it and whether or not they think it is useful.

Here is a link to the main project git: https://github.com/An0Hit0/Node

And here is a link to the source code for the minuet web app (check out
the file lib/minuet/profit.rb for an example of using the HashIndex
class): https://github.com/An0Hit0/minuet

Any and all feedback is welcome.