Pre-allocate a large amount of memory?

I’ve created a small daemon that serves certain data very quickly to the
rest of our web application. It does so by pre-loading a lot of data from
our database into a dedicated in-memory structure. Boiled down, the
pre-loading procedure looks like this:

@data_hash = {}
DBI.connect(dsn, user, pass) do |dbh|
  sth = dbh.execute("some-sql-returning >300,000 rows")
  while row = sth.fetch_hash
    hash_key = ...              # key expression elided in the original post
    @data_hash[hash_key] ||= []
    @data_hash[hash_key] << row
  end
  sth.finish
end

It gives a @data_hash with a structure like this:

@data_hash = {
  'key1' => [row1, row2, row3, ...],
  'key2' => [row4, row5, ...],
  ...
}

I have a problem with the speed of the pre-loading, though. I am loading
~360,000 rows into memory (about 500 MB). The first 50% of the rows are
read and placed in the hash pretty quickly, but after that it slows down.

I suspect that Ruby’s dynamic memory allocation for the Hash is to
blame. Can I somehow pre-allocate 500 MB of memory for the hash? Or tweak
the way Ruby allocates memory?
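For illustration, the kind of tweak I am imagining would look something like this (a sketch only; that the slowdown comes from MRI’s garbage collector growing the heap, rather than from the Hash itself, is my assumption):

```ruby
# A sketch of the "tweak allocation" idea, assuming (unverified!) that the
# mid-load slowdown comes from MRI's garbage collector running ever more
# often as the heap grows, rather than from Hash resizing itself.
data_hash = Hash.new { |h, k| h[k] = [] }

GC.disable                       # no collections during the bulk load
begin
  10_000.times do |i|            # stand-in for the sth.fetch_hash loop
    key = "key#{i % 100}"
    data_hash[key] << { "id" => i }
  end
ensure
  GC.enable
  GC.start                       # one full collection once loading is done
end
```

GC.disable/GC.enable are standard Kernel methods, so at least this much would be cheap to try.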

Thanks in advance!

  • Carsten

You could probably just use memcached. It’s optimized for efficient
memory allocation of hash-based data stores.

Jason S. wrote:

You could probably just use memcached. It’s optimized for efficient
memory allocation of hash based data stores.

I thought of that too. But doesn’t memcached expire its data after a
while? Or is it possible to disable that?

  • Carsten

On Sep 14, 2009, at 8:32 AM, Carsten G. wrote:

Jason S. wrote:

You could probably just use memcached. It’s optimized for efficient
memory allocation of hash based data stores.

I thought of that too. But doesn’t memcached expire its data after a
while? Or is it possible to disable that?

You might consider using Redis:

http://code.google.com/p/redis/

It definitely doesn’t have to expire keys. That’s an optional feature.

I’ll be posting a series of articles to my blog this week about using
it from Ruby, one each morning:

http://blog.grayproductions.net/articles/using_keyvalue_stores_from_ruby

James Edward G. II

2009/9/14 Carsten G. [email protected]:

‘key2’ => [row4, row5, …]

}

I have a problem with the speed of the pre-loading, though. I am loading
~360,000 rows into memory (about 500 MB). The first 50% of the rows are
read and placed in the hash pretty quickly, but after that it slows down.

I suspect that Ruby’s dynamic memory allocation for the Hash is to
blame. Can I somehow pre-allocate 500 MB of memory for the hash? Or tweak
the way Ruby allocates memory?

Maybe it’s not such a good idea to read everything into memory. After
all, retrieving large volumes of data from disk is where RDBMSs
shine. What about querying individual records and storing them in an
LRU store? You’ll find such a beast here, which can be used like a
Hash:

http://github.com/rklemme/muppet-laboratories/tree/9eb71ab7bf7d5db7396905b500bd3ac7cc153e0a/lib

Then you don’t burn large amounts of memory but retain the caching
effect.
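For anyone reading along, the general shape of such a beast might look like this simplified sketch (not the linked implementation; the loader block stands in for a single-record database query):

```ruby
# Minimal LRU cache with a Hash-like reader. On a miss it loads the value
# through the block given at construction (standing in for a single-record
# SQL query) and evicts the least recently used entry beyond the capacity.
# Relies on Ruby hashes preserving insertion order.
class LruCache
  def initialize(capacity, &loader)
    @capacity = capacity
    @loader   = loader
    @store    = {}
  end

  def [](key)
    if @store.key?(key)
      @store[key] = @store.delete(key)    # re-insert: now most recently used
    else
      @store[key] = @loader.call(key)     # miss: fetch from the database
      @store.delete(@store.first[0]) if @store.size > @capacity
    end
    @store[key]
  end
end

cache = LruCache.new(2) { |k| "row-for-#{k}" }
cache["a"]; cache["b"]; cache["a"]; cache["c"]   # evicts "b", keeps "a"
```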

Cheers

robert

On Mon, Sep 14, 2009 at 10:05 AM, Robert K.
[email protected] wrote:

Maybe it’s not such a good idea to read everything into memory. After
all, retrieving large volumes of data from disk is where RDBMSs
shine. What about querying individual records and storing them in an
LRU store? You’ll find such a beast here, which can be used like a
Hash:

http://github.com/rklemme/muppet-laboratories/tree/9eb71ab7bf7d5db7396905b500bd3ac7cc153e0a/lib

Then you don’t burn large amounts of memory but retain the caching effect.

The super easy way to deal with this is to do what James suggested:
use Redis.

It’s absurdly easy to use, and quite fast. You ARE limited by the
amount of data you can put into RAM, but you are doing that anyway, so
I assume that’s a reasonable limitation.

Using some sort of an LRU cache may work, too. It depends on whether
the query pattern is random access, or whether it is clustered
around certain sets of data out of the whole, though.

Kirk H.

Robert K. wrote:

Maybe it’s not such a good idea to read everything in memory. After
all, retrieving large volumes of data from disk is where RDBMS’s
shine.

My reason for doing this is simply to avoid multiple SQL queries over a
recursive, self-referencing data structure. Usually this can be avoided by
querying the entire subset of data needed and manipulating it in memory.
Unfortunately, the data in question does not allow me to just query the
subset of data.

I will definitely have a look at Redis. Seems promising.

Thanks.

  • Carsten

On 14.09.2009 19:37, Carsten G. wrote:

Robert K. wrote:

Maybe it’s not such a good idea to read everything in memory. After
all, retrieving large volumes of data from disk is where RDBMS’s
shine.

My reason for doing this is simply to avoid multiple SQL queries over a
recursive, self-referencing data structure. Usually this can be avoided by
querying the entire subset of data needed and manipulating it in memory.
Unfortunately, the data in question does not allow me to just query the
subset of data.

Is your structure strictly hierarchical (i.e. a tree) or do you need to
query part of a graph with cycles? If it is strictly hierarchical there
is a solution that works for all RDBMS and in the latter case there are
solutions for some RDBMS.

Kind regards

robert

On Mon, Sep 14, 2009 at 12:37 PM, Carsten G. [email protected]
wrote:

subset of data.

I will definitely have a look at Redis. Seems promising.

Thanks.

I had a situation like that, where I ended up using one of the nested
set implementations (I think it was Awesomer Nested Set) instead of a
tree hierarchy. I ended up having to write my own (very ugly) iterating
interpreter which analyzed each node to figure out its recursive
context. It took something like 3 or 4 blocks for the different places
you could inject code based on a recursive interpretation, but it did
get me down to a single database call, and it did only require a single
pass through the data. At one point (debugging, ahem) I was very
strongly considering going with the approach you are taking. Good luck
with it, and let me know how it turns out; I might opt for that approach
instead next time.

Robert K. wrote:

Is your structure strictly hierarchical (i.e. a tree) or do you need to
query part of a graph with cycles? If it is strictly hierarchical there
is a solution that works for all RDBMS and in the latter case there are
solutions for some RDBMS.

Unfortunately it is not strictly hierarchical. It consists of relations
between companies and companies, and between companies and persons,
representing shareholders, parent/subsidiary companies, etc.

My job is, given a certain company X, to extract all its owners
and, recursively, their owners, etc., until I reach the “top”. Likewise
the other way: to extract all companies owned by company X and,
recursively, all companies owned by them, etc.

Conceptually, the table (actually a view) I am querying holds the data:

CompanyA, Direction, CompanyB, Share
"FooCorp", "owns", "BarCorp", "10%"
"BarCorp", "ownedby", "FooCorp", "10%"
"BarCorp", "owns", "BazCorp", "45%"
"BazCorp", "ownedby", "BarCorp", "45%"
"QweCorp", "owns", "RteCorp", "20%"
"RteCorp", "ownedby", "QweCorp", "20%"
etc.

The table is not of my doing. It is very difficult (at least for me) to
devise a way to query only the necessary rows in the table without
resorting to recursive calls.

In the example above: If given “FooCorp”, all but the last two rows
should be extracted and used in the result. How would you go about doing
that?

I haven’t found a solution yet. This is why I’ve gone and made a
service that holds all this data in memory (in a hash) to speed things
up.

BTW: Thanks for all your great suggestions so far. :-)

  • Carsten

On Mon, Sep 14, 2009 at 11:29 PM, Carsten G. [email protected]
wrote:

“BarCorp”, “owns”, “BazCorp”, “45%”
should be extracted and used in the result. How would you go about doing

As far as retrieving them all in a single query, nested sets would do
this. There are other hits that you take from using them, though.
Changing the hierarchy, for example, is very expensive (i.e. company A
sells company B or purchases company C), and treating it as a recursive
structure after you have retrieved it is quite painful. If you do opt
for this solution, I can send you my code to give recursive
functionality through iteration. I did change some lines in the plugin
itself to get more flexible use, but I think I marked them all.

http://github.com/collectiveidea/awesome_nested_set

2009/9/15 Carsten G. [email protected]:

Robert K. wrote:

Is your structure strictly hierarchical (i.e. a tree) or do you need to
query part of a graph with cycles? If it is strictly hierarchical there
is a solution that works for all RDBMS and in the latter case there are
solutions for some RDBMS.

Unfortunately it is not strictly hierarchical. It consists of relations
between companies and companies, and between companies and persons,
representing shareholders, parent/subsidiary companies, etc.

So you do actually allow for loops, i.e. company A owns 10% of company
B which owns 10% of company A.

“BarCorp”, “owns”, “BazCorp”, “45%”
“BazCorp”, “ownedby”, “BarCorp”, “45%”
“QweCorp”, “owns”, “RteCorp”, “20%”
“RteCorp”, “ownedby”, “QweCorp”, “20%”
etc.

The table is not of my doing. It is very difficult (at least for me) to
devise a way to query only the necessary rows in the table without
resorting to recursive calls.

If I understand that table design properly, it is awful, because the
semantics of columns one, three, and four change based on the content of
column two. The usual way would be to model this with fixed
semantics, i.e. only have one direction of the ownership relation in the
table. In your case you will probably have to do a normalization step
by defining a view on this table with a UNION ALL, or use a WITH clause
in the query, to ensure the query can be built in a reasonable way.
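The normalization step described above might look something like this sketch (the table and column names — ownership_raw, owner, owned, share — are hypothetical):

```ruby
# Hypothetical schema, for illustration only: the base table stores each
# relation once, and the view presents it in both directions via UNION ALL.
NORMALIZED_VIEW_SQL = <<~SQL
  CREATE VIEW ownership_both_ways AS
  SELECT owner AS company_a, 'owns'    AS direction, owned AS company_b, share
    FROM ownership_raw
  UNION ALL
  SELECT owned AS company_a, 'ownedby' AS direction, owner AS company_b, share
    FROM ownership_raw
SQL
```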

In the example above: If given “FooCorp”, all but the last two rows
should be extracted and used in the result. How would you go about doing
that?

There are features in modern RDBMSs which allow for recursive querying.
In PostgreSQL and Microsoft SQL Server you can use a WITH expression:
http://www.postgresql.org/docs/8.4/static/queries-with.html
http://msdn.microsoft.com/en-us/library/ms175972(SQL.90).aspx

In Oracle there is CONNECT BY
http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/statements_10002.htm#i2066102

The downside is that recursive queries tend to have a performance hit
as the DB engine cannot fetch everything via a single index and needs
to look at results so far to know what other records it has to
retrieve.
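On SQL Server, such a WITH query might look roughly like this sketch (again with hypothetical table and column names; the depth column is one way to guard against the ownership loops discussed here):

```ruby
# Hypothetical schema, for illustration: walk up the ownership chain from a
# start company with a recursive CTE (SQL Server dialect). The depth guard
# stops runaway recursion when the ownership graph contains cycles.
RECURSIVE_OWNERS_SQL = <<~SQL
  WITH owners (company, owner, share, depth) AS (
    SELECT owned, owner, share, 1
      FROM ownership_raw
     WHERE owned = 'FooCorp'
    UNION ALL
    SELECT o.owned, o.owner, o.share, w.depth + 1
      FROM ownership_raw o
      JOIN owners w ON o.owned = w.owner
     WHERE w.depth < 10
  )
  SELECT * FROM owners
SQL
```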

The nested set model (Josh mentioned it as well) might help although I
haven’t thought through all implications in your case:

http://dev.mysql.com/tech-resources/articles/hierarchical-data.html
http://www.codeproject.com/KB/database/nestedsets.aspx

I haven’t found a solution yet. This is why I’ve gone and made a
service that holds all this data in memory (in a hash) to speed things
up.

You still have the issue that you maintain redundant data and must
find a way to invalidate your cache when the base data changes. Since
a complete reload of the cache is expensive (as you have experienced),
this will get complicated soon, because in order to find the data that
must be refreshed you need queries similar to those that answer the
questions you posed above.

BTW: Thanks for all your great suggestions so far. :-)

You’re welcome!

Kind regards

robert

My 2 cents:

I have used this kind of caching method in the past, and it works really
well for me. I was able to speed up a
database-with-very-heavy-nested-data-to-CSV script by more than 30x, and
this despite the fact that I had analyzed all queries and had all the
necessary indexes in place. So I guess at least it works if the amount
of data is not very big. I guess (and I can be severely wrong) that this
limit can also be overcome by using something like the localmemcache
gem, which can be used to dump the data at an arbitrary memory location
not handled by the Ruby process.

Piyush

On Tue, Sep 15, 2009 at 6:14 PM, Robert K. wrote:

So you do actually allow for loops, i.e. company A owns 10% of company
B which owns 10% of company A.

Yes that is allowed (we define some rules in the extraction code to
avoid infinite recursion).
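Boiled down, the guard works something like this sketch (simplified, with made-up sample rows in the view’s layout; the seen hash is what stops the walk from bouncing forever between two companies, since every “owns” row has a matching “ownedby” row pointing straight back):

```ruby
# Simplified sketch of the cycle guard: breadth-first walk over the view's
# rows from a start company. The `seen` hash stops infinite recursion, since
# every "owns" row has a matching "ownedby" row pointing straight back.
ROWS = [
  ["FooCorp", "owns",    "BarCorp", "10%"],
  ["BarCorp", "ownedby", "FooCorp", "10%"],
  ["BarCorp", "owns",    "BazCorp", "45%"],
  ["BazCorp", "ownedby", "BarCorp", "45%"],
  ["QweCorp", "owns",    "RteCorp", "20%"],
  ["RteCorp", "ownedby", "QweCorp", "20%"],
]

def related_rows(rows, start)
  seen   = {}
  queue  = [start]
  result = []
  until queue.empty?
    company = queue.shift
    next if seen[company]
    seen[company] = true
    rows.each do |a, dir, b, share|
      next unless a == company
      result << [a, dir, b, share]
      queue << b unless seen[b]
    end
  end
  result
end

related_rows(ROWS, "FooCorp").size   # => 4: Qwe/Rte are unreachable
```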

If I understand that table design properly it is awful because
semantics of columns one, three and four change based on content of
column two. The usual way would be to model this with fixed
semantics, i.e. only have one direction of ownership relation in the
table. In your case you will probably have to do a normalization step
by defining a view on this table with a UNION ALL or use a WITH clause
in the query to ensure the query can be built in a reasonable way.

Actually, the above structure is a view, which adds this direction. The
real data table only holds one record per relation, that is:

"BazCorp", "ownedby", "BarCorp", "45%"
"RteCorp", "ownedby", "QweCorp", "20%"

There are features in modern RDBMS which allow for recursive querying.
In PostgreSQL and Microsoft SQL Server you can use WITH expression:
http://www.postgresql.org/docs/8.4/static/queries-with.html
http://msdn.microsoft.com/en-us/library/ms175972(SQL.90).aspx

I am using Microsoft SQL Server, but I didn’t know about recursive
querying.

The downside is that recursive queries tend to have a performance hit

Perhaps the best way will be to go with my current setup (i.e. loading
the entire data set), but to use recursive querying to reload part of
the data when relations are changed.

The nested set model (Josh mentioned it as well) might help although I
haven’t thought through all implications in your case:

I would rather not begin to alter my data structure.

You still have the issue that you maintain redundant data and must
find a way to invalidate your cache when the base data changes.

I think I may have found a way to calculate exactly how many
records I need to reload if the relation between two companies is
added/changed/deleted. And SQL Server’s WITH option might help me there.

I will have a look at that now. Thanks a bunch for your input - all of
you.

I will post my results, when I get there.

  • Carsten

2009/9/15 Carsten G. [email protected]:

Robert K. wrote:

So you do actually allow for loops, i.e. company A owns 10% of company
B which owns 10% of company A.

Yes that is allowed (we define some rules in the extraction code to
avoid infinite recursion).

Good.

“BazCorp”, “ownedby”, “BarCorp”, “45%”
“RteCorp”, “ownedby”, “QweCorp”, “20%”

Oh.

Perhaps the best way will be going with my current setup (i.e. loading
the entire data). But use recursive querying to reload part of the data
when relations are changed.

Before you do that: I would first try out the query and see how well
it performs with your data. If it doesn’t perform well, additional
indexing may help. Only if that also fails would I switch to a caching
approach.

The reason: I would try to keep the architecture simple and avoid
redundant data storage, as this tends to cause issues. If you can get
the data you need just in time via a reasonably efficient query, then
this is much preferable to your big caching-process approach.

The nested set model (Josh mentioned it as well) might help although I
haven’t thought through all implications in your case:

I would rather not begin to alter my data structure.

Why?

I will post my results, when I get there.

We’ll be waiting eagerly. :-)

Cheers

robert
