What DB to use for lots of categorized urls and domains? noSQL? SQL?

elico · January 10, 2013, 9:37pm

I am wondering about what DB to choose for this goal.

I have a working filtering solution that uses Tokyo-Cabinet for domains
and urls with REDIS which is really good in a sense of speed and space.

I want to extend the filtering DB and use some new features which will
require me to use a dedicated DB for the black\white lists.

I Have tried mysql before but it takes so much space compared to
Tokyo-Cabinet and it’s an issue in my case.
The difference is about 100%(50MB vs 100MB).
So I am wondering about other DB solutions that will not use too much
storage space will be fast and can scale if needed.

I have heard the names Cassandra mongo and some others but I never tried
them but used Oracel, mySQL, BDB, Postgresql.

The DB search for domain is for either a prefix or a full match.

If you do have experience with any DB the uses key-value pair I will be
happy to hear about it.

Thanks,
Eliezer

elico · January 11, 2013, 2:11pm

On Thu, Jan 10, 2013 at 9:36 PM, Eliezer C. [email protected]
wrote:

So I am wondering about other DB solutions that will not use too much
storage space will be fast and can scale if needed.

Scale in what direction? Database size, requests per second, multiple
concurrent clients, multiple physical nodes the DB is stored on…?

Kind regards

robert

elico · January 14, 2013, 12:34pm

On 1/11/2013 3:10 PM, Robert K. wrote:

robert

Thanks Robert,

The scale is in couple directions:

multiple physical nodes the main DB stored on.
Database size
requests per second
master and secondary updates\replication

For now one machine will host the DB while it gets updates from couple
sources such as human and other auto-testing tools.
This will be a dedicated DB machine while there are others servers which
gets updates from the master DB when needed.
The problem is that the updates are live and should be replicated with
the smallest delay possible.

Thanks,
Eliezer

elico · January 14, 2013, 11:43pm

On Mon, Jan 14, 2013 at 12:33 PM, Eliezer C.
[email protected] wrote:

concurrent clients, multiple physical nodes the DB is stored on…?
The scale is in couple directions:

multiple physical nodes the main DB stored on.

Database size

requests per second

master and secondary updates\replication

Well, less dimensions would be too easy, wouldn’t it?

For now one machine will host the DB while it gets updates from couple
sources such as human and other auto-testing tools.
This will be a dedicated DB machine while there are others servers which
gets updates from the master DB when needed.
The problem is that the updates are live and should be replicated with the
smallest delay possible.

I don’t have experience with CouchDB performance but given your list I
would include it in your benchmarking activities.

Kind regards

robert

elico · January 15, 2013, 2:24pm

On 1/15/2013 12:41 AM, Robert K. wrote:

Scale in what direction? Database size, requests per second, multiple
For now one machine will host the DB while it gets updates from couple

robert

Thanks robert,

I will look on CouchDB.
It’s basically http based which is a very nice simple and clean
interface.

I know I can start looking up using a search engine for results but
since an experienced user\developer can say more many times I hope to
get some directions about CouchDB.

Thanks,
Eliezer

elico · January 15, 2013, 3:22pm

On 2013-01-14, at 6:33 AM, Eliezer C. [email protected]
wrote:

robert

Thanks Robert,

Hi,

Here are a bunch of questions to ask yourself that cover off things that
I’ve found helpful to know.

You might think about why you need to move from your combination of
TokyoCabinet and Redis. In particular, why not Redis. It’s a little more
clear why you’d move from TokyoCabinet, I’ve used it happily for years
myself, so I can imagine a bunch of reasons.

The scale is in couple directions:

multiple physical nodes the main DB stored on.

Is this for reliability or performance? This sounds like a solution not
a requirement.

Database size

How big do you think it’ll be?

requests per second

What kind of request rate are you thinking?

Do you care about latency? (you should) Throughput and latency are
pretty much independent variables when it comes to databases.

master and secondary updates\replication

Again, this sounds like a solution not a requirement. What’s the issue
that makes you say this?

What is your read/write ratio? What is your write rate? Are you updating
or writing new data, and what’s the ratio of update to write? Do you
need secondary indexes? How many, what kind?

If you write to the master then replicate there’ll be a time period
where the various nodes will provide different results. Can your
application tolerate this? or do you need some kind of stronger
consistency constraint?

Are your updates/writes exposing you to consistency issues? (i.e. do you
need transactions?) If you update (or even write) multiple records, it’s
possible that the updates arrive in an essentially random order to the
replicas, and possibly in a different order to the different replicas.

For now one machine will host the DB while it gets updates from couple sources
such as human and other auto-testing tools.
This will be a dedicated DB machine while there are others servers which gets
updates from the master DB when needed.
The problem is that the updates are live and should be replicated with the
smallest delay possible.

What does “when needed” mean given that the updates should be “as soon
as possible”? I’m thinking that this master/slave setup you’re thinking
of is lifted from how you’d do it with TokyoCabinet or Redis. Things
like Cassandra or Riak or HBase don’t do it that way.

Are you ever going to have to scan your whole database? How often?

Cheers,
Bob

elico · January 15, 2013, 4:14pm

Thanks Bob,

To start from zero again.
The DB is a url\domains filtering black\white lists.
which means the DB is being scanned frequently mostly for a full match
of a domain or a url\path.

The DB now is pretty static and I am adding a feature which will make it
more dynamic as it is now.
Now I use a basic TC DB and redis for caching of results.
(the reason i’m not using only redis is size)

In the current state the DB is static for at least 12 hours which means
I can trigger an update based on a simple “flag” for that.

I have one storage which holds all the basic DB and then the nodes copy
the complete DB from there as files.

The DB will be separated into different parts:

queue DB based on url requests live from the proxy.
master DB which will get updates based on the queue DB.
nodes DB which will be static and will get updates from the master DB
based on changes.

The queue DB will be very heavy duty(more then 900 hits per sec and
more).

The master DB will be updated by couple sources in a rate of about 10k
updates in rush hour.

The nodes must be updated from the master with a maximum latency of 2
secs since filtering is a big issue…

The nodes DB is being requested in the rate of the incoming requests
which is more then 1k requests per sec.

The idea is to make updates possible in a more dynamic form then static
files.

On the master I must apply some consistency which will apply also to the
nodes updates since Content filtering is a pretty important issue.

Scan the whole DB?
it’s a search DB… you look for a match url or a match domain etc…
from MYSQL point of view the table should have index and two columns one
for the url\domain and the second for the rating.

just as a note:
I have used mySQL for the static DB but it was slower(1000%+ slower)
then TC and also the size of the mySQL DB was +100%.
I have tried tweaking mySQL in couple aspects which wasn’t enough.

I may have did something wrong about it and I am open minded about any
possible solution which can work nice in this setup.

One of the problems is that the DB size will grow pretty fast when
starting the system.

I can start simulating to get a more specific numbers.
Just imaging more then 1k requests per sec on a http proxy for a lot of
users… double that in about 10 and you will get a basic view of what to
expect.

Take in account that I will release the code of most of the systems I
was working on such as ICAP server and some other related code.

Eliezer