I am wondering about what DB to choose for this goal. I have a working filtering solution that uses Tokyo-Cabinet for domains and urls with REDIS which is really good in a sense of speed and space. I want to extend the filtering DB and use some new features which will require me to use a dedicated DB for the black\white lists. I Have tried mysql before but it takes so much space compared to Tokyo-Cabinet and it's an issue in my case. The difference is about 100%(50MB vs 100MB). So I am wondering about other DB solutions that will not use too much storage space will be fast and can scale if needed. I have heard the names Cassandra mongo and some others but I never tried them but used Oracel, mySQL, BDB, Postgresql. The DB search for domain is for either a prefix or a full match. If you do have experience with any DB the uses key-value pair I will be happy to hear about it. Thanks, Eliezer
on 2013-01-10 21:37
on 2013-01-11 14:11
On Thu, Jan 10, 2013 at 9:36 PM, Eliezer Croitoru <eliezer@ngtech.co.il> wrote: > So I am wondering about other DB solutions that will not use too much > storage space will be fast and can scale if needed. Scale in what direction? Database size, requests per second, multiple concurrent clients, multiple physical nodes the DB is stored on...? Kind regards robert
on 2013-01-14 12:34
On 1/11/2013 3:10 PM, Robert Klemme wrote: > robert > Thanks Robert, The scale is in couple directions: - multiple physical nodes the main DB stored on. - Database size - requests per second - master and secondary updates\replication For now one machine will host the DB while it gets updates from couple sources such as human and other auto-testing tools. This will be a dedicated DB machine while there are others servers which gets updates from the master DB when needed. The problem is that the updates are live and should be replicated with the smallest delay possible. Thanks, Eliezer
on 2013-01-14 23:43
On Mon, Jan 14, 2013 at 12:33 PM, Eliezer Croitoru <eliezer@ngtech.co.il> wrote: >> concurrent clients, multiple physical nodes the DB is stored on...? > The scale is in couple directions: > - multiple physical nodes the main DB stored on. > - Database size > - requests per second > - master and secondary updates\replication Well, less dimensions would be too easy, wouldn't it? :-) > For now one machine will host the DB while it gets updates from couple > sources such as human and other auto-testing tools. > This will be a dedicated DB machine while there are others servers which > gets updates from the master DB when needed. > The problem is that the updates are live and should be replicated with the > smallest delay possible. I don't have experience with CouchDB performance but given your list I would include it in your benchmarking activities. Kind regards robert
on 2013-01-15 14:24
On 1/15/2013 12:41 AM, Robert Klemme wrote: >>> Scale in what direction? Database size, requests per second, multiple >> For now one machine will host the DB while it gets updates from couple > > robert > > Thanks robert, I will look on CouchDB. It's basically http based which is a very nice simple and clean interface. I know I can start looking up using a search engine for results but since an experienced user\developer can say more many times I hope to get some directions about CouchDB. Thanks, Eliezer
on 2013-01-15 15:22
On 2013-01-14, at 6:33 AM, Eliezer Croitoru <eliezer@ngtech.co.il> wrote: >> >> robert >> > Thanks Robert, > Hi, Here are a bunch of questions to ask yourself that cover off things that I've found helpful to know. You might think about why you need to move from your combination of TokyoCabinet and Redis. In particular, why not Redis. It's a little more clear why you'd move from TokyoCabinet, I've used it happily for years myself, so I can imagine a bunch of reasons. > The scale is in couple directions: > - multiple physical nodes the main DB stored on. Is this for reliability or performance? This sounds like a solution not a requirement. > - Database size How big do you think it'll be? > - requests per second What kind of request rate are you thinking? Do you care about latency? (you should) Throughput and latency are pretty much independent variables when it comes to databases. > - master and secondary updates\replication Again, this sounds like a solution not a requirement. What's the issue that makes you say this? What is your read/write ratio? What is your write rate? Are you updating or writing new data, and what's the ratio of update to write? Do you need secondary indexes? How many, what kind? If you write to the master then replicate there'll be a time period where the various nodes will provide different results. Can your application tolerate this? or do you need some kind of stronger consistency constraint? Are your updates/writes exposing you to consistency issues? (i.e. do you need transactions?) If you update (or even write) multiple records, it's possible that the updates arrive in an essentially random order to the replicas, and possibly in a different order to the different replicas. > > For now one machine will host the DB while it gets updates from couple sources such as human and other auto-testing tools. > This will be a dedicated DB machine while there are others servers which gets updates from the master DB when needed. > The problem is that the updates are live and should be replicated with the smallest delay possible. What does "when needed" mean given that the updates should be "as soon as possible"? I'm thinking that this master/slave setup you're thinking of is lifted from how you'd do it with TokyoCabinet or Redis. Things like Cassandra or Riak or HBase don't do it that way. Are you ever going to have to scan your whole database? How often? Cheers, Bob
on 2013-01-15 16:14
Thanks Bob, To start from zero again. The DB is a url\domains filtering black\white lists. which means the DB is being scanned frequently mostly for a full match of a domain or a url\path. The DB now is pretty static and I am adding a feature which will make it more dynamic as it is now. Now I use a basic TC DB and redis for caching of results. (the reason i'm not using only redis is size) In the current state the DB is static for at least 12 hours which means I can trigger an update based on a simple "flag" for that. I have one storage which holds all the basic DB and then the nodes copy the complete DB from there as files. The DB will be separated into different parts: - queue DB based on url requests live from the proxy. - master DB which will get updates based on the queue DB. - nodes DB which will be static and will get updates from the master DB based on changes. The queue DB will be very heavy duty(more then 900 hits per sec and more). The master DB will be updated by couple sources in a rate of about 10k updates in rush hour. The nodes must be updated from the master with a maximum latency of 2 secs since filtering is a big issue.. The nodes DB is being requested in the rate of the incoming requests which is more then 1k requests per sec. The idea is to make updates possible in a more dynamic form then static files. On the master I must apply some consistency which will apply also to the nodes updates since Content filtering is a pretty important issue. Scan the whole DB? it's a search DB... you look for a match url or a match domain etc.. from MYSQL point of view the table should have index and two columns one for the url\domain and the second for the rating. just as a note: I have used mySQL for the static DB but it was slower(1000%+ slower) then TC and also the size of the mySQL DB was +100%. I have tried tweaking mySQL in couple aspects which wasn't enough. I may have did something wrong about it and I am open minded about any possible solution which can work nice in this setup. One of the problems is that the DB size will grow pretty fast when starting the system. I can start simulating to get a more specific numbers. Just imaging more then 1k requests per sec on a http proxy for a lot of users.. double that in about 10 and you will get a basic view of what to expect. Take in account that I will release the code of most of the systems I was working on such as ICAP server and some other related code. Eliezer
Please log in before posting. Registration is free and takes only a minute.
Existing account
(Switch to SSL-encrypted connection)
NEW: Do you have a Google/GoogleMail or Yahoo account? No registration required!
Log in with Google account | Log in with Yahoo account
Log in with Google account | Log in with Yahoo account
No account? Register here.