The best way to match these domains

Thanks in advance. I need a bit of help to break the ice that my head is stuck in.

I am working on an ICAP server and I am building some ACLs for content
filtering.

I have a list of domains and I want to apply an ACL such that ".example.com"
will match example.com and every domain under it.
But the problem is that domain names are written with their labels in
reverse order (most specific first), so a plain prefix match does not work.
I was wondering what the best (or at least a better) way is to match a
domain against a domain list.
I have an SQL (PostgreSQL/MySQL) DB that contains the list of domains.
So I was thinking of a "divide and conquer" approach: disassemble the
requested domain, e.g. "subporndomain.example.com", into
["com", "example", "subporndomain"], then fetch from the DB all the
domains that start with "example.com", and then check each of them
(if any records exist) against the requested domain.

I was also thinking of using String#start_with? in reverse, such as
"moc.elpmaxe.niamodnropbus".start_with?("moc.elpmaxe", …)
with the list of domains as prefixes (start_with? takes the prefixes as
separate arguments, not an array).
But there is one problem: I can't match all the domains as prefixes,
because some of them are not prefixes but exact matches.
So I would need some code like this:

dom = "subdomain.example.com"
domacl = ".example.com"
domacl1 = "example.com"

def match?(domacl, dom)
  if domacl.start_with?(".")
    return dom.reverse.start_with?(domacl.reverse.chop)
  else
    return true if dom.eql?(domacl)
    return false
  end
end

match?(domacl, dom)
# => true
match?(domacl1, dom)
# => false

so it works…

my goal is maximum efficiency.

Thanks,
Eliezer

On Thu, Jul 5, 2012 at 4:13 AM, Eliezer C. [email protected]
wrote:

Thanks in advance. I need a bit of help to break the ice that my head is stuck in.

I am working on an ICAP server and I am building some ACLs for content
filtering.

Ah, interesting!

I have a list of domains and I want to apply an ACL such that ".example.com"
will match example.com and every domain under it.
But the problem is that domain names are written with their labels in
reverse order.
I was wondering what the best (or at least a better) way is to match a
domain against a domain list.

If in memory (which does not seem to be the case here) I would create
a forest for matching, with TLDs at the root level:

com
+- example
+- google

I have an SQL (PostgreSQL/MySQL) DB that contains the list of domains.
So I was thinking of a "divide and conquer" approach: disassemble the
requested domain, e.g. "subporndomain.example.com", into
["com", "example", "subporndomain"], then fetch from the DB all the
domains that start with "example.com", and then check each of them (if any
records exist) against the requested domain.

That won't work: if you have a domain "foo.example.com", using
"example.com" as a prefix won't retrieve it.

Since you have a relational database, I would store domains in reverse
order ("example.com" → "com.example"), create an index on that column
and use the LIKE operator, for example:

-- just a matching test (FROM clause added here; the table name "domains" is assumed)
SELECT MIN(rev_domain)
FROM domains
WHERE rev_domain = 'com.example'
   OR rev_domain LIKE 'com.example.%';

SELECT rev_domain
FROM domains
WHERE rev_domain = 'com.example'
   OR rev_domain LIKE 'com.example.%'
LIMIT 1;

-- all matching entries
SELECT domain
FROM domains
WHERE rev_domain = 'com.example'
   OR rev_domain LIKE 'com.example.%';

You might be able to write a procedure to convert regular
representation to reverse representation, make that a calculated
column using the procedure and create an index on that calculated
column. Alternatively calculate the value via a trigger. Then your
insertion and update logic can stay as is.
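
For illustration, the conversion itself is tiny in Ruby; a minimal sketch (the helper name rev_domain is just an assumption for this example):

def rev_domain(domain)
  # "foo.example.com" -> "com.example.foo"
  domain.split(".").reverse.join(".")
end

rev_domain("example.com")      # => "com.example"
rev_domain("foo.example.com")  # => "com.example.foo"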

domacl1 = "example.com"

def match?(domacl, dom)
  if domacl.start_with?(".")
    return dom.reverse.start_with?(domacl.reverse.chop)
  else
    return true if dom.eql?(domacl)
    return false
  end
end

You can make that simpler and I guess also a tad more efficient:

def match?(domacl, dom)
  dom == domacl ||
    domacl[0] == '.' && dom.reverse.start_with?(domacl.reverse.chop)
end

Note: this is not exactly identical from a logical point of view - I
just assume that you do not have a dom which starts with “.”.
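
For example, with the sample values from the first post:

match?(".example.com", "subdomain.example.com")  # => true
match?("example.com",  "subdomain.example.com")  # => false
match?(".example.com", "example.com")            # => true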

so it works…

:-)

my goal is maximum efficiency.

Yes, but you should not lose sight of functionality.

Kind regards

robert

On 7/5/2012 10:03 AM, Robert K. wrote:

On Thu, Jul 5, 2012 at 4:13 AM, Eliezer C. [email protected] wrote:

Thanks in advance. I need a bit of help to break the ice that my head is stuck in.

I am working on an ICAP server and I am building some ACLs for content
filtering.

Ah, interesting!

I have used GreasySpoon as an ICAP server, but it's too much for my needs.
It takes a lot of memory and CPU for nothing, so I started writing my own
in Ruby, and then I found out that I can add some nice features to it.

+- example
+- google

I will use MySQL/PostgreSQL, so it will be MySQL + memcached or PostgreSQL
with shm.
I will be more than happy to hear how to implement such a forest, just in
case I need it later.
I am thinking of loading some, if not all, ACLs into memory, and in that
case I will need it.

Since you have a relational database, I would store domains in reverse
OR rev_domain LIKE 'com.example.%'
column using the procedure and create an index on that calculated
column. Alternatively calculate the value via a trigger. Then your
insertion and update logic can stay as is.

I was thinking about the same idea, but it would be much simpler for me
to store the domain fully reversed, such as "moc.elpmaxe".
I have two objectives I want to achieve:

  1. a dstdomain ACL like in squid, for simple allow/deny.
  2. a dstdomain ACL from the squidGuard blacklists, to block porn, spyware
    and others based on category.

The blacklists are updated via a text file with exactly one domain or
domain wildcard per line, and I will use it as is.
So I will just use:
LOAD DATA INFILE '/tmp/porndoms.txt'
INTO TABLE porn (@var1)
SET dom = REVERSE(@var1);

This is not supposed to be a "readable" field and it's 30MB+ in size, so
I don't really care how it's ordered in the table.
The ACLs that the admin writes in "acltable" are another story, because
they are usually very short and must be readable for the admin as a human.
The options for the admin are either to create a DB front end
(web/CLI/other) or to store two fields, dom and reversedom, with an index
only on reversedom.

domacl1 = "example.com"
You can make that simpler and I guess also a tad more efficient:

def match?(domacl, dom)
  dom == domacl ||
    domacl[0] == '.' && dom.reverse.start_with?(domacl.reverse.chop)
end

Note: this is not exactly identical from a logical point of view - I
just assume that you do not have a dom which starts with “.”.

I heard this podcast about programming and the guy said "it's better to
have code that works and then improve it than not to have code at all".

As I will use reversed domains stored in almost every part of the code,
I will make it:

# @param domacl - reversed domain ACL
# @param dom    - reversed domain to check
def match?(domacl, dom)
  dom == domacl ||
    domacl.end_with?(".") && dom.start_with?(domacl.chop)
end
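
For reference, a quick check of how this behaves with reversed inputs (the literal values are only illustrations of the reversed form):

match?("moc.elpmaxe.", "moc.elpmaxe.niamodbus")  # => true  (".example.com" matches the subdomain)
match?("moc.elpmaxe",  "moc.elpmaxe.niamodbus")  # => false ("example.com" is an exact match only)
match?("moc.elpmaxe",  "moc.elpmaxe")            # => true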

robert

Thanks,
Eliezer

On Thu, Jul 5, 2012 at 10:40 PM, Eliezer C. [email protected]
wrote:

Ah, interesting!

I have used GreasySpoon as an ICAP server, but it's too much for my needs.
It takes a lot of memory and CPU for nothing, so I started writing my own
in Ruby, and then I found out that I can add some nice features to it.

That sounds fun!

com
+- example
+- google

I will use MySQL/PostgreSQL, so it will be MySQL + memcached or PostgreSQL
with shm.
I will be more than happy to hear how to implement such a forest, just in
case I need it later.

Well, basically you need a structure of nested Hashes. Matching would
start at the TLD and descend to the most specific domain.
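
A minimal sketch of what such a forest could look like (the helper names and the wildcard convention here are assumptions, not something from this thread):

def add_domain(forest, acl)
  # a leading "." means "example.com and all of its subdomains"
  wildcard = acl.start_with?(".")
  labels = acl.sub(/\A\./, "").split(".").reverse  # "foo.example.com" -> ["com", "example", "foo"]
  node = labels.inject(forest) { |n, label| n[label] ||= {} }
  node[:match] = wildcard ? :subtree : :exact
  forest
end

def match_domain?(forest, domain)
  node = forest
  labels = domain.split(".").reverse
  labels.each_with_index do |label, i|
    node = node[label] or return false
    return true if node[:match] == :subtree
    return true if node[:match] == :exact && i == labels.size - 1
  end
  false
end

forest = {}
add_domain(forest, ".example.com")
add_domain(forest, "google.com")

match_domain?(forest, "sub.example.com")  # => true
match_domain?(forest, "example.com")      # => true
match_domain?(forest, "1example.com")     # => false
match_domain?(forest, "mail.google.com")  # => false

Lookups then cost one Hash access per label, regardless of how many entries the lists contain.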

I am thinking of loading some, if not all, ACLs into memory, and in that
case I will need it.

That’s certainly faster if memory is sufficient for the size of lists
you want to handle.

You might be able to write a procedure to convert regular
representation to reverse representation, make that a calculated
column using the procedure and create an index on that calculated
column. Alternatively calculate the value via a trigger. Then your
insertion and update logic can stay as is.

I was thinking about the same idea, but it would be much simpler for me to
store the domain fully reversed, such as "moc.elpmaxe".

Why?

irb(main):020:0> "foo.bar.baz".split(/\./).reverse
=> ["baz", "bar", "foo"]
irb(main):021:0> "foo.bar.baz".split(/\./).reverse.join('.')
=> "baz.bar.foo"

i have two objectives i want to achieve:

  1. a dstdomain ACL like in squid, for simple allow/deny.
  2. a dstdomain ACL from the squidGuard blacklists, to block porn, spyware
    and others based on category.

Well, if you make “allow” and “deny” (or only “deny”) a category then
it's just one mechanism. :-)

The blacklists are updated via a text file with exactly one domain or
domain wildcard per line, and I will use it as is.
So I will just use:
LOAD DATA INFILE '/tmp/porndoms.txt'
INTO TABLE porn (@var1)
SET dom = REVERSE(@var1);

This is not supposed to be a "readable" field and it's 30MB+ in size, so I
don't really care how it's ordered in the table.

Order in the table rarely matters with an RDBMS. It's the indexes which
are important.

The ACLs that the admin writes in "acltable" are another story, because
they are usually very short and must be readable for the admin as a
human.

Isn't it more important that the UI presents human-readable
information? But I agree, readable DB content helps in debugging
etc. That's why I suggested the reversed format based on domains.

Note: this is not exactly identical from a logical point of view - I
just assume that you do not have a dom which starts with “.”.

I heard this podcast about programming and the guy said "it's better to
have code that works and then improve it than not to have code at all".

I’d be careful with that. That philosophy only works for small
systems where it is easy to do a major rewrite. In other cases it
might bring you into a situation where you are stuck with an
architecture that does not fit the needs.

Cheers

robert

On 7/6/2012 9:21 AM, Robert K. wrote:

On Thu, Jul 5, 2012 at 10:40 PM, Eliezer C. [email protected] wrote:

I have used GreasySpoon as an ICAP server, but it's too much for my needs.
It takes a lot of memory and CPU for nothing, so I started writing my own
in Ruby, and then I found out that I can add some nice features to it.

That sounds fun!

It's nice, because the main reason was to coordinate two cache proxies.
In order to cache dynamic content (YouTube and some others) I needed to
"fake" the request on one proxy and then, when the fake is requested on
the other, serve the real one.
So I used ICAP to get the original URL from the first proxy and store it
in memDB/SQL, then rewrite a fake URL and send it back.
The proxy has an explicit rule that directs the fake-domain request
through proxy2, and proxy2 then asks the same ICAP server about the URL.
The ICAP server then rewrites the fake URL back to the original one.

The first cache thinks it serves the fake URL and stores it in memory.
The second will get the real one for proxy1.
What do I gain? The dynamic content is stored under a "static" URL in the
cache and can be served to other users.

store the domain fully reversed, such as "moc.elpmaxe".

Why?

irb(main):020:0> "foo.bar.baz".split(/\./).reverse
=> ["baz", "bar", "foo"]
irb(main):021:0> "foo.bar.baz".split(/\./).reverse.join('.')
=> "baz.bar.foo"

Instead of splitting and reversing, just a single reverse means less CPU.
It will give me the same functionality…

i have two objectives i want to achieve:

  1. a dstdomain ACL like in squid, for simple allow/deny.
  2. a dstdomain ACL from the squidGuard blacklists, to block porn, spyware
    and others based on category.

Well, if you make “allow” and “deny” (or only “deny”) a category then
it's just one mechanism. :-)

Yes, indeed it is one mechanism, and it is up and running as we speak.
But dstdomain and the blacklists are not the same:
dstdomain "example.com" matches only this exact domain and no others, but
in the blacklists "example.com" will also match subdomains, though not
"1example.com".
So I had to add some condition for it.

Order in the table rarely matters with an RDBMS. It's the indexes which are important.

ordered …as organized … as stored…
POTATO POTATO

The ACLs that the admin writes in "acltable" are another story, because
they are usually very short and must be readable for the admin as a
human.

Isn't it more important that the UI presents human-readable
information? But I agree, readable DB content helps in debugging
etc. That's why I suggested the reversed format based on domains.

Yes, indeed, the point is that the UI will present it, but I am not
writing any UI right now, and if the admin doesn't know how to work with
the command line it will be very hard to do anything with the server in
its current state.
The server still has some exceptions here and there that I have found.
And speaking of the devil: if I want to log Ruby exceptions to a specific
file, do you know anything about that?

robert

Well, this is the reason I am trying to:

  1. make it more modular by using methods that can be changed easily,
  2. think about efficiency,
  3. consult with others.

For now there is one guy who has asked me for that allow/deny ACL per
LDAP group policy.
So my main goals now are:

  1. fix bugs to make it bug-free (I have some that I know of and there
    might be others that I don't).
  2. add more accurate URL-match filtering than just host/domain.
  3. add user/IP DB integration for future filtering/ACL capabilities.
  4. improve the filtering based on categories/levels.
  5. add a form that will allow a user to report a false positive to the
    admin.
  6. add a "user custom allowed/denied domains/URLs list".
  7. create a category option for the "custom allowed/denied domains/URLs"
    so a user/admin can add user-specific allowed categories.
    For the above option I must really think more before implementing the
    filtering ACLs as levels or categories etc…
  8. content auditor module
    (I had in mind to add an option of "content
    inspector/inspecting/auditing".
    What I mean is to add a feature that will log requested
    URLs/domains/pages to a DB so some human inspection of the content can
    be done later.
    So in an environment like a small ISP/office that wants to build its
    own blacklists/categories based on users' browsing experience/habits,
    the "content auditor" will get the list from the DB somehow.)
  9. live URL/domain access statistics in a DB for admins.
    (squid has logs but not live statistics)

I had just one simple goal and it became more than just that, and I'm
happy about that.

any ideas on the subjects?

Eliezer

On Sat, Jul 7, 2012 at 5:32 AM, Eliezer C. [email protected]
wrote:

  1. add more accurate URL-match filtering than just host/domain.
    (I had in mind to add an option of "content inspector/inspecting/auditing".)

any ideas on the subjects?

There's probably so much that can be said about all of them, but I am
lacking time. My first impression was to proceed like this:

  1. write down all the business requirements in a structured way (for
    example, it seems to me that you want something like global and local
    lists, although I don't see that explicitly mentioned)
  2. hold a design session to come up with an architecture
  3. find out how you get from what you have to the new design /
    architecture

For example: it is not clear to me why you need an RDBMS in there if
you neither plan for large lists nor plan to use its features.

Kind regards

robert

On 7/10/2012 12:08 PM, Robert K. wrote:

  1. hold a design session to come up with an architecture
  2. find out how you get from what you have to the new design / architecture

For example: it is not clear to me why you need an RDBMS in there if
you neither plan for large lists nor plan to use its features.

Kind regards

robert

Thanks Robert,

"Hold a design session to come up with an architecture"? With whom? I'm
on it by myself.

It is not much of a "business" for me.
The main reason I wrote this server was one specific project: to help
two cache proxies coordinate some data about URLs.

As I was building it I saw that it's pretty simple to add these
features.

If I had a sponsor for this project I would be more than happy.

About the RDBMS: a DB is more for persistent storage and for interface
integration later.
Also, the lists I have are now one million rows long for just one
category.
I intend to make it efficient and to use some DB to add new sites to the
to-be-checked/categorized list.

I was working with a big filtering system before, but it was designed by
a "big" company that offers a complete solution for HTTP/SSL/HTTPS/P2P
filtering and shaping etc. on one machine.

I need someone to work on it with.

As I was testing my server, it seems like 4 workers can handle a load
of more than 25,000 concurrent requests and about 4,000 requests per
second on a dumb Intel Atom with 2GB RAM.

Thanks,
Eliezer

On 7/11/2012 9:40 AM, Jesus wrote:

On Wed, Jul 11, 2012 at 4:37 AM, Eliezer C. [email protected]
wrote:

about the RDBMS or a DB is more for persistent storage and interface
integration later.
and also the lists i have are now 1 Million rows long only for one category.
i intend to make it efficient and to use some DB to add new sites into the
to be checked\categorized list.

If you are not really using the relational part of an RDBMS, and you
only need some kind of persistence for "data structures", I would
recommend taking a look at Redis. It's a very fast storage engine
with first-level support for lists, sets, hashes, etc: http://redis.io

Jesus.

On Wed, Jul 11, 2012 at 9:22 AM, Eliezer C. [email protected]
wrote:

does it support indexing? reverse indexing?
It's basically a key-value data store, in which the value can be a
data structure: string, list, hash, set, sorted set, with first-level
support for atomic operations on those values: add to a set, get the
highest-score value in a sorted set, etc.
So, no, it doesn't support indexing as you would think about indexes
in a relational database. But if your data structures fit one of the
supported types, your life is made a lot easier out of the box.
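
For instance, an exact-match domain list maps naturally onto a Redis set; a minimal sketch with the redis-rb gem (the key name is made up for the example):

require "redis"

redis = Redis.new

# Load the exact-match entries once (key name is illustrative).
redis.sadd("blacklist:porn", "example.com")
redis.sadd("blacklist:porn", "foo.example.com")

# Membership test is a single O(1) command.
redis.sismember("blacklist:porn", "example.com")   # => true
redis.sismember("blacklist:porn", "1example.com")  # => false

Wildcard entries like ".example.com" would still need one of the prefix/suffix approaches discussed earlier in the thread.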

Jesus.

On 7/11/2012 10:49 AM, Jesus wrote:

On Wed, Jul 11, 2012 at 6:18 PM, Eliezer C. [email protected]
wrote:

I indeed looked at it.
It seems to fit my cache needs but not the filtering.
It really simplified many things for me with the "expire" (SETEX) command.
I need to store the key and the URL for a maximum of 3-6 minutes, so it
suits me perfectly.

Glad to hear that.

For more robust lookups I need a DB, and I can cache the results in Redis,
because it's a one-way ticket per domain for a long time.
I can cache a domain's result for half a day and it will do a really good
job.

My main problem is that I am using forks and the Redis Ruby library
doesn't like it.
Also, my EventMachine server has a nasty "cleanup" method that releases
any old object when ending a session, so I do have a problem with that.
If you have an idea on how to handle Redis with forks I will be very
happy.

I have no experience with the Redis Ruby library (I've only used Redis
from Java), but I guess the problem is that the library keeps a connection
open, and when you fork, reusing this connection across processes is a
problem. I don't know how this can be solved. If you are using this one
(redis/redis-rb on GitHub, a Ruby client library for Redis), it says to
seek help here: "You can also ask for help at #redis-rb on Freenode."

Maybe you could try to open the connection in each process
individually after the fork.
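
Something like this minimal sketch, assuming the plain redis-rb client (host, port, key and value are only illustrative):

require "redis"

4.times do
  fork do
    # a fresh connection per worker instead of sharing the parent's socket
    redis = Redis.new(host: "127.0.0.1", port: 6379)
    redis.setex("fake:url:12345", 300, "http://example.com/original")  # expires after 5 minutes
    # ... worker loop ...
  end
end
Process.waitall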

Jesus.


On 7/12/2012 9:34 AM, Jesus wrote:

On Wed, Jul 11, 2012 at 4:37 AM, Eliezer C. [email protected]
wrote:

"Hold a design session to come up with an architecture"? With whom? I'm
on it by myself.

Well, I did not necessarily mean that you need to do it with another
person (although that usually helps).

it is not much of a “business” for me.

Well, even if it's a hobby, there are "business requirements", i.e.
those that are related to the task to solve (as opposed to, e.g.,
technical requirements like the operating system).

i need someone to work on it with.

Well, maybe you make it an open source project and ask for
collaborators.

As I was testing my server, it seems like 4 workers can handle a load of
more than 25,000 concurrent requests and about 4,000 requests per second
on a dumb Intel Atom with 2GB RAM.

That sounds pretty good already.

Kind regards

robert

On Thu, Jul 12, 2012 at 6:30 PM, Eliezer C. [email protected]
wrote:

I also added a "cache" DB for my filtering engine using Redis, and it
lowered the disk/DB access by hundreds of percent.

That is impossible. You can reduce it by a hundred percent at most.

Kind regards

robert

On 7/12/2012 9:59 PM, Robert K. wrote:

Well, even if it's a hobby, there are "business requirements", i.e.

That sounds pretty good already.

Kind regards

robert

Well, it is an open-source project: GitHub - elico/echelon (an
EventMachine-based ICAP server).
I did some upgrades to the code that I haven't committed yet, and I'm
having some problems understanding how to work on certain things with
EventMachine.

I will try to put together a list of things such as requirements.

For now I am looking for a more experienced programmer who has used
EventMachine and can give me a hint or two.

And about the disk access you mentioned in the other mail:
it is not possible in the sense of 100% = disk access,
but I was talking about cache hits over disk hits.
If, say, a disk hit = 19 requests, then those 19 are the 100% disk access.
So in my case, if I get 950 requests that were supposed to need disk
access and only 19 actually hit the disk, I have more than a 100% cache
hit ratio.
To be exact, I had about 10,800 queries and of them only 2,008 were
served from disk, so it's a cache hit ratio of 80%;
so 100% disk access is 2,000, and 400% is compared to that 100% disk access.

i will post another mail about my eventmachine struggle.

Thanks,
Eliezer

On 7/13/2012 9:34 AM, Robert K. wrote:

Well, you are right, and I just tried to explain what I meant.
Usually I use the regular "hit ratio" percentage.
As I work with cache proxies a lot, I am used to the 100% scale.

Regards,
Eliezer

Please trim your quotes to contain only that part you are actually
referring to.

On Thu, Jul 12, 2012 at 9:39 PM, Eliezer C. [email protected]
wrote:

but I was talking about cache hits over disk hits.
If, say, a disk hit = 19 requests, then those 19 are the 100% disk access.
So in my case, if I get 950 requests that were supposed to need disk access
and only 19 actually hit the disk, I have more than a 100% cache hit ratio.

No, you don't. A cache hit ratio is never above 100%. You can reach
100% by preloading a cache, but you can never have more than 100%,
because a cache hit ratio is calculated as requests served from cache
divided by all accesses. You are contradicting yourself with what you
write below:

To be exact, I had about 10,800 queries and of them only 2,008 were
served from disk, so it's a cache hit ratio of 80%;

Correct.

so 100% disk access is 2,000, and 400% is compared to that 100% disk access.

You are calculating the reciprocal of the cache hit ratio and calling it
"cache hit ratio" as well. That does not make sense. I am unaware of
a name for this metric, but I believe it's better to stick to the
standards so people understand what you mean. That's what language
conventions and semantics are all about.
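
(For reference, a quick check with the numbers quoted above, restating the standard metric:)

total_queries = 10_800
disk_queries  = 2_008

(total_queries - disk_queries).to_f / total_queries  # => ~0.81, the "80%" hit ratio above
disk_queries.to_f / total_queries                    # => ~0.19, the share that still hit the disk
total_queries.to_f / disk_queries                    # => ~5.4, the inverse figure being debated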

Cheers

robert