Extract domain name

All,

I have the same basic issue as discussed in this thread last year:
http://groups.google.com/group/comp.lang.ruby/browse_thread/thread/77cfd3250a633c7e/0430f50818a01ddd.

Justin C. points out the greatest difficulty with the situation,
i.e. that when dealing with a country code TLD, one may well have a
different number of parts (e.g. example.co.uk) than when dealing with
a gTLD (example.com).

The only solution that has occurred to me is to have a list of known
TLDs and second level domains (e.g. co.uk) that are insufficiently
specific, requiring a subdomain for additional specificity. The
problem is that this requires maintenance as well as initial research.
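
To make that concrete, here is roughly the kind of lookup I have in
mind; the suffix list is just a placeholder and the helper name is
made up:

# Sketch: match the host against a hand-maintained list of suffixes,
# longest suffixes first, and keep one extra label in front of the match.
KNOWN_SUFFIXES = %w[com net org co.uk org.uk uk.com].sort_by { |s| -s.count('.') }

def registrable_domain(host)
  labels = host.downcase.split('.')
  KNOWN_SUFFIXES.each do |suffix|
    suffix_labels = suffix.split('.')
    next unless labels.last(suffix_labels.size) == suffix_labels
    # Need at least one label in front of the suffix.
    return labels.last(suffix_labels.size + 1).join('.') if labels.size > suffix_labels.size
  end
  nil # unknown suffix: this is where the list maintenance bites
end

registrable_domain('www.example.co.uk') # => "example.co.uk"
registrable_domain('www.example.com')   # => "example.com"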

Does anyone have any suggestions for an alternative method to solve
this problem? I’m currently using Addressable::URI
(http://addressable.rubyforge.org/api/classes/Addressable/URI.html) to
parse the URLs and extract the host names.
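
For reference, the parsing step itself is simple; it is deciding how
much of the host to keep that I am stuck on (example URL made up):

require 'addressable/uri'

# Addressable::URI parses the URL and hands back the full host;
# the hard part is trimming that host down to the registered domain.
host = Addressable::URI.parse('http://www.example.co.uk/some/path').host
# => "www.example.co.uk"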

Charles C. wrote:

[snip quoted question]

I think the best way will actually be to match against a list of
TLDs and gTLDs.

Mozilla has a list of domains:
http://mxr.mozilla.org/mozilla/source/netwerk/dns/src/effective_tld_names.dat
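
Something along these lines should work for loading it; this is only
a rough sketch that ignores the wildcard (*.) and exception (!) rules
in that file:

require 'set'

# Build a suffix set from Mozilla's effective_tld_names.dat.
# Comment lines start with "//"; wildcard and exception rules are
# skipped here, so this is not a complete implementation.
suffixes = File.readlines('effective_tld_names.dat').map(&:strip).reject { |line|
  line.empty? || line.start_with?('//', '*', '!')
}.to_set

def registrable_domain(host, suffixes)
  labels = host.downcase.split('.')
  # Try the longest candidate suffix first, so "co.uk" beats "uk".
  (1...labels.size).each do |i|
    return labels[(i - 1)..-1].join('.') if suffixes.include?(labels[i..-1].join('.'))
  end
  nil
end

registrable_domain('www.example.co.uk', suffixes) # => "example.co.uk"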

A stackoverflow question on the same topic:
regex - How to get domain name from URL - Stack Overflow

Their solution is regex.
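
As I understand it, the trick is to generate one big alternation from
the suffix list instead of writing the regex by hand; a toy version
with a tiny made-up list would look something like this:

# Toy version of the generated-regex idea: escape each suffix and join
# them into one end-anchored alternation, longest entries first.
suffixes = %w[co.uk org.uk uk.com com net org]
alternation = suffixes.sort_by { |s| -s.length }.map { |s| Regexp.escape(s) }.join('|')
domain_re = /([^.]+\.(?:#{alternation}))\z/

'www.example.co.uk'[domain_re, 1]  # => "example.co.uk"
'www.example.uk.com'[domain_re, 1] # => "example.uk.com"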

And remember that some things which look like domains have set
themselves up as registries - e.g. uk.com

On Fri, 20 Aug 2010 01:02:56 -0500, Mr zengr wrote:

[snip my question about extracting the domain name (e.g. “example.com”
from “www.example.com”).]

I think the best way will actually be to match against a list of
TLDs and gTLDs.

Mozilla has a list of domains:
http://mxr.mozilla.org/mozilla/source/netwerk/dns/src/effective_tld_names.dat

Wow. That’s a big help. Thanks.

A stackoverflow question on the same topic:
regex - How to get domain name from URL - Stack Overflow

Interesting.

Their solution is regex.

As the poster pointed out, matching everything leads to a huge regex,
which is likely to cause maintenance problems (though he indicated
that they have started generating the regex from other data to address
that). It also makes me concerned about resource usage, though I
couldn’t find anything in the core Ruby documentation about a maximum
length for a regex.

On the other hand it might be more performant than looping through a
bunch of substring matches or matching against database records. I
sense some testing in my future.
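
If I do test it, the comparison will probably look something like
this (tiny placeholder suffix list, arbitrary iteration count):

require 'benchmark'
require 'set'

# Toy comparison: generated regex vs. a Set lookup over split labels.
suffixes   = %w[com net org co.uk org.uk uk.com]
suffix_set = suffixes.to_set
domain_re  = /([^.]+\.(?:#{suffixes.map { |s| Regexp.escape(s) }.join('|')}))\z/
host       = 'www.example.co.uk'

Benchmark.bm(10) do |x|
  x.report('regex')  { 100_000.times { host[domain_re, 1] } }
  x.report('lookup') do
    100_000.times do
      labels = host.split('.')
      (1...labels.size).find { |i| suffix_set.include?(labels[i..-1].join('.')) }
    end
  end
end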

Thanks,

On Sun, 22 Aug 2010 22:35:31 -0500, Michael F. wrote:

On Mon, Aug 23, 2010 at 10:40 AM, Charles C. wrote:

On Fri, 20 Aug 2010 01:02:56 -0500, Mr zengr wrote:

[snip my question about extracting the domain name (e.g. “example.com”
from “www.example.com”).]

I think the best way will actually be to match against a list of
TLDs and gTLDs.

[snip]

https://github.com/pauldix/domainatrix - “A cruel mistress that uses
the public suffix domain list to dominate URLs by canonicalizing,
finding the public suffix, and breaking them into their domain parts.”

For a minute, I thought your reply was generated by a porn spam bot
until I saw github in the URL. :)

For those reading the thread, this is a gem that uses
http://publicsuffix.org/ to parse domain names and identify the suffix
(e.g. “com” or “co.uk”), domain, subdomains, etc. It extends
Addressable::URI.
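
In case it helps anyone else reading later, usage looks roughly like
this; the accessor names are from my reading of the README, so
double-check them against the version you install:

require 'domainatrix'

url = Domainatrix.parse('http://www.example.co.uk/some/path')
url.public_suffix # => "co.uk"  (accessor names may differ by version)
url.domain        # => "example"
url.subdomain     # => "www"
[url.domain, url.public_suffix].join('.') # => "example.co.uk"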

That was very helpful. Thanks.
