Regex problem

jason_sam · June 2, 2014, 7:52pm

Hello,

I am trying to find a regex that matches only the domain name, without
the .com or .nl

I tried :

\w{3+}

But on http://www.tamarawobben.nl/testhier it finds http, tamarawobben
and testhier

How do I solve this?

Roelof

roelof · June 2, 2014, 9:19pm

On 14-06-02, 10:51, Roelof W. wrote:

I am trying to find a regex that matches only the domain name, without
the .com or .nl

I tried :

\w{3+}

You need to limit your expression more: your example will find ANY 3
“word” characters, including digits and underscores, even if there are
MORE word characters around them.
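
A quick check in irb shows why the pattern matches too much. This is only a sketch, and it assumes the pattern was meant to be \w{3,} (three or more word characters), since {3+} is not read as a repetition count:

```ruby
url = "http://www.tamarawobben.nl/testhier"

# Assumption: the intended pattern was \w{3,}, not \w{3+}.
# scan returns every run of three or more word characters in the URL.
all_runs = url.scan(/\w{3,}/)

# \w matches letters, digits and the underscore, but not "-" or ".".
word_chars = "a1_-.".scan(/\w/)

p all_runs
p word_chars
```

Every run of word characters that is long enough is reported, so the scheme and the path show up next to the domain name.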

You could do this with a capture group (here, named “tld”):

 regex = %r{
   \.            # one dot,
   (?<tld>       # capture as "tld":
     [a-z]{3,}   # 3+ alpha characters (note: not \w)
   )
   $             # at the end of a line/string
 }x
 "http://example.com".match(regex)[:tld]

Or with a positive look-behind:

 regex = %r{
   (?<=\.)    # lookbehind for one dot,
   [a-z]{3,}  # match 3+ alpha characters
   $          # at the end of a line/string
 }x
 "http://example.com".match(regex)

Another approach is using the URI library:

 require "uri"
 URI.parse("http://example.com/").host.split(".").last

Andrew V.

roelof · June 2, 2014, 11:15pm

rubular.com is a great site for testing regexes. Here is one for the
last regex given by Andrew V.:

Good luck

roelof · June 2, 2014, 9:33pm

Andrew V. wrote on 2-6-2014 21:18:

regex =

Thanks,

But if I try all three in an online Ruby interpreter, they do not give
any answer.

Roelof

roelof · June 3, 2014, 8:41am

On Mon, Jun 2, 2014 at 7:51 PM, Roelof W. wrote:

How do I solve this?
That depends on your input. Do you want to find those domain names in
a larger text? Are you trying to parse URIs? Do you have fully qualified
domain names from which you want to extract a portion?

Kind regards

robert

roelof · June 2, 2014, 11:18pm

On Mon, Jun 2, 2014 at 2:18 PM, Andrew V. wrote:

You need to limit your expression more, your example will find ANY 3
$ # at the end of a line/string
"http://example.com".match(regex)

Another approach is using the URI library:
URI.parse("http://example.com/").host.split(".").last
Andrew V.

OP seems to want the domain name without the TLD or subdomains. Andrew’s
URI solution actually seems quite the best (really no sense in rewriting
well-written regexps), but instead of the last part of the host, you’ll
want the penultimate part. Several ways you can get that. Here’s one:

URI.parse("http://www.tamarawobben.nl/testhier").host.split(".")[-2]
#=> "tamarawobben"
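
Wrapped in a small method, that idea might look like the sketch below. The method name domain_label is made up for this example, and it assumes a one-part TLD such as .nl or .com; a host ending in something like .co.uk would need extra handling.

```ruby
require "uri"

# Hypothetical helper (not from the thread): returns the label that sits
# just before the TLD, or nil when the string has no host part.
def domain_label(url)
  host = URI.parse(url).host
  return nil if host.nil?

  host.split(".")[-2]
end

with_www    = domain_label("http://www.tamarawobben.nl/testhier")
without_www = domain_label("http://tamarawobben.nl/testhier")

puts with_www
puts without_www
```

Because the label is counted from the end of the host, it does not matter whether a www or another subdomain is in front.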

roelof · June 3, 2014, 10:12am

Roelof W. wrote on 3-6-2014 8:55:

When I do (<?=/[.|/]) or (<?=/[.|//]) I see a message that I have to
escape the /

Roelof

I tried this one (?<=[.|//)(.*?)(?=.)
but I still get the error message that there are unescaped backslashes.

Roelof
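
The message most likely appears because the pattern is written between /.../ delimiters, where every literal / has to be written as \/. Ruby's %r{...} literal avoids that. Note also that a lookbehind is spelled (?<=...), and that inside [...] a | is a literal pipe character, not "or". A minimal sketch using the URL from the first post:

```ruby
url = "http://www.tamarawobben.nl/testhier"

# Between /.../ each forward slash in the pattern must be escaped.
with_slashes = /(?<=\/\/)[^\/]+/

# With %r{...} the same pattern needs no escaping at all.
with_percent = %r{(?<=//)[^/]+}

# Both take everything after "//" up to the next "/".
host_a = url[with_slashes]
host_b = url[with_percent]

puts host_a
puts host_b
```

Both patterns return the full host name, which can then be taken apart further.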

roelof · June 3, 2014, 8:56am

Robert K. wrote on 3-6-2014 8:41:

testhier

How do I solve this?
That depends on your input. Do you want to find those domain names in
a larger text? Are you trying to parse URIs? Do you have fully qualified
domain names from which you want to extract a portion?

Kind regards

robert

I'm a little bit further.
I have this : (?<=.)(.*?)(?=.)

It seems to work, except that I have to make sure the .*? does not
include the /.
And in the (<?=/.) I have to find a way to include the //

When I do (<?=/[.|/]) or (<?=/[.|//]) I see a message that I have to
escape the /

Roelof

roelof · June 3, 2014, 1:29pm

On Tue, Jun 3, 2014 at 10:11 AM, Roelof W. wrote:

Roelof W. wrote on 3-6-2014 8:55:

Robert K. wrote on 3-6-2014 8:41:

On Mon, Jun 2, 2014 at 7:51 PM, Roelof W. wrote:

…

I tried this one (?<=[.|//)(.*?)(?=.)
but I still get the error message that there are unescaped backslashes.

Please stop full-quoting - especially if you are not referring in any
way to the quoted text. Thank you.

Regards

robert

roelof · June 3, 2014, 6:36pm

On 14-06-02, 23:41, Robert K. wrote:

That depends on your input. Do you want to find those domain names in
a larger text? Are you trying to parse URIs? Do you have fully qualified
domain names from which you want to extract a portion?

URI can extract from larger texts (URI.extract), parse URIs (URI.parse),
and after that it’s easy to split the domain parts from the
fully-qualified hostnames. Really, I don’t think there’s any point in
reinventing this using a Regexp… unless it’s just a learning exercise.

Andrew V.
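
As a rough illustration of that combination, here is a sketch with a made-up sample text. URI.extract is still available, although it is marked obsolete in recent Ruby versions, and the [-2] step assumes a one-part TLD.

```ruby
require "uri"

# Made-up sample text, only for illustration.
text = "See http://www.tamarawobben.nl/testhier and http://example.com/ for details"

# URI.extract pulls URL-looking strings out of free text;
# the second argument limits the schemes that are accepted.
urls = URI.extract(text, ["http", "https"])

# URI.parse gives the host; the second-to-last label is the domain name.
names = urls.map { |u| URI.parse(u).host.split(".")[-2] }

p urls
p names
```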

roelof · June 3, 2014, 7:15pm

On Tue, Jun 3, 2014 at 6:50 PM, Roelof W. wrote:

But I think I will use a regex for finding the full domain and then use
http://<subdomain>.tamarawobben.nl/index.html

where all three tamarawobben.nl must be found.

Roelof

I tried to do it with a single regexp and couldn’t get anything
useful, so instead I first used a regexp to extract the part
between the slashes (between http:// and the following /) and then
called split on “.” on the result. This way is much simpler. I’m not
going to give you the solution, so that you can try this approach
yourself, as this is a learning exercise.

Let me know if you get stuck.

Jesus.

roelof · June 3, 2014, 6:50pm

On 3 June 2014 at 18:36, Andrew V. wrote:

On 14-06-02, 23:41, Robert Klemme wrote:
> That depends on your input. Do you want to find those domain names in
> a larger text? Are you trying to parse URIs? Do you have fully qualified
> domain names from which you want to extract a portion?

URI can extract from larger texts (URI.extract), parse URIs (URI.parse),
and after that it's easy to split the domain parts from the
fully-qualified hostnames. Really, I don't think there's any point in
reinventing this using a Regexp... unless it's just a learning exercise.

Andrew V.

This is a learning exercise from Codewars.

But I think I will use a regex for finding the full domain and then use split to find only the part before the .com and so on.

I tried, and I think it's very difficult to find a single regex that can handle all of these URLs:

http://www.tamarawobben.nl/index.html

http://tamarawobben.nl/index.html

http://<subdomain>.tamarawobben.nl/index.html

where all three tamarawobben.nl must be found.

Roelof
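
For anyone who wants to check their own attempt afterwards, the two-step plan discussed in this thread (a regex for the host, then split) could be sketched as follows. The names HOST_PATTERN and domain_name are invented for this example, and "blog" stands in for an arbitrary subdomain.

```ruby
# Step 1: capture everything between "//" and the next "/" (the host).
HOST_PATTERN = %r{//([^/]+)}

# Step 2: split the host on dots and take the second-to-last label.
def domain_name(url)
  host = url[HOST_PATTERN, 1]
  return nil if host.nil?

  host.split(".")[-2]
end

urls = [
  "http://www.tamarawobben.nl/index.html",
  "http://tamarawobben.nl/index.html",
  "http://blog.tamarawobben.nl/index.html" # "blog" is a made-up subdomain
]

names = urls.map { |url| domain_name(url) }
p names
```

Keeping the regex down to "find the host" leaves the dot handling to split, which is why the three variants need no special cases.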