Email Address Regex

lukfugl · January 4, 2006, 12:45am

On 1/3/06, Dan K. wrote:

Here’s a rails example for validating email addresses.

validates_format_of :login, :with => /
^[-^!$#%&'+/=?`{|}~.\w]+
@a-zA-Z0-9
(.a-zA-Z0-9*)+$/x,
:message => "must be a valid email address",
:on => :create

Be careful with email validation via regex, it’s harder than you might
think[1][2]:

/^([a-zA-Z0-9&_?\/`!|#*$^%=~{}+'-]+|"([\x00-\x0C\x0E-\x21\x23-\x5B\x5D-\x7F]|\\[\x00-\x7F])*")(\.([a-zA-Z0-9&_?\/`!|#*$^%=~{}+'-]+|"([\x00-\x0C\x0E-\x21\x23-\x5B\x5D-\x7F]|\\[\x00-\x7F])*"))*@([a-zA-Z0-9&_?\/`!|#*$^%=~{}+'-]+|\[([\x00-\x0C\x0E-\x5A\x5E-\x7F]|\\[\x00-\x7F])*\])(\.([a-zA-Z0-9&_?\/`!|#*$^%=~{}+'-]+|\[([\x00-\x0C\x0E-\x5A\x5E-\x7F]|\\[\x00-\x7F])*\]))*$/

Jacob F.

[1] From
http://phantom.byu.edu/pipermail/uug-list/2004-January/009707.html
[2] That regex needs some serious /x treatment, which I didn’t know
about at the time it was written.
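
For what it's worth, here is a rough sketch of what that /x treatment might look like in Ruby. The constant names (ATOM, QTEXT, and so on) are only illustrative labels loosely following the RFC 822 token names, not anything from the original post, and the sketch anchors with \A and \z instead of ^ and $ because in Ruby the latter match at line boundaries rather than at the ends of the string.

```ruby
# Sketch only: the addr-spec pattern assembled from named pieces.
# All constant names here are illustrative assumptions.
ATOM           = /[a-zA-Z0-9&_?\/`!|#*$^%=~{}+'-]+/
QTEXT          = /[\x00-\x0C\x0E-\x21\x23-\x5B\x5D-\x7F]/   # no CR, quote, or backslash
DTEXT          = /[\x00-\x0C\x0E-\x5A\x5E-\x7F]/            # no CR, brackets, or backslash
QUOTED_PAIR    = /\\[\x00-\x7F]/
QUOTED_STRING  = /"(?:#{QTEXT}|#{QUOTED_PAIR})*"/
DOMAIN_LITERAL = /\[(?:#{DTEXT}|#{QUOTED_PAIR})*\]/
WORD           = /#{ATOM}|#{QUOTED_STRING}/
SUB_DOMAIN     = /#{ATOM}|#{DOMAIN_LITERAL}/

ADDR_SPEC = /
  \A
  #{WORD} (?: \. #{WORD} )*               # local-part
  @
  #{SUB_DOMAIN} (?: \. #{SUB_DOMAIN} )*   # domain
  \z
/x

puts ADDR_SPEC.match?("name+tag@example.com")
puts ADDR_SPEC.match?("no-at-sign")
```

Each interpolated piece is wrapped in its own group by Ruby, so the alternations inside WORD and SUB_DOMAIN cannot leak into the surrounding pattern.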

lukfugl · January 4, 2006, 11:24am

http://tfletcher.com/lib/rfc822.rb

(doesn’t look quite as messy)

lukfugl · January 4, 2006, 1:15pm

Hi –

On Wed, 4 Jan 2006, Jacob F. wrote:

Be careful with email validation via regex, it’s harder than you might
think[1][2]:

/^([a-zA-Z0-9&_?\/`!|#*$^%=~{}+'-]+|"([\x00-\x0C\x0E-\x21\x23-\x5B\x5D-\x7F]|\\[\x00-\x7F])*")(\.([a-zA-Z0-9&_?\/`!|#*$^%=~{}+'-]+|"([\x00-\x0C\x0E-\x21\x23-\x5B\x5D-\x7F]|\\[\x00-\x7F])*"))*@([a-zA-Z0-9&_?\/`!|#*$^%=~{}+'-]+|\[([\x00-\x0C\x0E-\x5A\x5E-\x7F]|\\[\x00-\x7F])*\])(\.([a-zA-Z0-9&_?\/`!|#*$^%=~{}+'-]+|\[([\x00-\x0C\x0E-\x5A\x5E-\x7F]|\\[\x00-\x7F])*\]))*$/

See also: Mail::RFC822::Address

David

–
David A. Black
[email protected]

“Ruby for Rails”, from Manning Publications, coming April 2006!

lukfugl · January 4, 2006, 2:16pm

On Jan 4, 2006, at 12:47, Andreas S. wrote:

|#$^%=~{}+’-]+|[([\x00-\x0C\x0E-\x5A\x5E-\x7F]|\[\x00-\x7F])])
(.
([a-zA-Z0-9&_?/`!|#$^%=~{}+’-]+|[([\x00-\x0C\x0E-\x5A\x5E-\x7F]|
\[
\x00-\x7F])]))*$/

It is trivial to create a formally correct address that makes
absolutely
no sense, so what’s the point of doing such a complicated and
error-prone validation?

Job security? I mean, without pointer arithmetic and its associated
mysteries (negative array indices were a personal favourite), we need
something to keep us gainfully employed!

matthew smillie.

lukfugl · January 4, 2006, 6:12pm

By “error prone” do you mean that it won’t detect addresses that don’t
exist?

Is it not still better to catch some errors than none at all?

lukfugl · January 4, 2006, 1:47pm

Jacob F. wrote:

On 1/3/06, Dan K. wrote:

Here’s a rails example for validating email addresses.

validates_format_of :login, :with => /
^[-^!$#%&'+/=?`{|}~.\w]+
@a-zA-Z0-9
(.a-zA-Z0-9*)+$/x,
:message => "must be a valid email address",
:on => :create

Be careful with email validation via regex, it’s harder than you might
think[1][2]:

/^([a-zA-Z0-9&_?\/`!|#*$^%=~{}+'-]+|"([\x00-\x0C\x0E-\x21\x23-\x5B\x5D-\x7F]|\\[\x00-\x7F])*")(\.([a-zA-Z0-9&_?\/`!|#*$^%=~{}+'-]+|"([\x00-\x0C\x0E-\x21\x23-\x5B\x5D-\x7F]|\\[\x00-\x7F])*"))*@([a-zA-Z0-9&_?\/`!|#*$^%=~{}+'-]+|\[([\x00-\x0C\x0E-\x5A\x5E-\x7F]|\\[\x00-\x7F])*\])(\.([a-zA-Z0-9&_?\/`!|#*$^%=~{}+'-]+|\[([\x00-\x0C\x0E-\x5A\x5E-\x7F]|\\[\x00-\x7F])*\]))*$/

It is trivial to create a formally correct address that makes absolutely
no sense, so what’s the point of doing such a complicated and
error-prone validation?

lukfugl · January 4, 2006, 6:17pm

Tim F. wrote:

By “error prone” do you mean that it won’t detect addresses that don’t
exist?

No, I mean that it might declare some addresses invalid although they
aren’t.

lukfugl · January 4, 2006, 6:13pm

On 1/4/06, Tim F. wrote:

http://tfletcher.com/lib/rfc822.rb

(doesn’t look quite as messy)

Yeah, as I said in the footnote, the regex I posted needed some
readability treatment. Yours looks pretty nice, and exactly equivalent
except for a typo in quoted_pair:

- quoted_pair = '\x5c\x00-\x7f'

+ quoted_pair = '\x5c[\x00-\x7f]'

Jacob F.
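
To make the difference concrete, here is a small sketch. It assumes the patterns are kept as single-quoted strings and compiled with Regexp.new; the variable names and the \A...\z wrapping are mine, added only for the demonstration. Without the brackets the pattern is a fixed four-character sequence (backslash, NUL, hyphen, DEL) rather than "a backslash followed by any ASCII character".

```ruby
# Single-quoted strings keep the backslashes, so Regexp.new sees \x5c etc.
typo_quoted_pair  = '\x5c\x00-\x7f'     # backslash, NUL, "-", DEL in sequence
fixed_quoted_pair = '\x5c[\x00-\x7f]'   # backslash, then any one ASCII character

# Anchored wrappers added for this demo only.
typo  = Regexp.new("\\A(?:#{typo_quoted_pair})\\z")
fixed = Regexp.new("\\A(?:#{fixed_quoted_pair})\\z")

escaped_quote = '\\"'   # two characters: a backslash and a double quote

puts fixed.match?(escaped_quote)   # true
puts typo.match?(escaped_quote)    # false
```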

lukfugl · January 4, 2006, 6:17pm

On 1/4/06, David A. Black wrote:

See also: Mail::RFC822::Address

Yeah, I’ve seen that one as well. My regex is only meant to match the
definition of an ‘addr-spec’ token (described as “global” or “simple”
address) in section 6.1 of the RFC822 grammar, as opposed to a
‘mailbox’ or ‘address’. I figure people aren’t going to type the “John
Doe <[email protected]>” format into a form, nor named lists (the ‘group’
token in the grammar).

Jacob F.

lukfugl · January 4, 2006, 6:26pm

On 1/4/06, Andreas S. wrote:

Tim F. wrote:

By “error prone” do you mean that it won’t detect addresses that don’t
exist?

No, I mean that it might declare some addresses invalid although they
aren’t.

You’ll see from my comments in the original post[1] and in my reply to
David Black in the other thread[2] that this regex is indeed compliant
with a single, non-named address as defined by the RFC[3].

Jacob F.

[1] http://phantom.byu.edu/pipermail/uug-list/2004-January/009707.html
[2] [ruby-talk:174081]
[3] RFC 822 - STANDARD FOR THE FORMAT OF ARPA INTERNET TEXT MESSAGES (RFC822)

lukfugl · January 4, 2006, 6:56pm

Jacob F. wrote:

On 1/4/06, Andreas S. wrote:

Tim F. wrote:

By “error prone” do you mean that it won’t detect addresses that don’t
exist?

No, I mean that it might declare some addresses invalid although they
aren’t.

You’ll see from my comments in the original post[1] and in my reply to
David Black in the other thread[2] that this regex is indeed compliant
with a single, non-named address as defined by the RFC[3].

Possibly. Still, I prefer a simple solution over a complicated one. What
type of errors do you hope to catch with this huge regex? Typing errors?
Deliberately entered rubbish? The regex accepts just about anything with
a “@”, e.g. “$@$”.

lukfugl · January 4, 2006, 7:21pm

On 1/4/06, Andreas S. wrote:

David Black in the other thread[2] that this regex is indeed compliant
with a single, non-named address as defined by the RFC[3].

Possibly. Still, I prefer a simple solution over a complicated one. What
type of errors do you hope to catch with this huge regex? Typing errors?
Deliberately entered rubbish? The regex accepts just about anything with
a “@”, e.g. “$@$”.

Not possibly. Guaranteed. It’s compliant with the portions of the RFC I
mentioned.

Still, I’ll concede it doesn’t prevent rubbish from being entered. The
domain of valid email addresses is much larger than the domain of
actual email addresses. I’m not claiming that this regex should even
be used for form validation. I dislike email validation, period. My
intent in first writing the regex two years ago and bringing it up
again now is mostly:

1. To show off my regex-fu
2. To demonstrate the inadequacy of simplistic regexes for email
   validation.

For instance, I’ll often use the “name+tag@domain” construct to filter
mail and/or determine who’s selling my address. When I find a form
that claims that email address is invalid, I get upset. As such, I’ve
taken it as my own personal crusade to punch down inadequate email
validations whenever I see them. My method is to demonstrate a regex
that does allow valid addresses. My first hope is that they’ll notice
the futility and just remove the email address validation altogether.
If that fails, I hope they’ll actually use the compliant regex.

The only reason I defended the regex was because you claimed it was
invalid. If your original argument had been that the regex was
unnecessary, I’d probably have agreed with you. Validating email
addresses by form is pointless. If someone doesn’t want to give you
their address, they won’t. Requiring them to input a valid fake
address instead of an invalid fake address doesn’t improve your data
at all. The only reason I can see that being necessary is to prevent
malformed addresses from breaking your application in some way. But if
that’s a problem, fix the application, not the email address.

Jacob F.
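
A quick illustration of the “name+tag@domain” complaint. Both patterns below are hypothetical and not taken from any post in this thread: SIMPLISTIC stands in for the kind of validator that rejects tagged addresses, and LENIENT is one possible "barely validate at all" alternative.

```ruby
# Hypothetical "simple" validator of the kind being criticized.
SIMPLISTIC = /\A[\w.]+@[\w.]+\z/

# Hypothetical lenient check: exactly one "@" with something non-blank on each side.
LENIENT = /\A[^@\s]+@[^@\s]+\z/

tagged = "name+tag@example.com"

puts SIMPLISTIC.match?(tagged)   # false: "+" is not in [\w.]
puts LENIENT.match?(tagged)      # true
```

Neither pattern says anything about whether the mailbox exists; the lenient one only avoids rejecting addresses that are valid.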

lukfugl · January 4, 2006, 7:30pm

From: “Jacob F.”

I figure people aren’t going to type the “John
Doe <[email protected]>” format into a form

In my experience, a certain percentage do. I’m guessing
it might be because they copied their email address out
of something like Outlook Express, and pasted it into the
form. (OE will display “John D.” as a hyperlink, which if
selected and copied turns into “John D. <[email protected]>”
in the clipboard.)

Personally, right or wrong, to catch that I just reject
email addresses with a “<” or “>” in them. I’ll admit I
don’t really care if some spec says it’s possible to legally
form email addresses with those characters. That may make
me a bad person. But whoever wrote that spec should be
infested with the fleas of 1000 camels.

Regards,

Bill

lukfugl · January 4, 2006, 8:15pm

Quoting “Andreas S.”:

It is trivial to create a formally correct address that makes
absolutely no sense, so what’s the point of doing such a
complicated and error-prone validation?

Well, I might actually have one.

The comment form on my web site sends email directly to me; as a
convenience, the email address entered on the form becomes the
email’s From address (I can see who it’s from and reply more
easily).

Now, doing that would open up all sorts of injection attacks if I
didn’t do any validation. So I do a quick and paranoid (syntactic)
validity check – if the address fails, then it is included in the
body of the message instead of a header field.

In this case, a nonsensical address is perfectly fine (I will see it
and know better), and it’s even okay if a valid address is rejected
(I’ll still get the message and be able to figure things out from
the body), but I have to be able to detect syntactically invalid
addresses.

-mental
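
A minimal sketch of that fallback, under stated assumptions: SAFE_ADDRESS, comment_email, and the owner@example.com recipient are all hypothetical names, and the whitelist pattern is deliberately stricter than the RFC. It is not the code behind the comment form described above. Anything that fails the check is shown in the body, where a stray CR or LF cannot start a new header.

```ruby
# Paranoid on purpose: a small whitelist, no whitespace or control characters.
# \A and \z are used so that a trailing newline cannot slip past the check.
SAFE_ADDRESS = /\A[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\z/

# Hypothetical helper that builds the raw message text.
def comment_email(address, comment, recipient: "owner@example.com")
  headers = ["To: #{recipient}", "Subject: Site comment"]
  body = comment
  if SAFE_ADDRESS.match?(address)
    headers << "From: #{address}"
  else
    # inspect renders CR and LF as visible escapes instead of real line breaks
    body = "Claimed sender: #{address.inspect}\n\n#{comment}"
  end
  headers.join("\r\n") + "\r\n\r\n" + body
end

puts comment_email("evil@example.com\r\nBcc: victim@example.com", "hello")
```

A valid but unusual address may be rejected by this pattern, which is acceptable here because the message is still delivered with the address visible in the body.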

lukfugl · January 4, 2006, 8:12pm

Bill K. wrote:

But whoever wrote that spec should be
infested with the fleas of 1000 camels.

He probably already is.

Hal

lukfugl · January 4, 2006, 8:30pm

Hal F. writes:

Bill K. wrote:

But whoever wrote that spec should be
infested with the fleas of 1000 camels.

He probably already is.

Hal

In any case, much of the syntax introduced in RFC 822 has been made
obsolete by RFC 2822.

The complexity of RFC 822 (1982) came from the need to interoperate
with wildly different systems. Consider that the RFC’s title was
“STANDARD FOR THE FORMAT OF ARPA INTERNET TEXT MESSAGES”, as if there
were other Internets, so that it was necessary to specify which one.

Consider the title of RFC 2822 (2001): “Internet Message Format”. By
then it was clear that the ARPA Internet was the winner, so the
standard could afford to simplify the address syntax.

YS.

lukfugl · January 4, 2006, 10:31pm

On Thu, Jan 05, 2006 at 03:27:48AM +0900, Bill K. wrote:

in the clipboard.)

It’s possible that someone might copy and paste something in that format
from a number of other, non-Windows email clients, too – though it’s
more difficult to do so by accident, since generally clients like mutt
won’t drop more stuff into your copy/paste buffer than you actually
highlighted.

–
Chad P. [ CCD CopyWrite | http://ccd.apotheon.org ]

unix virus: If you’re using a unixlike OS, please forward
this to 20 others and erase your system partition.

lukfugl · January 4, 2006, 11:29pm

Here’s my useful form validation:

/^\s*([-a-z0-9&'+./=?^_{}~]+@(a-z0-9?.)+[a-z]{2,5}\s(,\s*|\z))+$/i

It may not catch EVERYTHING, but should work just fine for most
people. It will allow multiple email addresses separated by commas.

I figure if you want to go beyond that, a verification system would be
the next logical step.

-Jeff
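
A different way to get the same "several addresses separated by commas" behaviour is to split first and check each piece, sketched below. SINGLE and valid_address_list? are hypothetical names, and the per-address pattern is only loosely modeled on the regex above; its exact character classes are an assumption.

```ruby
# Hypothetical per-address pattern: local part, dotted domain labels, 2-5 letter TLD.
SINGLE = /\A[-a-z0-9&'+.\/=?^_{}~]+@(?:[-a-z0-9]+\.)+[a-z]{2,5}\z/i

# Hypothetical helper: true when the input holds one or more acceptable addresses.
def valid_address_list?(input)
  addresses = input.split(",").map(&:strip)
  !addresses.empty? && addresses.all? { |address| SINGLE.match?(address) }
end

puts valid_address_list?("a@example.com, b.c@mail.example.org")
puts valid_address_list?("a@example.com, not-an-address")
```

The {2,5} limit on the last label is carried over from the posted pattern and would reject longer top-level domains such as .museum.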

lukfugl · January 4, 2006, 10:00pm

Jacob F. wrote:

The only reason I defended the regex was because you claimed it was
invalid.

I don’t remember that. I dislike complex solutions like this regex
because they are error-prone (as proved by your correction to Tim’s
rfc822.rb), but I didn’t claim yours was invalid.

If your original argument had been that the regex was
unnecessary, I’d probably have agreed with you. Validating email
addresses by form is pointless. If someone doesn’t want to give you
their address, they won’t. Requiring them to input a valid fake
address instead of an invalid fake address doesn’t improve your data
at all. The only reason I can see that being necessary is to prevent
malformed addresses from breaking your application in some way. But if
that’s a problem, fix the application, not the email address.

I totally agree with you.

lukfugl · January 4, 2006, 11:32pm

On Jan 4, 2006, at 1:27 PM, Bill K. wrote:

Personally, right or wrong, to catch that I just reject
email addresses with a “<” or “>” in them. I’ll admit I
don’t really care if some spec says it’s possible to legally
form email addresses with those characters. That may make
me a bad person.

It doesn’t make you a bad person but it certainly makes
your application less interoperable than it might be.
For example, the vast majority of email addresses on this
mailing list are of the form:

Bill K. <[email protected]>

In some GUI environments it is harder to select the portion
between the <>'s than to select the entire address.

From RFC 1123:

At every layer of the protocols, there is a general rule whose
application can lead to enormous benefits in robustness and
interoperability:

 "Be liberal in what you accept, and conservative
  in what you send"

Gary W.
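
In that spirit, a form handler could accept the pasted "Name <address>" form and keep only the part between the angle brackets. This is a rough sketch with a hypothetical helper name (bare_address) and placeholder addresses; it does no validation of its own and only handles the simple case of one bracketed address at the end of the input.

```ruby
# Hypothetical helper: accept "addr" or "Display Name <addr>" and return the bare address.
def bare_address(input)
  text = input.strip
  if (m = text.match(/<([^<>]+)>\s*\z/))
    m[1].strip   # the part between the angle brackets
  else
    text         # no brackets: assume the input is already a bare address
  end
end

puts bare_address("Bill K. <bill@example.com>")
puts bare_address("  bill@example.com ")
```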