Danger to Nginx from raw unicode in paths?

I was recently wondering if I should filter URLs by character to only
allow what is standard in applications: words, numbers, and a couple of
characters [.-_/]. We know the set of characters supported in URLs and
domains is really just a subset of ASCII (see
http://perishablepress.com/stop-using-unsafe-characters-in-urls/).

However, I’m not totally sure what nginx does when I pass “µ” to it.

I came up with a simple regular expression to match something that isn’t
one of those:

location ~* "(*UTF8)([^\p{L}\p{N}/.%-]+)" {
if ($uri ~ "(*UTF8)([^\p{L}\p{N}/.%-]+)") {

However, I’m wondering if I actually need to use the UTF-8 matching,
since clients should default to URL encoding (%20) or hex encoding (\x23)
the bytes, and the actual transfer should be binary anyway.
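To make the encoding point concrete, here is a minimal Python sketch of what percent-encoding does to such a path. The helper name percent_encode_path is made up for illustration, and whether a particular client actually encodes the path before sending it is an assumption, not something verified here.

```python
from urllib.parse import quote, unquote_to_bytes


def percent_encode_path(path: str) -> str:
    """Percent-encode the UTF-8 bytes of a path, leaving '/' untouched.

    Hypothetical helper: it only illustrates what a well-behaved
    client might put on the wire.
    """
    return quote(path, safe="/")


encoded = percent_encode_path("/与")
print(encoded)                    # /%E4%B8%8E
print(encoded.isascii())          # True
print(unquote_to_bytes(encoded))  # b'/\xe4\xb8\x8e'
```

If the client encodes like this, the request line is plain ASCII and the server only sees the original bytes again after decoding.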

Here is an example test where I piped almost all 65,000 Unicode code
points to nginx via curl:

https://gist.github.com/Xeoncross/acca3f09c5aeddac8c9f

For example: $ curl -v http://localhost/与

Basically, is there any point in watching URLs for non-standard
sequences, looking for possible attacks?

(FYI: I posted more details that led to this question here:
http://stackoverflow.com/questions/28055909/does-nginx-support-raw-unicode-in-paths)

Hello!

In reference to your mail subject, one should note that “raw unicode”
does not exist. You should really understand what the term “Unicode”
means, what the abstract meaning of Unicode code points is, and what
UTF-8, for example, really is: it is just one of many possible ways to
encode characters into a raw byte representation. Again, there is no
such thing as “raw unicode”.
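As a small illustration of the difference between a code point and its byte representation, the following Python sketch prints the same character under three encodings. The function name describe is made up for this example.

```python
def describe(char: str) -> dict:
    """Return the code point of `char` and its bytes in three encodings."""
    return {
        "code_point": f"U+{ord(char):04X}",
        "utf-8": char.encode("utf-8").hex(" "),
        "utf-16-be": char.encode("utf-16-be").hex(" "),
        "utf-32-be": char.encode("utf-32-be").hex(" "),
    }


for ch in ("µ", "与"):
    print(ch, describe(ch))
```

The code point is the same abstract number in every row; only the byte sequence differs, and only bytes ever travel over the wire.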

Other than that, you have already received a good answer on Stack
Overflow. So, what is your question, exactly?

As stated on SO, for nginx, a location is just a sequence of bytes. You
surely understand that the space of byte sequences (given a certain
length) is larger than just the 65,000 items that you have worked with.
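A rough Python sketch of the size argument: it counts the byte strings of a given length and shows that some of them are not valid UTF-8 at all, so a test driven by code points never produces them. Both function names are made up for illustration.

```python
def count_byte_sequences(length: int) -> int:
    """Number of distinct byte strings of exactly `length` bytes."""
    return 256 ** length


def is_valid_utf8(raw: bytes) -> bool:
    """True if `raw` decodes as UTF-8, False otherwise."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


for n in (1, 2, 3, 4):
    print(n, count_byte_sequences(n))

print(is_valid_utf8("与".encode("utf-8")))  # True
print(is_valid_utf8(b"\xff\xfe"))           # False
```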

From my naive point of view I would say: no, there definitely is no
point in looking out for “non-standard” sequences in the most general
sense, because there are just too many of them. Having a proper
whitelist approach (specify those locations that should work in a
certain way, and reject all other requests) is a very safe concept.
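The whitelist idea can be sketched in a few lines of Python. This is only an illustration of the concept, not how nginx matches locations; the names ALLOWED_EXACT, ALLOWED_PREFIXES and is_allowed, as well as the example paths, are made up.

```python
# Hypothetical whitelist: paths are treated as plain byte sequences.
ALLOWED_EXACT = {b"/", b"/index.html"}
ALLOWED_PREFIXES = (b"/static/",)


def is_allowed(path: bytes) -> bool:
    """Accept only known paths; everything else is rejected."""
    if path in ALLOWED_EXACT:
        return True
    return path.startswith(ALLOWED_PREFIXES)


print(is_allowed(b"/index.html"))          # True
print(is_allowed(b"/static/app.css"))      # True
print(is_allowed("/与".encode("utf-8")))   # False
```

Nothing here needs to know which byte sequences are "unusual": anything not listed is rejected, whatever it contains.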

Cheers,

Jan-Philip


http://gehrcke.de