I was recently wondering if I should filter URLs by character, only allowing
what is standard in applications: words, numbers, and a couple of other
characters [.-_/]. We know the set of characters safe to use in URLs and
domains is really just a subset of ASCII
(http://perishablepress.com/stop-using-unsafe-characters-in-urls/).
However, I'm not totally sure what nginx does when I pass "µ" to it.
I came up with a simple regular expression to match something that isn’t
one of those:
location ~* "(UTF8)([^\p{L}\p{N}/.-%\]+)" ) {
if ($uri ~ “(*UTF8)([^\p{L}\p{N}/.-%\]+)” ) {
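
For context, a minimal server block using that test would look something like
this; the 444 is just a placeholder rejection (nginx's special code that drops
the connection without replying), not what I actually run:

    server {
        listen 80;
        server_name localhost;

        location / {
            # (*UTF8) switches PCRE into UTF-8 mode so \p{L}/\p{N} match
            # multibyte letters and digits; anything in the decoded path
            # outside letters, numbers and / . _ % - gets rejected.
            if ($uri ~ "(*UTF8)([^\p{L}\p{N}/._%-]+)") {
                return 444;
            }
        }
    }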
However, I'm wondering if I actually need the UTF-8 matching, since clients
should default to URL-encoding (%20) or hex-encoding (\x23) the bytes, and the
actual transfer should be binary anyway.
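
A quick way to compare the two forms on the command line (assuming nginx on
localhost:80; curl passes the path bytes through essentially as-is, whereas a
browser would percent-encode them):

    # raw UTF-8 bytes in the path
    curl -v 'http://localhost/µ'

    # the same character percent-encoded (0xC2 0xB5), as a browser would send it
    curl -v 'http://localhost/%C2%B5'

As far as I can tell, nginx decodes the percent-encoded form before populating
$uri, so the regex above would see the same bytes either way.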
Here is an example test where I piped almost all 65,000 Unicode code points to
nginx via curl:
https://gist.github.com/Xeoncross/acca3f09c5aeddac8c9f

For example: $ curl -v http://localhost/与
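
The sweep itself is roughly the following (a simplified sketch, not the gist's
actual script; assumes bash >= 4.2 for \U escapes and nginx on localhost:80):

    # walk U+00A1 .. U+FFFD, skipping the ASCII/control range
    for cp in $(seq 161 65533); do
        # the UTF-16 surrogate range has no valid UTF-8 encoding
        if (( cp >= 0xD800 && cp <= 0xDFFF )); then continue; fi
        ch=$(printf "\\U$(printf '%08X' "$cp")")
        curl -s -o /dev/null -w "U+$(printf '%04X' "$cp") %{http_code}\n" \
            "http://localhost/$ch"
    done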
Basically, is there any point in watching URLs for non-standard sequences,
looking for possible attacks?
(FYI: I posted more details that led to this question here:
http://stackoverflow.com/questions/28055909/does-nginx-support-raw-unicode-in-paths)