Nginx doesn't handle different URL encodings well

Hello Igor, hello all,

Congratulations for your fantastic and neatly programmed web server.
It’s a pleasure to use it.

I have a problem with nginx not serving files with accented
characters when the submitted URL is UTF-8 encoded.

Here is my nginx.conf : http://nginx.pastebin.com/aB7XRLM3 It’s a home
webserver that is primarily used to serve stuff like holiday photos.

For example, I have a file called “été-2008.jpg” on my webserver. When I
request http://myserver/été-2008.jpg, depending on whether the “Always
send URLs as UTF-8” checkbox is checked or not in the Internet Explorer
advanced options, the file is correctly served, or not.

When the URL is Latin-1 encoded, the request sent is : GET
/%e9t%e9-2008.jpg ----> nginx resolves this to “été-2008.jpg”, the file is
served, OK
When the URL is UTF-8 encoded, the request sent is : GET
/%C3%A9t%C3%A9-2008.jpg ----> nginx resolves this to “été-2008.jpg”, and
the file is not served. (file not found)
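
To make the two requests concrete, here is a small Python sketch (not nginx code, and not part of the original message) of how one and the same path is percent-encoded differently depending on the charset the client picks. It only relies on `urllib.parse.quote` from the standard library; the variable names are illustrative.

```python
from urllib.parse import quote

# "/été-2008.jpg", written with escapes so the snippet does not depend on
# the encoding of the source file itself.
path = "/\u00e9t\u00e9-2008.jpg"

# Client encodes the path as Latin-1 before percent-encoding: one byte per "é".
latin1_url = quote(path, encoding="latin-1")

# Client encodes the path as UTF-8 before percent-encoding: two bytes per "é".
utf8_url = quote(path, encoding="utf-8")

print(latin1_url)  # /%E9t%E9-2008.jpg
print(utf8_url)    # /%C3%A9t%C3%A9-2008.jpg
```

The case of the hex digits is not significant, so `%E9` here is the same byte as `%e9` in the request above.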

Shouldn’t a fallback mechanism be implemented so that when a file isn’t
found after a URL has been decoded, a second try is made with another
encoding ? I believe two RFCs are involved : RFC 2396 and RFC 3986 (info
given by PiotrSikora on IRC). IMO, nginx shouldn’t assume that the URLs it
gets always follow the same RFC. From what I know, this ambiguity
is resolved in Apache. Maybe they have that sort of fallback mechanism.

Thanks to the IRC channel members who pointed me towards this mailing
list. I look forward to your reply in order to know what to do :)

Spassiba !

On Wed, 20 Oct 2010 21:23:46 -0400, Pierre-Marie B. wrote:

When the URL is Latin-1 encoded, the request sent is : GET
/%e9t%e9-2008.jpg ----> nginx resolves this to “été-2008.jpg”, the file
is served, OK
When the URL is UTF-8 encoded, the request sent is : GET
/%C3%A9t%C3%A9-2008.jpg ----> nginx resolves this to “été-2008.jpg”,
and the file is not served. (file not found)

I only spent about 5 minutes looking for this, so I could be totally
wrong:

In 0.8.53, src/http/ngx_http_parse.c:1220 appears to be the start of the
relevant code. On a quick scan, it looks like the percent-decoding is
hardcoded. (case sw_quoted, followed by case sw_quoted_second, inside a
switch loop)

Again, I am not too familiar with this source and only spent a few
minutes looking. So someone please correct me if I am mistaken.

helen

Posted at Nginx Forum:

On Wed, 20 Oct 2010 21:57:32 -0400, helen wrote:

I only spent about 5 minutes looking for this, so I could be totally
wrong:

In 0.8.53, src/http/ngx_http_parse.c:1220 appears to be the start of the
relevant code. On a quick scan, it looks like the percent-decoding is
hardcoded. (case sw_quoted, followed by case sw_quoted_second, inside a
switch loop)

Sorry to reply to my own post, but it looks like I am wrong; that looks
like where %xx is decoded only (duh). I am still following the chain to
where this is passed to the OS, and I don’t have time to look further
now.

helen

Hello Igor, hello all,

[…]

/%C3%A9t%C3%A9-2008.jpg ----> nginx resolves this to

[…]

The only (related to the question) difference between RFC2396 and

[…]

to do recoding between Latin1 and UTF-8. Note though that this
may lead to unexpected results: “/%C3%A9” may be Latin1 “/é” as
well as UTF-8 “/é”.

Yes, it makes sense. But shouldn’t nginx assume a UTF-8 encoding instead
of assuming a Latin-1 one ? Since in the future all URIs will adopt this
encoding method. IMO a request like GET /%C3%A9t%C3%A9-2008.jpg should
translate to /été-2008.jpg - and not the other way around, as is the
case currently.

Currently nginx assumes URLs are encoded in Latin1, whereas it should
assume they’re UTF-8 first. Don’t you think ?

Hello!

On Thu, Oct 21, 2010 at 04:25:52PM +0200, Pierre-Marie B. wrote:

[…]

Yes, it makes sense. But shouldn’t nginx assume a UTF-8 encoding
instead of assuming a Latin-1 one ? Since in the future all URIs
will adopt this encoding method. IMO a request like GET
/%C3%A9t%C3%A9-2008.jpg should translate to /été-2008.jpg - and
not the other way around, as is the case currently.

Currently nginx assumes URLs are encoded in Latin1, whereas it
should assume they’re UTF-8 first. Don’t you think ?

nginx doesn’t assume anything; it just passes the bytes it got in the
request to the filesystem.

Maxim D.
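
As an illustration of "just passes the bytes", the following Python sketch mimics the behaviour described here: percent-decode the request path to raw bytes and compare those bytes with the file name as stored on disk, with no charset involved. It is a simplified model, not nginx's implementation; `files_on_disk` and `lookup` are hypothetical names, and the directory is assumed to hold the name as Latin-1 bytes, as in the original report.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical directory content: the file name is stored as Latin-1 bytes
# (0xE9 for each "é"), which is what an 8-bit locale would have produced.
files_on_disk = {b"\xe9t\xe9-2008.jpg"}


def lookup(request_path):
    """Percent-decode the path to bytes and look for an exact byte match."""
    name = unquote_to_bytes(request_path).lstrip(b"/")
    return name in files_on_disk


latin1_hit = lookup("/%e9t%e9-2008.jpg")        # bytes E9 74 E9 ... match
utf8_hit = lookup("/%C3%A9t%C3%A9-2008.jpg")    # bytes C3 A9 74 C3 A9 ... do not

print(latin1_hit, utf8_hit)  # True False
```

If the name had been stored as UTF-8 bytes instead, the two results would simply be swapped.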

Hello!

On Thu, Oct 21, 2010 at 03:23:46AM +0200, Pierre-Marie B. wrote:

a home webserver that is primarily used to serve stuff like

[…]

the file is served, OK

[…]

Apache. Maybe they have that sort of fallback mechanism.

The only (related to the question) difference between RFC2396 and
RFC3986 is that the latter recommends using UTF-8 for new URI
schemes. There is no ambiguity between the two: the character set for
non-US-ASCII characters in http URLs isn’t defined (though most
browsers nowadays use UTF-8 by default).

The only solution is to provide correct URLs, i.e. already
encoded ones.

If you think that a “fallback mechanism” is a good idea - you may
implement one with the “try_files” directive and the embedded perl module
to do recoding between Latin1 and UTF-8. Note though that this
may lead to unexpected results: “/%C3%A9” may be Latin1 “/é” as
well as UTF-8 “/é”.

Maxim D.
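
The recoding step of such a fallback could look roughly like the sketch below. It is written in Python purely to show the logic and the ambiguity; it is not an nginx configuration and not the embedded perl module's API, and `candidate_paths` is a hypothetical helper name. The idea is to produce, in order, the byte paths a fallback would try.

```python
from typing import List
from urllib.parse import unquote_to_bytes


def candidate_paths(request_path: str) -> List[bytes]:
    """Return the byte paths to try, in order, for a percent-encoded path."""
    raw = unquote_to_bytes(request_path)
    candidates = [raw]  # first, the bytes exactly as the client sent them

    # Guess 1: the client sent UTF-8, but the name on disk is Latin-1.
    try:
        as_latin1 = raw.decode("utf-8").encode("latin-1")
        if as_latin1 not in candidates:
            candidates.append(as_latin1)
    except UnicodeError:
        pass  # not valid UTF-8, or not representable in Latin-1

    # Guess 2: the client sent Latin-1, but the name on disk is UTF-8.
    # Every byte string is valid Latin-1, so this guess can never be ruled out.
    as_utf8 = raw.decode("latin-1").encode("utf-8")
    if as_utf8 not in candidates:
        candidates.append(as_utf8)

    return candidates


print(candidate_paths("/%e9t%e9-2008.jpg"))
print(candidate_paths("/%C3%A9t%C3%A9-2008.jpg"))
```

The second call yields three candidates, because the bytes C3 A9 are both valid UTF-8 and valid Latin-1. That is the ambiguity mentioned above: the fallback cannot know which file the client meant if more than one of the candidates exists.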

On Thu, Oct 21, 2010 at 8:57 AM, helen wrote:

Except that it works the exact reverse on my side. Are you sure the
filename in the filesystem is stored in UTF-8?

Setting LANG to en_US.UTF-8 may help (e.g. “LANG=en_US.UTF-8 ls” in a
bash shell).


O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

[…]

setting LANG to en_US.UTF-8 may help. (eg. “LANG=en_US.UTF-8 ls” in a
bash shells)

Thanks for the tip. I followed your advice and tried many locale
combinations today.

Unfortunately none of them helped. I can’t use UTF-8 as locale because
FreeBSD’s FFS has no support for multibyte filenames. So if I want the
system “ls” command to output “été-2008.jpg” and not something weird, I
have to use one of the 8-bit locales. Currently my LANG is
fr_FR.ISO8859-15 (same as Latin-1 plus the €uro sign).

OK, let’s sum up :

  • nginx does no translation and the URL is directly passed as a request
    to the filesystem
  • the new standards say that URLs are going to be sent UTF-8 encoded
  • UTF-8 is a multibyte encoding scheme
  • my server’s filesystem supports several encoding schemes but not
    multibyte ones, and thus it doesn’t support UTF-8.

I guess I’ll have to go down the painful URL rewrite way. What a pity…

I’m quite new to nginx. Could someone suggest a config file syntax to
do this ?

Hello!

On Thu, Oct 21, 2010 at 11:09:59PM +0200, Pierre-Marie B. wrote:

[…]

not something weird, I have to use one of the 8-bit locales.
Currently my LANG is fr_FR.ISO8859-15 (same as Latin-1 plus the
€uro sign).

No, you don’t have to. You do, however, have to create the files under
the locale you set. File names are just bytes, and the locale defines the
charset that will be used to display them.
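
A short Python sketch of that point, under the assumption that the file name exists on disk either as ISO8859-15 bytes or as UTF-8 bytes: the bytes never change, only the charset used to display them does. `render` is a hypothetical helper that stands in for what a terminal or `ls` effectively does.

```python
# The same visible name, stored two different ways on disk.
name_latin1 = b"\xe9t\xe9-2008.jpg"          # created under an 8-bit locale
name_utf8 = b"\xc3\xa9t\xc3\xa9-2008.jpg"    # created under a UTF-8 locale


def render(name, charset):
    """Decode raw file name bytes the way a given display charset would."""
    return name.decode(charset, errors="replace")


# Matching charset: both byte strings display as the intended name.
ok_latin1 = render(name_latin1, "iso8859-15")
ok_utf8 = render(name_utf8, "utf-8")

# Mismatched charset: UTF-8 bytes shown as ISO8859-15 give two characters
# per accented letter; Latin-1 bytes shown as UTF-8 are invalid sequences
# and come out as U+FFFD replacement characters.
garbled_utf8 = render(name_utf8, "iso8859-15")
garbled_latin1 = render(name_latin1, "utf-8")
```

Here `ok_latin1` and `ok_utf8` are the same string even though the underlying bytes differ, which is why a correct-looking `ls` output says nothing about which bytes a web server will be asked to match.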

OK, let’s sum up :

  • nginx does no translation and the URL is directly passed as a request to the
    filesystem

Correct.

  • the new standards say that URLs are going to be sent UTF-8 encoded

Not correct. It’s not what standards say, it’s just what modern
browsers usually do by default.

  • UTF-8 is a multibyte encoding scheme

Correct.

  • my server’s filesystem support several encoding schemes but
    not multibyte ones, and thus it doesn’t support UTF-8.

Not correct. Your server is character set agnostic, as is
nginx.

I guess I’ll have to go down the painful URL rewrite way. What a pity…

As I already explained - this isn’t going to help in all cases.
The only safe approach is to use urlencoded links.

Maxim D.

On Fri, Oct 22, 2010 at 4:09 AM, Pierre-Marie B. wrote:

Thanks for the tip. I followed your advice and tried many locale combinations
today.

Unfortunately none of them helped. I can’t use UTF-8 as locale because FreeBSD’s
FFS has no support for multibyte filenames. So if I want the system “ls” command
to output “été-2008.jpg” and not something weird, I have to use one of the 8-bit
locales. Currently my LANG is fr_FR.ISO8859-15 (same as Latin-1 plus the €uro
sign).

except that I tried that in FreeBSD too.

To get proper ls output with utf-8 filename you have to:

  • ensure the filename is in utf-8
  • ensure shell locale is set to utf-8
  • ensure terminal locale is set to utf-8

One incorrect locale in the chain and you will get broken result.

Additionally, being able to see “été-2008.jpg” using a LANG other than
UTF-8 means that the filename is not stored in UTF-8, the shell locale is
not UTF-8, and the terminal locale is not UTF-8.

In that case, setting ONLY the shell locale to UTF-8 won’t help at all.
After setting the terminal and shell locales to UTF-8 you must rename the
file so that its name is stored as UTF-8.
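
The rename step can be scripted. The following is a minimal Python sketch, assuming the existing names were created under ISO8859-15 (the locale mentioned earlier in the thread); `to_utf8_name` and `rename_to_utf8` are hypothetical helper names, and the default is a dry run that only reports what would change.

```python
import os


def to_utf8_name(name: bytes, source_charset: str = "iso8859-15") -> bytes:
    """Return the UTF-8 form of a file name given as raw bytes."""
    try:
        name.decode("utf-8")
        return name  # already valid UTF-8 (or plain ASCII): leave untouched
    except UnicodeDecodeError:
        return name.decode(source_charset).encode("utf-8")


def rename_to_utf8(directory: bytes, dry_run: bool = True):
    """Rename entries in `directory` whose names are not valid UTF-8.

    Passing the directory as bytes makes os.listdir return raw byte names,
    so no locale-dependent decoding happens along the way.
    """
    planned = []
    for old in os.listdir(directory):
        new = to_utf8_name(old)
        if new != old:
            planned.append((old, new))
            if not dry_run:
                os.rename(os.path.join(directory, old),
                          os.path.join(directory, new))
    return planned
```

A call such as `rename_to_utf8(b"/path/to/photos")` (the path is a placeholder) returns the list of planned renames; passing `dry_run=False` performs them. One caveat of this heuristic: a legacy name whose bytes happen to form valid UTF-8 is left as it is, so the dry-run output is worth reviewing first.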




Unfortunately none of them helped. I can’t use UTF-8 as locale because
FreeBSD’s FFS has no support for multibyte filenames. So if I want the system “ls”
command to output “été-2008.jpg” and not something weird, I have to use one of the
8-bit locales. Currently my LANG is fr_FR.ISO8859-15 (same as Latin-1 plus the €uro
sign).

except that I tried that in FreeBSD too.

To get proper ls output with utf-8 filename you have to:

  • ensure the filename is in utf-8
  • ensure shell locale is set to utf-8
  • ensure terminal locale is set to utf-8

One incorrect locale in the chain and you will get broken result.

Could you clarify the last two points for me ?

I use this in /etc/profile (my shell is bash):

export LANG=fr_FR.UTF-8
export MM_CHARSET=UTF-8

According to the handbook there is no need to set LC_ALL and LC_CTYPE.
And my terminal is an SSH session (PuTTY), so I suppose I don’t need to
edit /etc/ttys.

Am I missing something ? Could you show me your env or tell me what else
I need to change ? It would help a lot. Thank you.

Am I missing something ? Could you show me your env or tell me what else I need
to change ? It would help a lot. Thank you.

OK, I found it. In PuTTY: Configuration, Window, Translation --> assume
incoming data as UTF-8.

Now “ls” displays my UTF-8 filenames correctly.

Thanks everybody for your help :) Case solved.