Upload UTF-8 encoded textfile

My Rails application (Rails 4.1, Ruby 2.1.1) lets the user upload a
file. This file is then parsed by the application, and after the
parsing is done, it is deleted from the upload area.

So far, I have the following:

In my upload form, I have

<%= file_field_tag :upload, {accept: 'text/plain', class: 'file_upload'} %>

In my controller, params[:upload] contains an object of class Tempfile,
which is already opened for reading. I am using #readline to read
through this file.

The problem now is that the file is UTF-8 encoded, and as soon as a
line being read contains a character which isn’t also a 7-bit ASCII
character, I get an exception.

BTW, I also tried the following approach (in my controller):

tempf=params[:upload]
tempf.set_encoding('BOM|UTF-8')

However, this caused an exception

code converter not found (UTF-8 to UTF-8)

which I find somewhat strange, because I can set the encoding in this
way for a file opened with File.open(…).

What is the best way to read an uploaded UTF-8 file?

I was already thinking along the following line: The Tempfile class also
has a method #path, which returns the path of the uploaded file. I could
create a File object by opening this path, specify utf8 when opening it,
and read from this.
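
The reopen-by-path idea might look like the following minimal sketch. The helper name each_utf8_line is hypothetical, and a locally created Tempfile stands in for the upload; in the controller the path would come from the #path method mentioned above. The 'BOM|UTF-8' mode asks Ruby to skip a leading BOM if there is one.

```ruby
require 'tempfile'

# Hypothetical helper: reopen the file at +path+ as UTF-8 text and
# yield each line. "BOM|UTF-8" skips a leading byte order mark if present.
def each_utf8_line(path)
  File.open(path, 'r:BOM|UTF-8') do |f|
    f.each_line { |line| yield line }
  end
end

# Stand-in for the uploaded Tempfile: raw UTF-8 bytes preceded by a BOM.
upload = Tempfile.new('upload')
upload.binmode
upload.write("\xEF\xBB\xBFGr\xC3\xBC\xC3\x9Fe\nzweite Zeile\n".b)
upload.flush

lines = []
each_utf8_line(upload.path) { |line| lines << line.chomp }
upload.close! # delete the temporary file once parsing is done
```

The same call also reads a file without a BOM, so it does not matter which kind of file was uploaded.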

However, since this problem must occur quite frequently, I wonder
whether there is a way (maybe in the file_field_tag) to tell Rails that
the Tempfile object should be opened as utf8 for reading. Is this
possible, or is there another good way to deal with this problem?

In this case, it is pretty certain that every file will contain UTF-8
characters, and in general, I think the cases are few where we can
assume input to be represented by 7-bit ASCII.

What I do not know for sure is whether or not the file will have a BOM,
but I think Ruby can figure this out automatically, when supplying the
“BOM” option on opening.

It would also make sense to allow files using different encodings, such
as UTF-16, but this is something I will have to deal with later.

The Stack Overflow link you presented doesn’t really answer my problem,
though. It just describes how I can open a UTF-8 file, and this is
the workaround I’m using in the meantime (as outlined in my posting
where I say: “I could create a File object by opening…”).

What I would like to know is, whether there is a simpler way (since the
file, after all, is already opened when my controller is entered), and
in particular why set_encoding doesn’t work for my Tempfile object,
even though this would work well for a File object.

Gotcha. Is the file actually opened when the controller is entered?
(That’s an honest question; I’m interested in how that works for an
upload coming from a form.) The approach you described, which I failed
to understand the first time, seems like the best way to me, but I’d be
interested to see what others have to say.

Sorry I couldn’t be of more help.

Maybe try this?

Does it matter if every file is considered UTF-8 even if it never
contains a UTF-8 character?

Yes, it is, as I found by trial-and-error. Note that the object is not
just a File, it is of class Tempfile. I think this is quite common when
working with a Tempfile object. To make a Tempfile threadsafe, you have
to combine the creation of the filename and the creation of the file
into one call (otherwise you have a race condition if another process
tries to create a tempfile in the same directory and by accident comes
up with the same name).

While I didn’t dive into the source code to see how Rails is
implemented in this respect, it would be reasonable to assume that for
the upload, a Tempfile object is created for read+write, the uploaded
file is written to it, and the file pointer is repositioned at the
beginning of the file before it is handed over to the controller. Since
the uploading process can’t know anything about the encoding, the file
must have been opened as a binary file. That’s why I had the idea that I
just need to set the encoding to the desired value before starting to
read from the file.

Does setting

config.encoding = "utf-8"

in your config/application.rb help? You’d also need to add

# encoding: UTF-8

to the top of your file.

I was reading this, which seems to discuss this problem.

As far as I understand this article, it relates to Rails 3 and MySQL,
and how to use UTF-8 encoded data everywhere. I don’t know about MySQL,
but Rails 4 and Ruby 2 with SQLite don’t suffer from this problem: I
didn’t have any trouble processing all kinds of Unicode characters with
my application, and processing the uploaded file also works fine, as
long as I use my (not very elegant) trick of opening it a second time
with the desired encoding.

It now occurs to me that the question is maybe not Rails-specific, but
a general Ruby question: how to change the encoding of a Tempfile
object.

Colin,

That shows how to create a Tempfile with a given encoding, but the
question is: when a user uploads a file through a form and Rails
creates a Tempfile, is there a way to indicate that it should always
create those Tempfiles with a default encoding such as UTF-8?

On 17 July 2014 15:42, Eric S. wrote:

Colin,

That shows how to create a Tempfile with a given encoding but the question
is when a user uploads a file through a form and Rails creates a Tempfile is
there a way to indicate that it should always create those Tempfiles with a
default encoding such as UTF-8?

In that case it is a Rails specific issue, not a Ruby question as
suggested by the OP.

Colin

On 17 July 2014 10:22, Ronald F. wrote:

object.
I have not been following this thread in detail, but [1] discusses how
to use the encoding option when creating a Tempfile.

[1]

Colin

In that case, is there something like a simple config change to be made
in, say, config/application.rb that tells Rails how to encode created
Tempfiles? If not, would that be something that could/should be added
to the Rails project itself?

OK, so after some digging, it seems that when you create your new File
object and set the encoding, you may not need to read the Tempfile in
its entirety. You can create a new File object from the Tempfile with
File.new(my_temp_file, encoding: 'utf-8') and then use this file. It
should be using the Tempfile and just creating a new pointer to that
file with a new encoding. If you wanted to read the lines out
individually and just use that original Tempfile, you could use
force_encoding('utf-8') on each line to make sure they are treated as
UTF-8.
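
The per-line variant could be sketched as follows. Note that force_encoding only relabels the bytes of a string and does not transcode them, so checking valid_encoding? afterwards is a cheap sanity test. The helper name is hypothetical, and the sketch assumes the handle may be switched to binary mode and rewound before reading.

```ruby
require 'tempfile'

# Hypothetical helper: read raw bytes line by line, then label each line
# as UTF-8 and reject the upload if the bytes are not valid UTF-8.
def utf8_lines_from_binary(io)
  io.binmode # raw bytes, so no encoding checks happen while reading
  io.rewind
  io.each_line.map do |raw|
    line = raw.force_encoding(Encoding::UTF_8)
    raise ArgumentError, 'upload is not valid UTF-8' unless line.valid_encoding?
    line.chomp
  end
end

# Stand-in for the uploaded Tempfile.
sample = Tempfile.new('upload')
sample.binmode
sample.write("\xC3\x9Cber\nStra\xC3\x9Fe\n".b)

forced_lines = utf8_lines_from_binary(sample)
sample.close!
```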

Colin L. wrote in post #1152686:

On 17 July 2014 15:42, Eric S. wrote:

That shows how to create a Tempfile with a given encoding but the question
is when a user uploads a file through a form and Rails creates a Tempfile is
there a way to indicate that it should always create those Tempfiles with a
default encoding such as UTF-8?

In that case it is a Rails specific issue, not a Ruby question as
suggested by the OP.

Indeed, you are right insofar as it might be a Rails question. Still,
I wonder why (in general) it is not possible to change the encoding of
an existing (already open) Tempfile. Assuming that it is OK to “rewind”
the file, I don’t see a technical reason why this should not be possible.
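
One experiment that may be worth trying, offered as a sketch and not as a confirmed fix: rewind the handle and call set_encoding with plain UTF-8, without the BOM| prefix that was used when the exception occurred. Nothing in this thread establishes whether that avoids the error on Ruby 2.1.1, so that part is an assumption to verify. The helper name is hypothetical, and a locally created Tempfile stands in for the upload.

```ruby
require 'tempfile'

# Hypothetical helper: rewind an already open handle and set its
# external encoding to UTF-8 before reading. Whether this avoids the
# exception described in the thread on Ruby 2.1.1 is an assumption.
def read_lines_as_utf8(io)
  io.rewind
  io.set_encoding(Encoding::UTF_8)
  io.each_line.map(&:chomp)
end

# Stand-in for the uploaded Tempfile, written in binary mode.
tempf = Tempfile.new('upload')
tempf.binmode
tempf.write("na\xC3\xAFve\ncaf\xC3\xA9\n".b)

utf8_lines = read_lines_as_utf8(tempf)
tempf.close!
```

A leading BOM is not handled here; it would show up as U+FEFF at the start of the first line.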

I don’t think it would be a good idea to configure this on the Rails
side. Imagine the following scenario: We have a website which allows
users to upload text files, the content of which will eventually go into
the database. Since we are generous about the encoding, we also provide
the user with a dropdown list to choose a suitable encoding.

When the user clicks the upload button, the controller gets the uploaded
file plus information about the encoding. Clearly, Rails cannot
anticipate the encoding of the file. It can only upload the file
(as binary) and provide the controller with an open file handle.
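
For the dropdown scenario, the controller-side handling might be sketched like this. The whitelist, the constant name and the helper name are all hypothetical. The sketch reads the raw bytes, labels them with the encoding the user chose, and transcodes to UTF-8; encode raises an exception if the bytes are not valid in the chosen encoding.

```ruby
require 'tempfile'

# Hypothetical whitelist matching the entries of the dropdown list.
ALLOWED_UPLOAD_ENCODINGS = %w[UTF-8 ISO-8859-1 UTF-16LE UTF-16BE].freeze

# Hypothetical helper: read the file as bytes, interpret them in the
# user-selected encoding, and return the content transcoded to UTF-8.
def read_upload_as_utf8(path, encoding_name)
  unless ALLOWED_UPLOAD_ENCODINGS.include?(encoding_name)
    raise ArgumentError, "unsupported encoding: #{encoding_name}"
  end
  raw = File.binread(path)
  raw.force_encoding(Encoding.find(encoding_name)).encode(Encoding::UTF_8)
end

# Stand-in for an uploaded Latin-1 file.
latin1 = Tempfile.new('upload')
latin1.binmode
latin1.write("caf\xE9\n".b)
latin1.flush

text = read_upload_as_utf8(latin1.path, 'ISO-8859-1')
latin1.close!
```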

Now Ruby does have the set_encoding method for File, and Tempfile is-a
file, and set_encoding can be called - it just fails. We have nearly
everything in place. Now, if we can find out WHY set_encoding fails (and
this might be a generic Ruby question), we can find out what Rails (or
the programmer) can do to let things go smoothly…

Ronald

I think the idea of first reading the lines and then using
force_encoding on the strings would not work, for two reasons:

  1. As I have experienced, I already get the exception on the first byte
    which has the high-bit set (i.e. is not 7-bit ASCII)

  2. If a UTF-8 encoded character contained 0x0d or 0x0a, reading the line
    without being aware of the encoding would “split” the character into
    two parts.
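
On the second point, it may help to note that UTF-8 itself rules this out: every byte of a multi-byte UTF-8 sequence has the high bit set (0x80 or above), so a 0x0a or 0x0d byte is always a real line feed or carriage return. The concern does apply to encodings such as UTF-16. A small check, with sample characters chosen only for illustration:

```ruby
# Bytes of a few multi-byte characters: none is below 0x80, so none can
# be mistaken for LF (0x0a) or CR (0x0d) when splitting on raw bytes.
samples = ["\u00E9", "\u20AC", "\u{1F600}"]
utf8_bytes = samples.flat_map(&:bytes)
all_high = utf8_bytes.all? { |b| b >= 0x80 }

# In UTF-16LE, by contrast, U+010A is encoded as the bytes 0x0a 0x01,
# so it does contain a raw line-feed byte.
utf16_bytes = "\u010A".encode(Encoding::UTF_16LE).bytes
```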

In addition, this solution would not account for a BOM (unless I write
special logic to extract an optional BOM on the first line being read).
Although I have only files without BOM at the moment, it is likely that
sooner or later I will also have to support uploading of files which
contain a BOM.
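
If the lines are read as raw bytes, the special logic for an optional BOM is small. A minimal sketch, with hypothetical constant and helper names, that would be applied to the first line only:

```ruby
# The UTF-8 byte order mark as a binary string.
UTF8_BOM_BYTES = "\xEF\xBB\xBF".b.freeze

# Hypothetical helper: drop a leading UTF-8 BOM from the first line of a
# file, if present, and return the remainder labelled as UTF-8.
def strip_utf8_bom(first_line)
  bytes = first_line.b # binary copy, the argument is left untouched
  if bytes.start_with?(UTF8_BOM_BYTES)
    bytes = bytes.byteslice(UTF8_BOM_BYTES.bytesize..-1)
  end
  bytes.force_encoding(Encoding::UTF_8)
end
```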

Ronald