Converting uploaded HTML files into UTF8


#1

I’m writing a rails app that allows an admin to upload files that will
be searched by users. These files may be in text or html format, and
frequently are in different charsets. For uniform presentation, I’d
like to convert everything to UTF8. However, I’m not sure how best to
detect the format the uploaded docs are in. I noticed that the ruby
iconv library needs to know what format you are converting from. Does
anyone have any good ways to detect a document’s charset?

While we’re on the topic of conversion, I’ve also written a simple
html_to_text conversion routine:

def html_to_text(html)
  html.gsub!(/<\s*?script[^>]*?>.*?<\s*?\/script\s*?>/m, '') # remove javascript
  html.gsub!(/<[\/!]?[^<>]*?>/m, '')                         # remove html tags
  html.gsub!(/([\r\n])[\s]+/m, '\1')                         # remove white space
  html.gsub!(/&(quot|#34);/m, '"')                           # convert symbols
  html.gsub!(/&(amp|#38);/m, '&')
  html.gsub!(/&(lt|#60);/m, '<')
  html.gsub!(/&(gt|#62);/m, '>')
  html.gsub!(/&(nbsp|#160);/m, ' ')
  html.gsub!(/&(iexcl|#161);/m, "\xA1")
  html.gsub!(/&(cent|#162);/m, "\xA2")
  html.gsub!(/&(pound|#163);/m, "\xA3")
  html.gsub!(/&(copy|#169);/m, "\xA9")
  html.gsub!(/&#(\d+);/m) {|s| [$1.to_i].pack('c') }
  html.strip!
  html
end

Is there a better ruby library for this that tries to preserve
structure more, or should I just stick with this approach?

Thanks,
Carl


#2

On 19-nov-2005, at 18:38, Carl Y. wrote:

I’m writing a rails app that allows an admin to upload files that will
be searched by users. These files may be in text or html format, and
frequently are in different charsets. For uniform presentation, I’d
like to convert everything to UTF8. However, I’m not sure how best to
detect the format the uploaded docs are in. I noticed that the ruby
iconv library needs to know what format you are converting from. Does
anyone have any good ways to detect a document’s charset?
No, there is no general way. This is one of the reasons Unicode was
designed: one of its special "perks" is that you can detect reasonably
well whether a document is in Unicode (the probability of a random
sequence of bytes being valid Unicode is very low).

The only thing you can do is say whether a document IS in Unicode (you
have to know the encoding form, though) or not. After that you need to
fall back to the charset you are most likely to expect on input.

What I would do if I was you:

  1. In the upload form, add a checkbox that says "This file uses
     Unicode" - people who need to know DO know what it means.
  2. Check the checkbox by default.
  3. When the file is uploaded, convert it from ISO to Unicode if the
     checkbox was unchecked. If the HTML document contains a charset
     directive, decode it according to that directive. You can also use
     the UTF-8 sanity regex - if the uploaded document matches it, you
     can save it as UTF-8 without conversion.
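The "UTF-8 sanity regex" mentioned above can be sketched like this. This is the commonly used UTF-8 validity pattern, written for modern Ruby; treat it as an illustration of the idea, not a quote of any specific library's code:

```ruby
# Returns true if the byte string is structurally valid UTF-8.
# The pattern enumerates the legal UTF-8 byte sequences and is
# anchored over the whole string; random non-UTF-8 bytes almost
# never match, which is exactly the "sanity check" property.
def looks_like_utf8?(bytes)
  utf8_pattern = /\A(?:
      [\x00-\x7F]                        # plain ASCII
    | [\xC2-\xDF][\x80-\xBF]             # 2-byte sequences
    | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte, planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte, planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte, plane 16
  )*\z/xn
  !!(bytes.dup.force_encoding(Encoding::BINARY) =~ utf8_pattern)
end
```

If the uploaded bytes match, you can store them as UTF-8 directly; otherwise fall back to converting from the assumed charset.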
html.gsub!(/&(quot|\#34);/m, '"')
html.strip!
html

end

Is there a better ruby library for this that tries to preserve
structure more, or should I just stick with this approach?

Look in the tag removal routines in the Rails source. But I think we
currently don’t have such a library - partly because it is unclear
how you can handle HTML specifics (such as tables etc.)

html.gsub!(/&\#(\d+);/m) {|s| [$1.to_i].pack('c') }

So you say that you want your documents to be unicode and you do
this? Strange.


#3

Thanks for the tips. I can’t rely on the uploaded documents having
similar HTML format, so I convert them to plain text files first. As
I understand it, even after converting to plaintext, it would still be
possible to have characters that didn’t get displayed properly because
they used a different charset, so therefore I want to convert to UTF8
after stripping out all HTML tags. Please let me know if this is
an incorrect assumption.


#4

On 19-nov-2005, at 20:18, Carl Y. wrote:

Thanks for the tips. I can’t rely on the uploaded documents having
similar HTML format, so I convert them to plain text files first. As
I understand it, even after converting to plaintext, it would still be
possible to have characters that didn’t get displayed properly because
they used a different charset, so therefore I want to convert to UTF8
after stripping out all HTML tags. Please let me know if this is
an incorrect assumption.
Hmm. No, this assumption is not correct. First you have to get your
text into the right encoding, and only then perform transforms on it.
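The order of operations being recommended here can be sketched as follows. The thread predates it, but in modern Ruby `String#encode` replaced the Iconv library; the Latin-1 fallback and the naive tag regex are illustrative assumptions, not code from the thread:

```ruby
# A minimal sketch: decode FIRST, then transform.
# 1. Bring the raw upload bytes into UTF-8 (falling back to an
#    assumed charset - here ISO-8859-1 - when they are not UTF-8).
# 2. Only then strip tags or otherwise rewrite the text.
def normalize_upload(raw_bytes, fallback: Encoding::ISO_8859_1)
  text = raw_bytes.dup.force_encoding(Encoding::UTF_8)
  unless text.valid_encoding?
    # Not valid UTF-8: assume the fallback charset and convert.
    text = raw_bytes.dup.force_encoding(fallback).encode(Encoding::UTF_8)
  end
  text.gsub(/<[^<>]*>/m, '') # naive tag stripping, done AFTER decoding
end
```

Stripping tags before decoding risks corrupting multi-byte sequences or misreading bytes from the wrong charset, which is why the decode step has to come first.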

On 11/19/05, Julian ‘Julik’ Tarkhanov removed_email_address@domain.invalid wrote:

html.gsub!(/&\#(\d+);/m) {|s| [$1.to_i].pack('c') }

What I meant when quoting this fragment is that this kind of
conversion will only get you ASCII characters. You should use the
UTF-8 pack directive ("U") to get all the chars.
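The difference can be seen directly (a small illustration, not code from the thread):

```ruby
# pack('c') truncates the codepoint to a single byte, so any entity
# outside ASCII is mangled; pack('U') emits the proper UTF-8 bytes.
entity = "&#8364;" # EURO SIGN, codepoint 8364
cp = entity[/\d+/].to_i

broken = [cp].pack('c') # one truncated byte (8364 modulo 256)
fixed  = [cp].pack('U') # the three UTF-8 bytes of "€"
```

So the entity-decoding line from the original routine would become `html.gsub!(/&#(\d+);/m) { [$1.to_i].pack('U') }`.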