Converting uploaded HTML files into UTF8


#1

I’m writing a rails app that allows an admin to upload files that will
be searched by users. These files may be in text or html format, and
frequently are in different charsets. For uniform presentation, I’d
like to convert everything to UTF8. However, I’m not sure how best to
detect the format the uploaded docs are in. I noticed that the ruby
iconv library needs to know what format you are converting from. Does
anyone have any good ways to detect a document’s charset?

While we’re on the topic of conversion, I’ve also written a simple
html_to_text conversion routine:

def html_to_text(html)
  html.gsub!(/<\s*?script[^>]*?>.*?<\s*?\/script\s*?>/m, '') # remove javascript
  html.gsub!(/<[\/!]?[^<>]*?>/m, '')                         # remove html tags
  html.gsub!(/([\r\n])[\s]+/m, '\1')                         # remove white space
  html.gsub!(/&(quot|#34);/m, '"')                           # convert symbols
  html.gsub!(/&(amp|#38);/m, '&')
  html.gsub!(/&(lt|#60);/m, '<')
  html.gsub!(/&(gt|#62);/m, '>')
  html.gsub!(/&(nbsp|#160);/m, ' ')
  html.gsub!(/&(iexcl|#161);/m, "\xA1")
  html.gsub!(/&(cent|#162);/m, "\xA2")
  html.gsub!(/&(pound|#163);/m, "\xA3")
  html.gsub!(/&(copy|#169);/m, "\xA9")
  html.gsub!(/&#(\d+);/m) {|s| [$1.to_i].pack('c') }
  html.strip!
  html
end

Is there a better ruby library for this that tries to preserve
structure more, or should I just stick with this approach?

Thanks,
Carl


#2

On 19-nov-2005, at 18:38, Carl Y. wrote:

I’m writing a rails app that allows an admin to upload files that will
be searched by users. These files may be in text or html format, and
frequently are in different charsets. For uniform presentation, I’d
like to convert everything to UTF8. However, I’m not sure how best to
detect the format the uploaded docs are in. I noticed that the ruby
iconv library needs to know what format you are converting from. Does
anyone have any good ways to detect a document’s charset?
No, there is no general way. This is one of the reasons Unicode was
designed: one of its special "perks" is that you can detect reasonably
well whether a document is in Unicode (the probability of a random
sequence of bytes being valid Unicode is very low).

The only thing you can do is say whether a document IS in Unicode (you
have to know the encoding form, though) or not. After that you need to
fall back to the charset you are most likely to expect on input.

What I would do if I was you:

  1. In the upload form, add a checkbox that says "This file uses
     Unicode" - people who need to know DO know what it means.
  2. Check the checkbox by default.
  3. When the file is uploaded, convert it from ISO to Unicode if the
     checkbox was unchecked. If the HTML document contains a charset
     directive, decode it according to that directive. You can also use
     the UTF-8 sanity regex - if the uploaded document matches it, you
     can save it as UTF-8 without conversion.
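The "UTF-8 sanity regex" mentioned above can be sketched like this. This is the commonly used UTF-8 validity pattern, written for modern Ruby; treat it as an illustration of the idea, not a quote of any specific library's code:

```ruby
# Returns true if the byte string is structurally valid UTF-8.
# The pattern enumerates the legal UTF-8 byte sequences and is
# anchored over the whole string; random non-UTF-8 bytes almost
# never match, which is exactly the "sanity check" property.
def looks_like_utf8?(bytes)
  utf8_pattern = /\A(?:
      [\x00-\x7F]                        # plain ASCII
    | [\xC2-\xDF][\x80-\xBF]             # 2-byte sequences
    | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte, planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte, planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte, plane 16
  )*\z/xn
  !!(bytes.dup.force_encoding(Encoding::BINARY) =~ utf8_pattern)
end
```

If the uploaded bytes match, you can store them as UTF-8 directly; otherwise fall back to converting from the assumed charset.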
html.gsub!(/&(quot|\#34);/m, '"')
html.strip!
html

end

Is there a better ruby library for this that tries to preserve
structure more, or should I just stick with this approach?

Look in the tag removal routines in the Rails source. But I think we
currently don’t have such a library - partly because it is unclear
how you can handle HTML specifics (such as tables etc.)

html.gsub!(/&\#(\d+);/m) {|s| [$1.to_i].pack('c') }

So you say that you want your documents to be unicode and you do
this? Strange.


#3

Thanks for the tips. I can’t rely on the uploaded documents having
similar HTML format, so I convert them to plain text files first. As
I understand it, even after converting to plaintext, it would still be
possible to have characters that didn’t get displayed properly because
they used a different charset, so therefore I want to convert to UTF8
after stripping out all HTML tags. Please let me know if this is
an incorrect assumption.


#4

On 19-nov-2005, at 20:18, Carl Y. wrote:

Thanks for the tips. I can’t rely on the uploaded documents having
similar HTML format, so I convert them to plain text files first. As
I understand it, even after converting to plaintext, it would still be
possible to have characters that didn’t get displayed properly because
they used a different charset, so therefore I want to convert to UTF8
after stripping out all HTML tags. Please let me know if this is
an incorrect assumption.
Hmm. No, this assumption is not correct. First you have to get your
text into the right encoding, and only then perform transforms on it.
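The order of operations being recommended here can be sketched as follows. The thread predates it, but in modern Ruby `String#encode` replaced the Iconv library; the Latin-1 fallback and the naive tag regex are illustrative assumptions, not code from the thread:

```ruby
# A minimal sketch: decode FIRST, then transform.
# 1. Bring the raw upload bytes into UTF-8 (falling back to an
#    assumed charset - here ISO-8859-1 - when they are not UTF-8).
# 2. Only then strip tags or otherwise rewrite the text.
def normalize_upload(raw_bytes, fallback: Encoding::ISO_8859_1)
  text = raw_bytes.dup.force_encoding(Encoding::UTF_8)
  unless text.valid_encoding?
    # Not valid UTF-8: assume the fallback charset and convert.
    text = raw_bytes.dup.force_encoding(fallback).encode(Encoding::UTF_8)
  end
  text.gsub(/<[^<>]*>/m, '') # naive tag stripping, done AFTER decoding
end
```

Stripping tags before decoding risks corrupting multi-byte sequences or misreading bytes from the wrong charset, which is why the decode step has to come first.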

On 11/19/05, Julian ‘Julik’ Tarkhanov removed_email_address@domain.invalid wrote:

html.gsub!(/&\#(\d+);/m) {|s| [$1.to_i].pack('c') }

What I meant when quoting this fragment is that this kind of
conversion will only get you ASCII characters. You should use the
UTF-8 pack directive ("U") to get all the chars.
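The difference can be seen directly (a small illustration, not code from the thread):

```ruby
# pack('c') truncates the codepoint to a single byte, so any entity
# outside ASCII is mangled; pack('U') emits the proper UTF-8 bytes.
entity = "&#8364;" # EURO SIGN, codepoint 8364
cp = entity[/\d+/].to_i

broken = [cp].pack('c') # one truncated byte (8364 modulo 256)
fixed  = [cp].pack('U') # the three UTF-8 bytes of "€"
```

So the entity-decoding line from the original routine would become `html.gsub!(/&#(\d+);/m) { [$1.to_i].pack('U') }`.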