Forum: Ruby on Rails Converting uploaded HTML files into UTF8

carl (Guest)
on 2005-11-19 18:41
(Received via mailing list)
I'm writing a rails app that allows an admin to upload files that will
be searched by users.  These files may be in text or html format, and
frequently are in different charsets.  For uniform presentation, I'd
like to convert everything to UTF8.  However, I'm not sure how best to
detect the format the uploaded docs are in.  I noticed that the ruby
iconv library needs to know what format you are converting from.  Does
anyone have any good ways to detect a document's charset?

While we're on the topic of conversion, I've also written a simple
html_to_text conversion routine:

  def html_to_text(html)
    html.gsub!(/<\s*?script[^>]*?>.*?<\s*?\/script\s*?>/m, '')  # remove javascript
    html.gsub!(/<[\/\!]*?[^<>]*?>/m, '')                        # remove html tags
    html.gsub!(/([\r\n])[\s]+/m, '\1')                          # remove white space
    html.gsub!(/&(quot|\#34);/m, '"')                           # convert symbols
    html.gsub!(/&(amp|\#38);/m, '&')
    html.gsub!(/&(lt|\#60);/m, '<')
    html.gsub!(/&(gt|\#62);/m, '>')
    html.gsub!(/&(nbsp|\#160);/m, ' ')
    html.gsub!(/&(iexcl|\#161);/m, "\161")
    html.gsub!(/&(cent|\#162);/m, "\162")
    html.gsub!(/&(pound|\#163);/m, "\163")
    html.gsub!(/&(copy|\#169);/m, "\169")
    html.gsub!(/&\#(\d+);/m) {|s| [$1.to_i].pack('c') }
    html.strip!
    html
  end

Is there a better ruby library for this that tries to preserve
structure more, or should I just stick with this approach?

Thanks,
Carl
listbox (Guest)
on 2005-11-19 19:30
(Received via mailing list)
On 19-nov-2005, at 18:38, Carl Youngblood wrote:

> I'm writing a rails app that allows an admin to upload files that will
> be searched by users.  These files may be in text or html format, and
> frequently are in different charsets.  For uniform presentation, I'd
> like to convert everything to UTF8.  However, I'm not sure how best to
> detect the format the uploaded docs are in.  I noticed that the ruby
> iconv library needs to know what format you are converting from.  Does
> anyone have any good ways to detect a document's charset?
No, there is no such way. This is one of the reasons Unicode was
designed: one of its special "perks" is that you can detect reasonably
well whether a document is in Unicode (the probability of a random
sequence of bytes matching Unicode is very low).

The only thing you _can_ do is say whether a document IS in Unicode
or not (you have to know the encoding form, though). After that you
need to fall back to the charset that you are most likely to expect
on input.

What I would do if I were you:

1. In the upload form, add a checkbox that says "This file uses
Unicode". People who need to know DO know what it means.
2. Turn the checkbox on by default.
3. When the file is uploaded, convert it from an ISO charset (e.g.
ISO-8859-1) to Unicode if the checkbox was unchecked. If the HTML
document contains a charset directive, decode it according to that
directive. You can also use the UTF-8 sanity regex: if the uploaded
document matches it, you can save it as UTF-8 without any conversion.
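
The steps above could be sketched roughly like this. It is only a sketch under some assumptions that are not from this thread: it targets Ruby 1.9 or later, where String#encode and String#valid_encoding? take over the job of the iconv library; the names to_utf8, declared_charset and FALLBACK_CHARSET are made up for illustration; and the checkbox is left out because the validity check is applied to every upload anyway.

```ruby
# Assumption: ISO-8859-1 is the charset most likely to arrive when the
# upload is not UTF-8. Pick whatever fits your users.
FALLBACK_CHARSET = 'ISO-8859-1'

# Hypothetical helper: look for a charset directive such as
#   <meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
# near the start of the document. Returns the charset name or nil.
def declared_charset(raw)
  head = raw.byteslice(0, 2048).force_encoding(Encoding::BINARY)
  head[/charset\s*=\s*["']?([A-Za-z0-9._:-]+)/i, 1]
end

# Hypothetical helper: return the uploaded bytes as a UTF-8 String.
def to_utf8(raw)
  bytes = raw.dup.force_encoding(Encoding::BINARY)

  # UTF-8 sanity check: bytes in another charset rarely validate as UTF-8.
  as_utf8 = bytes.dup.force_encoding(Encoding::UTF_8)
  return as_utf8 if as_utf8.valid_encoding?

  # Otherwise trust the document's own directive if Ruby knows the name.
  name = declared_charset(bytes)
  source = begin
    name ? Encoding.find(name) : nil
  rescue ArgumentError
    nil # unknown charset name
  end

  # No usable directive (or it claims UTF-8, which just failed validation):
  # fall back to the charset we expect most often.
  if source.nil? || source == Encoding::UTF_8
    source = Encoding.find(FALLBACK_CHARSET)
  end

  bytes.force_encoding(source)
       .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end
```

Note that a file in the fallback charset can, in rare cases, also happen to be valid UTF-8, so the checkbox is still useful as a hint for the admin.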


>     html.gsub!(/&(quot|\#34);/m, '"')                           #
>     html.strip!
>     html
>   end
>
> Is there a better ruby library for this that tries to preserve
> structure more, or should I just stick with this approach?

Look at the tag removal routines in the Rails source. But I think we
currently don't have such a library, partly because it is unclear
how you would handle HTML specifics (such as tables).

>     html.gsub!(/&\#(\d+);/m) {|s| [$1.to_i].pack('c') }

So you say that you want your documents to be Unicode and you do
this? Strange.
carl (Guest)
on 2005-11-19 20:19
(Received via mailing list)
Thanks for the tips.  I can't rely on the uploaded documents having
similar HTML format, so I convert them to plain text files first.  As
I understand it, even after converting to plaintext, it would still be
possible to have characters that didn't get displayed properly because
they used a different charset, so therefore I want to convert to UTF8
after stripping out all HTML tags.  Please let me know if this is
an incorrect assumption.
listbox (Guest)
on 2005-11-19 21:07
(Received via mailing list)
On 19-nov-2005, at 20:18, Carl Youngblood wrote:

> Thanks for the tips.  I can't rely on the uploaded documents having
> similar HTML format, so I convert them to plain text files first.  As
> I understand it, even after converting to plaintext, it would still be
> possible to have characters that didn't get displayed properly because
> they used a different charset, so therefore I want to convert to UTF8
> after stripping out all HTML tags.  Please let me know if this is
> an incorrect assumption.
Hmm. No, this assumption is not correct. First you have to get your
text into the right encoding, and only then perform transforms on it.
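
To illustrate the ordering, here is a minimal sketch. It assumes Ruby 1.9 or later and an upload whose charset is already known; the method name plain_text_from_upload is hypothetical and the tag regex is a simplified stand-in for the one in the original post.

```ruby
# Hypothetical pipeline: transcode first, transform second.
def plain_text_from_upload(raw, source_charset)
  # Step 1: get the text into the right encoding (UTF-8 here).
  text = raw.dup.force_encoding(source_charset).encode('UTF-8')
  # Step 2: only now apply text transforms such as tag stripping.
  text.gsub(/<[^<>]*>/m, '').strip
end

upload = "<p>caf\xE9</p>".b # ISO-8859-1 bytes as they might arrive
puts plain_text_from_upload(upload, 'ISO-8859-1') # prints "café"
```

Doing it the other way round is fragile: in current Ruby versions, running a regex over a string that is labelled UTF-8 but contains bytes from another charset typically fails with an ArgumentError about an invalid byte sequence.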

>
> On 11/19/05, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:
>>>     html.gsub!(/&\#(\d+);/m) {|s| [$1.to_i].pack('c') }

What I meant when quoting this fragment is that this kind of
conversion will only give you single-byte (ASCII) characters. You
should use the UTF pack directive ("U") to get all the chars.
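
A small sketch of the difference, assuming Ruby 1.9 or later with UTF-8 strings. The names NAMED_ENTITIES and decode_entities are hypothetical and the entity table only covers the symbols from the original routine. Two side notes on that routine: escapes such as "\161" are octal in Ruby, so they do not produce character 161, and decoding &amp; before the other entities turns "&amp;lt;" into "<" by decoding twice. A single pass avoids both.

```ruby
# pack('c') emits one raw byte; pack('U') emits the code point as UTF-8.
raw_byte = [233].pack('c') # single byte 0xE9, not valid UTF-8 on its own
utf8_chr = [233].pack('U') # "é" encoded as the two bytes 0xC3 0xA9

# Only the named entities handled by the original html_to_text.
NAMED_ENTITIES = {
  'quot' => '"', 'amp' => '&', 'lt' => '<', 'gt' => '>', 'nbsp' => ' ',
  'iexcl' => "\u00A1", 'cent' => "\u00A2", 'pound' => "\u00A3",
  'copy' => "\u00A9"
}.freeze

# Decode numeric and known named entities in one pass, so that the
# output of one replacement is never decoded a second time.
def decode_entities(text)
  text.gsub(/&(?:#(\d+)|([A-Za-z]+));/) do |match|
    if $1
      [$1.to_i].pack('U') # raises RangeError for out-of-range numbers
    else
      NAMED_ENTITIES.fetch($2, match) # unknown names are left untouched
    end
  end
end

puts decode_entities('caf&#233; costs &#8364;5 &copy; 2005')
```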