Won't display characters following '\267'

nik · June 23, 2009, 10:04pm

Hello!

I use MySQL and have made sure it is set to UTF-8, and the character
set in my view is also UTF-8. But when I display, in a textarea, text
that came from either antiword.exe or WIN32OLE output of an MS Word
document, the text stops showing immediately after a strange character
that appears in the Rails console as \267. I went back to Word to see
what it is (I looked it up by its position), and it is a dot floating
in the middle of the line, sort of like the one used in chapter
references in the Bible, like 12-7[dot]Matthew.

For example:
Rails Console:

doc = "This is a pipe, but \267 this is not a pipe"
HTML:

This is a pipe, but

It just sort of STOPS rendering the rest of the text.

I can’t possibly ask my clients to remove that character just for my
convenience. I have been on a 38-hour hunt for a solution.

Some say to remove everything that matches [^[:print:]], which I can
do while finding a way to at least preserve the \n and \r characters.
But then again, I want to preserve as much of the original document as
possible. I mean, what if they use umlauts, like an o with two dots on
top?

Any ideas?

Thank You!

nik · June 23, 2009, 10:30pm

You could try…

require 'iconv'

clean_str = Iconv.new('UTF-8//IGNORE', 'UTF-8').iconv(messy_str)

It doesn’t always work though… you might need to catch
Iconv::InvalidCharacter…

Worth a try though and has gotten me out of some of this mess with bad
source data.
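
For readers on newer Rubies: Iconv was removed from the standard
library in Ruby 2.0, so the snippet above will not load there without
the separate gem. A rough equivalent, assuming Ruby 2.1 or later (where
String#scrub exists), might look like the sketch below. The method name
drop_invalid_bytes is made up for illustration. Note that this only
drops the offending byte, it does not convert it into a real character.

```ruby
# Hypothetical helper (name is an assumption, not from the thread):
# treat the string as UTF-8 and remove any bytes that are not valid UTF-8.
def drop_invalid_bytes(str)
  str.dup.force_encoding('UTF-8').scrub('')
end

# "\267" is an octal escape for the single byte 0xB7.
messy_str = "This is a pipe, but \267 this is not a pipe"
clean_str = drop_invalid_bytes(messy_str)

puts clean_str.valid_encoding?
puts clean_str
```

The text after the bad byte survives, but the dot itself is gone, which
is why converting from the real source encoding is usually preferable.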

nik · June 24, 2009, 1:14am

Thanks Phillip for your help!

I just tried it and it works great! It displays that dot thing. But
now, because my regular expressions did not account for these
characters, some of them fail where these characters appear.

1 - I don’t even know what the right question to ask is… But what
do you call \267? Is it hex, octal, or decimal?

And 2 - How can I match that \267 character, or ‘dot’ as I call it?
And does it have a class name?

Lastly, 3 - By what charcode or other means can I systematically
identify accented characters, such as the accent grave in French?

Thank You!

nik · June 24, 2009, 5:53pm

On Jun 23, 2009, at 3:46 PM, Nik wrote:

Thanks Phillip for your help!

I just tried it and it works great! It displays that dot thing. But
now, because my regular expressions did not account for these
characters, some of them fail where these characters appear.

1 - I don’t even know what the right question to ask is… But what
do you call \267? Is it hex, octal, or decimal?

It’s unicode. A multi-byte, but single character.

And 2 - How can I match that \267 character, or ‘dot’ as I call it?
And does it have a class name?

By matching the unicode via \267 yourself. This might give some
insight… Unidecode!

Lastly, 3 - By what charcode or other means can I systematically
identify accented characters, such as the accent grave in French?

If the charcode is over what… 127 then it’s not simple ASCII…

You might also find this plugin useful -
http://github.com/rsl/stringex/tree

It will try and turn all that stuff into simple ASCII. You’ll lose
the accents, etc, but that might be okay for what you’re doing.
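
To make questions 2 and 3 concrete, here is a minimal sketch of
matching the dot and finding non-ASCII characters. It assumes Ruby 1.9
or later and that the text is already valid UTF-8, in which case the
dot is the character U+00B7 (MIDDLE DOT). The method names and the
sample string are made up for illustration.

```ruby
# Position of the first middle dot (U+00B7), or nil if there is none.
def middle_dot_index(text)
  text =~ /\u00B7/
end

# Every character above code point 127, i.e. anything that is not ASCII.
def non_ascii_chars(text)
  text.scan(/[^[:ascii:]]/)
end

# Letters outside ASCII. \p{L} matches any Unicode letter, so this also
# catches non-Latin letters, not only accented ones.
def non_ascii_letters(text)
  text.scan(/\p{L}/).reject(&:ascii_only?)
end

sample = "12-7\u00B7Matthew, voil\u00E0"
p middle_dot_index(sample)
p non_ascii_chars(sample)
p non_ascii_letters(sample)
```

The middle dot is punctuation rather than a letter, so it shows up in
the non-ASCII list but not in the letter list.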

nik · June 24, 2009, 5:49pm

You really need to translate the character encoding on that data -
Rails is assuming that it’s UTF-8, when (from your description of the
character) it’s either Windows-1252 or (possibly) ISO-8859-1. Your
previous problem was the default UTF-8 parser giving up, as the byte
\267 (octal; B7 hex) is only valid in UTF-8 inside a multibyte
sequence.
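
The point above can be checked in a console. This is only a sketch,
assuming Ruby 1.9 or later (strings with encodings) and assuming the
source really is Windows-1252; the method names are made up for
illustration.

```ruby
# True if the bytes are well-formed UTF-8.
def valid_utf8?(raw)
  raw.dup.force_encoding('UTF-8').valid_encoding?
end

# Reinterpret the bytes in the assumed source encoding, then transcode.
# The Windows-1252 default is an assumption about where the data came from.
def to_utf8(raw, source_encoding = 'Windows-1252')
  raw.dup.force_encoding(source_encoding).encode('UTF-8')
end

raw = "This is a pipe, but \267 this is not a pipe".b

p 0267 == 0xB7                 # the escape is octal for the byte B7 hex
p valid_utf8?(raw)             # a lone B7 byte is not valid UTF-8
converted = to_utf8(raw)
p converted.valid_encoding?
p converted.include?("\u00B7") # the byte became U+00B7, the middle dot
```

After conversion the dot is stored as the two bytes C2 B7, which is the
multibyte sequence a UTF-8 parser expects.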

–Matt J.

nik · June 25, 2009, 12:07am

Hey Matt, thanks for your help!

Here’s what I do
work\ruby script/console

doc = `c:\\antiword.exe c:\\test.doc`

=> "\n This is a pipe \267 but this is not a pipe.\n\r"

Bakery.create(:description => doc)

=> #<Bakery id: 55, created_at: "2009-06-24 18:01:03", updated_at:
"2009-06-24 18:01:03", description: "\n This is a pipe \267 but
this is not a pipe.\n\r">

Then go to http://localhost:3000/bakeries/55, where show.html.erb is
simply

<%= @bakery.description %>

with @bakery = Bakery.find(params[:id]) in Bakeries_Controller

HTML output:

   This is a pipe

That’s the entire process of what I do. I would like to try out your
solution of translating the character encoding. Is it the same method
that Phillip suggested above, using Iconv? If so, do I convert UTF-8
to LATIN1? Or something else?

Thanks!

nik · June 25, 2009, 6:18pm

Actually, doing some more digging, you should first try the -m switch
to antiword - the docs claim that:

antiword.exe -m utf-8 c:\test.doc

should convert the character set correctly. If nothing else, it should
be easy to try out…
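
From Ruby, that call could be wrapped roughly as follows. This is a
sketch only: the method names are made up, the flag is spelled exactly
as in the command above, and it assumes an antiword executable that can
be found on the PATH (pass the full path otherwise). It also checks
that what comes back really is UTF-8 before it goes anywhere near the
database.

```ruby
# Hypothetical wrapper: runs antiword with the -m switch from this post
# and returns the extracted text. The default executable name is an
# assumption.
def antiword_text(doc_path, antiword = 'antiword')
  raw = IO.popen([antiword, '-m', 'utf-8', doc_path], &:read)
  check_utf8(raw)
end

# Tags the bytes as UTF-8 and refuses anything that is not well-formed.
def check_utf8(raw)
  text = raw.dup.force_encoding('UTF-8')
  raise ArgumentError, 'output is not valid UTF-8' unless text.valid_encoding?
  text
end
```

Usage would be something like doc = antiword_text('c:\\test.doc'),
after which the string can be saved as before.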

–Matt J.

nik · June 26, 2009, 10:39am

Hey Matt!

That saved the day for me. I am terribly sorry to have brought this
trouble up here. I did look for the documentation/reference/manual/
instructions on Google, but only some obscure links turned up. If I
learned anything at all from you all, it is to look first in the dir
of the app from now on.

Thank You! Case closed

nik · July 7, 2009, 11:46pm

Wow!

Someone else dealing with the exact same thing as me!

Matt: your suggestion to use the “-m utf-8” flag for antiword was
exactly the right solution. Conceptually it makes the most sense, too.
I.e.: “Convert this Word doc to UTF-8 and parse it into text” as the
first step. Much much nicer!

It’s good to know that Iconv could probably do the same thing later in
the process, but it’s nice to just handle it up-front and the
resulting String object is already UTF-8. Whee!

Thank you! (my solution was much less than 38 hours, primarily thanks
to this thread)

-Danimal

nik · July 27, 2009, 4:40am

Hey, I am glad that my little post helped!!