Error decoding input string

Hi,

I am trying to index a number of Spanish language text files, but a
large fraction of the files are generating errors like the
following…

Error: exception 2 not handled: Error decoding input string. Check that
you have the locale set correctly

However, it looks to me like my locale matches the file type. Running
the file command on the files returns

$ file /media/…/raw/abc/20Jan2007_abc_001041_67.es
/media/…/raw/abc/20Jan2007_abc_001041_67.es: UTF-8 Unicode text

and my locale is

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

After enough of these errors are generated, I begin to get errors for
having too many open files, and the indexing fails.

Error: exception 2 not handled: Too many open files
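In case it's relevant: if a file handle is still open when the decode
exception is raised, one descriptor can leak per failed article until the
process hits its fd limit. Block-form File.open closes the handle even when
the block raises. The sketch below is hypothetical — add_arts and the bare
Array standing in for the index are illustrations, not Ferret's API:

```ruby
# Collect paths that fail to index, closing every file handle even when
# an exception is raised mid-read.
def add_arts(paths, index)
  missed = []
  paths.each do |path|
    begin
      File.open(path, "rb") do |f|   # block form: handle closed on raise
        index << f.read              # stand-in for the real indexing call
      end
    rescue StandardError
      missed << path                 # remember the failure; fd is closed
    end
  end
  missed
end
```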

Any suggestions would be greatly appreciated.

Thanks,
Eric

Hi!

Are you sure this is all valid UTF-8? I don't know how the file
command determines this, or whether it is always right.
Maybe try playing around with iconv to ensure whatever you send to
Ferret really is UTF-8.
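Maybe something like this (an untested sketch; the path is illustrative,
adjust it to your layout) — iconv exits non-zero at the first byte sequence
that isn't valid UTF-8, so a round trip flags bad files:

```shell
# Flag any article that is not well-formed UTF-8.
for f in raw/abc/*.es; do
  iconv -f UTF-8 -t UTF-8 "$f" > /dev/null 2>&1 || echo "not valid UTF-8: $f"
done
```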

Cheers,
Jens

On 19.05.2008, at 18:00, Eric S. wrote:

[...]


Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database

Hi Jens,

Thanks for the reply!

I used iconv (thanks for the pointer, I had no idea this tool existed)
and was able to convert all of the articles to and from UTF-8 without
any errors being generated, so I am pretty sure that the input sources
are valid UTF-8.

I should mention that I am using an old version of Ferret, v0.9.6,
which is the last version to have a pure-Ruby implementation. I'm
using this version because I have added some changes that allow me
to specify the scoring algorithm on a per-search basis. I haven't,
however, made any changes to the indexing portion of the
application.

I currently have an iconv script creating transliterated ASCII copies of
all my articles, so I am going to try indexing those instead. I am also
thinking of trying to index with Lucene, since there is a chance that
the older version of Ferret is compatible with Lucene indexes.
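The transliteration step, for reference, is roughly this (a sketch assuming
GNU iconv's //TRANSLIT extension; the directory layout is illustrative):

```shell
# Write ASCII copies of each article, approximating accented characters
# (e.g. n for ñ) where the transliteration tables allow.
mkdir -p ascii
for f in raw/abc/*.es; do
  iconv -f UTF-8 -t ASCII//TRANSLIT "$f" > "ascii/$(basename "$f")"
done
```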

If you have any other suggestions I’d love to hear them, but I
understand that I can’t expect much help with such an old version. Do
you know of a way to specify custom scoring algorithms in the current
versions of ferret?

Best,
Eric

On Monday, May 19, at 23:15, Jens K. wrote:

Hi!

Are you sure this is all valid UTF-8? I don't know how the file
command determines this, or whether it is always right.
Maybe try playing around with iconv to ensure whatever you send to
Ferret really is UTF-8.

Cheers,
Jens


Hi,

So I've tried switching to the latest version of Ferret (0.11.6), but
I am still getting the following errors.

,----
| Error: exception 2 not handled: Error decoding input string. Check that you have the locale set correctly
|     from spanish_indexer.rb:45
|     from spanish_indexer.rb:38:in `each'
|     from spanish_indexer.rb:38
`----

The articles are recognized as valid UTF-8 by iconv, and I believe
my locale is set properly:

,----
| LANG=en_US.UTF-8
| LC_CTYPE="en_US.UTF-8"
| LC_NUMERIC="en_US.UTF-8"
| LC_TIME="en_US.UTF-8"
| LC_COLLATE="en_US.UTF-8"
| LC_MONETARY="en_US.UTF-8"
| LC_MESSAGES="en_US.UTF-8"
| LC_PAPER="en_US.UTF-8"
| LC_NAME="en_US.UTF-8"
| LC_ADDRESS="en_US.UTF-8"
| LC_TELEPHONE="en_US.UTF-8"
| LC_MEASUREMENT="en_US.UTF-8"
| LC_IDENTIFICATION="en_US.UTF-8"
| LC_ALL=
`----

What's weird here is that the errors don't always happen on the same
articles: if I run indexing three times, printing out the articles
that throw this error, I get a different list of articles each time.

In fact, I just changed my indexing script so that it keeps retrying
the articles that failed to index:

,----
| # ind is my index
| #
| # add_arts is a method which takes a list of articles, tries to
| # index them, and returns a list of the articles that
| # threw errors during indexing
| #
| puts art_paths.size.to_s + " articles"
| missed = add_arts(art_paths, ind)
| while missed.size > 0
|   missed = add_arts(missed, ind)
|   puts missed.size
| end
`----

and I was able to index all of the articles, with the following output:

,----
| 5843 articles
| 34
| 16
| 10
| 9
| 7
| 7
| 6
| 1
| 0
`----

Any ideas what could be causing this non-deterministic behavior?

Thanks,
Eric