[ActsAsFerret] OpenSolaris (TextDrive) indexing issues

Gents,

I successfully installed AAF on my TextDrive OpenSolaris Container, but
I’m having some issues with indexing.

I have a model called Blogs which has AAF enabled.

The first time I tried find_by_contents for a ‘word’ I know is in
the database, I got no results. Apparently the index was not ready yet.
I waited a few hours and saw that the /index directory was
receiving no changes, so the indexing was not happening either.

Then I tried to re-index and I got the following error after a few hours
of work:

Blog.rebuild_index

IOError: IO Error occured at <except.c>:79 in xraise
Error occured in fs_store.c:324 - fs_open_input
couldn’t create InStream
script/…/config/…/config/…/index/production/blog/_73j.fdx:
from /opt/csw/lib/ruby/gems/1.8/gems/ferret-0.10.14/lib/ferret/index.rb:273:in `delete'
from /opt/csw/lib/ruby/gems/1.8/gems/ferret-0.10.14/lib/ferret/index.rb:273:in `<<'
from /opt/csw/lib/ruby/1.8/monitor.rb:229:in `synchronize'
from /opt/csw/lib/ruby/gems/1.8/gems/ferret-0.10.14/lib/ferret/index.rb:256:in `<<'
from ./script/…/config/…/config/…/vendor/plugins/acts_as_ferret/lib/class_methods.rb:199:in `rebuild_index'
from ./script/../config/../config/../vendor/plugins/acts_as_ferret/lib/class_methods.rb:198:in `rebuild_index'
from ./script/…/config/…/config/…/vendor/plugins/acts_as_ferret/lib/class_methods.rb:197:in `rebuild_index'
from /opt/csw/lib/ruby/gems/1.8/gems/activerecord-1.14.4/lib/active_record/connection_adapters/abstract/database_statements.rb:51:in `transaction'
from /opt/csw/lib/ruby/gems/1.8/gems/activerecord-1.14.4/lib/active_record/transactions.rb:91:in `transaction'
from ./script/../config/../config/../vendor/plugins/acts_as_ferret/lib/class_methods.rb:196:in `rebuild_index'
from ./script/…/config/…/config/…/vendor/plugins/acts_as_ferret/lib/class_methods.rb:194:in `rebuild_index'
from (irb):9

Again, it seems that the index is incomplete and is returning partial
results.

Any suggestions on what to do?

PS: During the indexing, nothing else was querying the DB. The only
thing running against it was the console where I ran rebuild_index.

Thanks in advance.

Manoel L.

On Sun, Jan 21, 2007 at 09:32:25PM +0100, Manoel L. wrote:

> receiving no changes, so the indexing was not happening also.
>
> Then I tried to re-index and I got the following error after a few hours
> of work:

Does this mean it took a few hours for rebuilding the index, or did you
only start the rebuild after a few hours?

> Blog.rebuild_index
>
> IOError: IO Error occured at <except.c>:79 in xraise
> Error occured in fs_store.c:324 - fs_open_input
> couldn’t create InStream
> script/…/config/…/config/…/index/production/blog/_73j.fdx:

Strange. This does really look like the index has been modified by
something else while the rebuild was running. Could you try to start
over with a new, empty index directory?

Jens


webit! Gesellschaft für neue Medien mbH www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer
Schnorrstraße 76 Tel +49 351 46766 0
D-01069 Dresden Fax +49 351 46766 66

Jens, answering your questions:

  1. Yes, it took a few hours between the start of rebuild_index and the
    failure.

  2. I don’t think that any other process is modifying the index folder,
    but I’ll try your suggestion: cleaning the index folder and running
    rebuild_index again.

Thanks for the attention.

Jens,

Maybe you are correct. Actually my Rails application was UP.
I mean, while I was running Blog.rebuild_index on the console, the Rails
app was running.

Is this the kind of simultaneous modification of the index that you
talked about?

If yes, how will Ferret and Acts-As-Ferret behave in a real life
situation where we have several Mongrels running the Rails application?
Is this a problem?

The Blog.rebuild_index is running, I’ll let you know the results (now
with only the console running).

Thanks for the help.

Sincerely,

Manoel L.

On Mon, Jan 22, 2007 at 12:25:13PM +0100, Manoel L. wrote:

> Jens,
>
> Maybe you are correct. Actually my Rails application was UP.
> I mean, while I was running Blog.rebuild_index on the console, the Rails
> app was running.
>
> Is this the kind of simultaneous modification of the index that you
> talked about?

Exactly.

> If yes, how will Ferret and Acts-As-Ferret behave in a real life
> situation where we have several Mongrels running the Rails application?
> Is this a problem?

It should not, since Ferret is supposed to have file-system-based
locking that manages inter-process synchronisation.

However, it doesn’t seem to be reliable under certain circumstances.
The usual workaround is to use a backgroundrb process that does all
the indexing, and only do the searching inside the mongrels.
Unfortunately aaf does not support this kind of remote indexing yet,
but it is definitely on my list.

> The Blog.rebuild_index is running, I’ll let you know the results (now
> with only the console running).

Sounds like you index a whole farm of blogs. I’m still wondering about
the reason for the long indexing time ;-)

cheers,
Jens


On Mon, Jan 22, 2007 at 11:59:06AM +0100, Manoel L. wrote:

> Jens, answering your questions:
>
>   1. Yes, it took a few hours from the start of the rebuild_index and the
>     failure.

Wow, either that machine is really slow or you have an enormous amount
of data to index…

or something really weird is going on there.

>   2. I don’t think that any other process is modifying the index folder,
>     but I’ll try your suggestion. Cleaning the index folder and running
>     rebuild_index again.

Let us know how it works out.

Jens


Jens,

In fact, I’m indexing around 150K blogs, my app is a Blog/Posts indexing
service, just like Technorati, but focused on the Brazilian blogosphere.

Same error, even with only the console running Blog.rebuild_index, see:

/opt/csw/lib/ruby/gems/1.8/gems/rails-1.1.6/lib/commands/runner.rb:27:
/opt/csw/lib/ruby/gems/1.8/gems/ferret-0.10.14/lib/ferret/index.rb:273:in `delete': IO Error occured at <except.c>:79 in xraise (IOError)
Error occured in fs_store.c:324 - fs_open_input
couldn’t create InStream
script/…/config/…/index/production/blog/_3pe.fdx:
from /opt/csw/lib/ruby/gems/1.8/gems/ferret-0.10.14/lib/ferret/index.rb:273:in `<<'
from /opt/csw/lib/ruby/1.8/monitor.rb:229:in `synchronize'
from /opt/csw/lib/ruby/gems/1.8/gems/ferret-0.10.14/lib/ferret/index.rb:256:in `<<'
from ./script/../config/../vendor/plugins/acts_as_ferret/lib/class_methods.rb:199:in `rebuild_index'
from ./script/…/config/…/vendor/plugins/acts_as_ferret/lib/class_methods.rb:198:in `rebuild_index'
from ./script/../config/../vendor/plugins/acts_as_ferret/lib/class_methods.rb:197:in `rebuild_index'
from /opt/csw/lib/ruby/gems/1.8/gems/activerecord-1.14.4/lib/active_record/connection_adapters/abstract/database_statements.rb:51:in `transaction'
from /opt/csw/lib/ruby/gems/1.8/gems/activerecord-1.14.4/lib/active_record/transactions.rb:91:in `transaction'
from ./script/…/config/…/vendor/plugins/acts_as_ferret/lib/class_methods.rb:196:in `rebuild_index'
from ./script/../config/../vendor/plugins/acts_as_ferret/lib/class_methods.rb:194:in `rebuild_index'
from (eval):1
from /opt/csw/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:21:in `eval'
from /opt/csw/lib/ruby/gems/1.8/gems/rails-1.1.6/lib/commands/runner.rb:27
from /opt/csw/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:21:in `require'
from /opt/csw/lib/ruby/gems/1.8/gems/activesupport-1.3.1/lib/active_support/dependencies.rb:147:in `require'
from ./script/runner:3

Suggestions?

Is there anything else I can do to gather more debug data?

[]s

Manoel

Jens,

The content of my app/index/production/blog directory is:
(just after the exception on Blog.rebuild_index)

[92140-AA:~/web/labs/blogblogs/trunk/index/production/blog] pocscom$ ls -al > index.txt
[92140-AA:~/web/labs/blogblogs/trunk/index/production/blog] pocscom$ more index.txt
total 53451
drwxr-xr-x 2 pocscom pocscom 40 Jan 22 15:32 ./
drwxr-xr-x 3 pocscom pocscom 3 Jan 22 09:00 ../
-rw------- 1 pocscom pocscom 6.3M Jan 22 10:18 _1pp.cfs
-rw------- 1 pocscom pocscom 6.2M Jan 22 11:00 _2kk.cfs
-rw------- 1 pocscom pocscom 6.9M Jan 22 11:44 _3ff.cfs
-rw------- 1 pocscom pocscom 143K Jan 22 11:48 _3ii.cfs
-rw------- 1 pocscom pocscom 260K Jan 22 11:51 _3ll.cfs
-rw------- 1 pocscom pocscom 893K Jan 22 11:56 _3oo.cfs
-rw------- 1 pocscom pocscom 82K Jan 22 11:56 _3oz.cfs
-rw------- 1 pocscom pocscom 42K Jan 22 11:56 _3pa.cfs
-rw------- 1 pocscom pocscom 1 Jan 22 11:57 _3pe.f0
-rw------- 1 pocscom pocscom 1 Jan 22 11:57 _3pe.f1
-rw------- 1 pocscom pocscom 1 Jan 22 11:57 _3pe.f2
-rw------- 1 pocscom pocscom 1 Jan 22 11:57 _3pe.f3
-rw------- 1 pocscom pocscom 1 Jan 22 11:57 _3pe.f4
-rw------- 1 pocscom pocscom 1 Jan 22 11:57 _3pe.f5
-rw------- 1 pocscom pocscom 5 Jan 22 11:57 _3pe.frq
-rw------- 1 pocscom pocscom 5 Jan 22 11:57 _3pe.prx
-rw------- 1 pocscom pocscom 37 Jan 22 11:57 _3pe.tfx
-rw------- 1 pocscom pocscom 100 Jan 22 11:57 _3pe.tis
-rw------- 1 pocscom pocscom 24 Jan 22 11:57 _3pe.tix
-rw------- 1 pocscom pocscom 226 Jan 22 11:57 _3pe.tmp
-rw------- 1 pocscom pocscom 5.5K Jan 22 11:57 _3pl.cfs
-rw------- 1 pocscom pocscom 44K Jan 22 11:57 _3pw.cfs
-rw------- 1 pocscom pocscom 84K Jan 22 11:57 _3q7.cfs
-rw------- 1 pocscom pocscom 85K Jan 22 11:57 _3qi.cfs
-rw------- 1 pocscom pocscom 1.1K Jan 22 11:57 _3qj.cfs
-rw------- 1 pocscom pocscom 2.2K Jan 22 11:57 _3qk.cfs
-rw------- 1 pocscom pocscom 767 Jan 22 11:57 _3ql.cfs
-rw------- 1 pocscom pocscom 550 Jan 22 11:57 _3qm.cfs
-rw------- 1 pocscom pocscom 684 Jan 22 11:57 _3qn.cfs
-rw------- 1 pocscom pocscom 949 Jan 22 11:57 _3qo.cfs
-rw------- 1 pocscom pocscom 776 Jan 22 11:57 _3qp.cfs
-rw------- 1 pocscom pocscom 1.1K Jan 22 11:57 _3qq.cfs
-rw------- 1 pocscom pocscom 40K Jan 22 11:57 _3qr.cfs
-rw------- 1 pocscom pocscom 4.4M Jan 22 09:38 _uu.cfs
-rw------- 1 pocscom pocscom 114 Jan 22 11:57 _uu.del
-rw------- 1 pocscom pocscom 79 Jan 22 11:57 fields
-rw------- 1 pocscom pocscom 156 Jan 22 11:57 segments

On 1/23/07, Jens K. wrote:

> It should not, since Ferret is supposed to have a file system based
> locking that manages inter-process synchronisation.

(a bit OT, but since it was mentioned…)

As in managing simultaneous writes as well?

Reason I’m asking is, I wrote an app a few months ago which is a
networked index that is supposed to handle multiple “clients” writing
to the index at the same time. What I did was to write a class that
queued those requests and dispatched them one at a time, since
otherwise, the server would crash because of Ferret locking issues.
That was around Ferret 0.9.3 or so.

I understand I could flush the index every time I insert something,
but that’s too much of a cost in terms of performance that I can’t
afford…
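
The queue-and-dispatch approach described above can be sketched roughly as follows. This is only an illustration, not the original class: `SerializedWriter` and its method names are hypothetical, and a plain Array stands in for the Ferret index so the sketch runs without Ferret installed.

```ruby
require "thread"

# Hypothetical sketch: funnel all index writes through one worker thread so
# that only a single writer ever touches the index at a time.
class SerializedWriter
  def initialize(index)
    @index = index          # anything that responds to << (stand-in for an index)
    @queue = Queue.new      # thread-safe FIFO from the standard library
    @worker = Thread.new do
      # pop blocks until a job is available; the :stop marker ends the loop.
      while (doc = @queue.pop) != :stop
        @index << doc
      end
    end
  end

  # Called by any number of client threads; returns immediately.
  def enqueue(doc)
    @queue << doc
  end

  # Let the worker drain the remaining jobs, then wait for it to finish.
  def shutdown
    @queue << :stop
    @worker.join
  end
end

index = []  # stand-in for the real index object (assumption)
writer = SerializedWriter.new(index)
clients = 4.times.map do |c|
  Thread.new { 25.times { |i| writer.enqueue("client#{c}-doc#{i}") } }
end
clients.each(&:join)
writer.shutdown
puts index.size
```

Clients never write to the index directly, so the write lock is only ever requested by the worker thread.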

On Mon, Jan 22, 2007 at 06:22:27PM +0100, Manoel L. wrote:

> Jens,
>
> In fact, I’m indexing around 150K blogs, my app is a Blog/Posts indexing
> service, just like Technorati, but focused on the Brazilian blogosphere.
>
> Same error, even with only the console running Blog.rebuild_index, see:

I really can’t imagine why this should happen with only one process
accessing the index.

Do you have the possibility to try this out on some other platform (e.g.,
Linux)?

Jens

Jens,

Any idea on my issue?

I still cannot complete the index rebuild.
Every time I try, I get the same error (but on different files).
Now I’m totally sure that only a single process (the console) is running.

[]s

Manoel

On Tue, Jan 23, 2007 at 09:34:09AM +1100, Julio Cesar O. wrote:

> On 1/23/07, Jens K. wrote:
>
> > It should not, since Ferret is supposed to have a file system based
> > locking that manages inter-process synchronisation.
>
> (a bit OT, but since it was mentioned…)
>
> As in managing simultaneous writes as well?

The locking is supposed to prevent simultaneous writing. AFAIR Ferret
internally waits some time and then retries the write, throwing an error
if it still doesn’t succeed.
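
The wait-and-retry behaviour described above can be illustrated with a small lock-file sketch. This is not Ferret's actual implementation; `with_write_lock`, `LockTimeout`, the retry count, and the delay are all assumptions chosen for illustration.

```ruby
require "tmpdir"

class LockTimeout < StandardError; end

# Try to create the lock file exclusively; if another writer holds it,
# sleep and retry a few times before giving up with an error.
def with_write_lock(path, retries = 5, delay = 0.01)
  attempts = 0
  begin
    # File::EXCL makes creation fail with Errno::EEXIST if the file exists.
    File.open(path, File::WRONLY | File::CREAT | File::EXCL) { |f| f.write(Process.pid.to_s) }
  rescue Errno::EEXIST
    attempts += 1
    raise LockTimeout, "could not obtain #{path} after #{attempts} attempts" if attempts > retries
    sleep delay
    retry
  end
  begin
    yield
  ensure
    File.delete(path)   # release the lock even if the block raises
  end
end

lock = File.join(Dir.mktmpdir, "demo-write.lck")
result = with_write_lock(lock) { :wrote }

# A second writer gives up while the first one still holds the lock.
error = nil
with_write_lock(lock) do
  begin
    with_write_lock(lock, 2) { }
  rescue LockTimeout => e
    error = e
  end
end
```

In this sketch a writer that is killed before reaching the `ensure` clause leaves its lock file behind, so later writers would time out until the stale file is removed by hand.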

> Reason I’m asking is, I wrote an app a few months ago which is a
> networked index that is supposed to handle multiple “clients” writing
> to the index at the same time. What I did was to write a class that
> queued those requests and dispatched them one at a time, since
> otherwise, the server would crash because of Ferret locking issues.
> That was around Ferret 0.9.3 or so.

I’d still go this route to make sure the index stays sane, especially
with a heavily loaded app.

Jens


Jens,

I think I found it (dumb), hehe.
I just saw that I had exceeded my disk quota.
I cleared a few gigs and I’m waiting for the indexing to finish.

Let’s see…

[]s

Manoel

Jens,

It seems the problem really was a concurrent process building the index
at the same time. I was not aware that I had a runner process in
cron.
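
One way to guard against this kind of accidental overlap is to make every job that rebuilds the index take an exclusive lock first. A minimal sketch, assuming a lock file path of your own choosing; `exclusive_rebuild` is a hypothetical helper, and the block is where a call such as Blog.rebuild_index would go.

```ruby
require "tmpdir"

# Run the block only if nobody else currently holds the lock.
# Returns false without running the block when the lock is already taken.
def exclusive_rebuild(lock_path)
  File.open(lock_path, File::RDWR | File::CREAT) do |f|
    # LOCK_NB makes flock return false instead of blocking.
    return false unless f.flock(File::LOCK_EX | File::LOCK_NB)
    yield
    true
  end
end

lock = File.join(Dir.mktmpdir, "rebuild.lock")
ran = []
outer = exclusive_rebuild(lock) do
  ran << :first
  # A second attempt while the first is still running is refused
  # (this nested demo assumes a platform with native flock, e.g. Linux).
  ran << :second if exclusive_rebuild(lock) { true }
end
puts ran.inspect
```

The operating system drops the lock when the file is closed or the process exits, so a crashed rebuild does not block later runs.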

Now I’m running Blog.rebuild_index truly alone, with no failures so
far.
The crazy thing is that it has used 19 HOURS of CPU already and I think
we are still far from done.
I don’t know what a completed index looks like, but the file names give me
an idea of the progress.

TOP Result:

load averages: 3.25, 4.62, 5.88 14:09:29
73 processes: 71 sleeping, 2 on cpu
CPU states: 19.8% idle, 62.1% user, 18.1% kernel, 0.0% iowait, 0.0% swap
Memory: 16G real, 2053M free, 7520M swap in use, 21G swap free

PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND
5737 pocscom 1 59 0 36M 32M sleep 19:38 0.32% runner

Current contents of app/index/production/blog:

[92140-AA:~/web/labs/blogblogs/trunk/index/production/blog] pocscom$ ls -al
total 349436
drwxr-xr-x 2 pocscom pocscom 34 Jan 24 14:10 ./
drwxr-xr-x 3 pocscom pocscom 3 Jan 23 10:40 ../
-rw------- 1 pocscom pocscom 53M Jan 23 17:25 _8km.cfs
-rw------- 1 pocscom pocscom 45M Jan 24 00:03 _h59.cfs
-rw------- 1 pocscom pocscom 35M Jan 24 06:20 _ppw.cfs
-rw------- 1 pocscom pocscom 2.6M Jan 24 06:57 _qkr.cfs
-rw------- 1 pocscom pocscom 1000K Jan 24 07:34 _rfm.cfs
-rw------- 1 pocscom pocscom 2.2M Jan 24 08:12 _sah.cfs
-rw------- 1 pocscom pocscom 4.5M Jan 24 08:50 _t5c.cfs
-rw------- 1 pocscom pocscom 4.0M Jan 24 09:32 _u07.cfs
-rw------- 1 pocscom pocscom 4.9M Jan 24 10:16 _uv2.cfs
-rw------- 1 pocscom pocscom 4.3M Jan 24 11:27 _vpx.cfs
-rw------- 1 pocscom pocscom 3.3M Jan 24 12:24 _wks.cfs
-rw------- 1 pocscom pocscom 5.2M Jan 24 13:32 _xfn.cfs
-rw------- 1 pocscom pocscom 968K Jan 24 13:38 _xiq.cfs
-rw------- 1 pocscom pocscom 656K Jan 24 13:45 _xlt.cfs
-rw------- 1 pocscom pocscom 226K Jan 24 13:50 _xow.cfs
-rw------- 1 pocscom pocscom 655K Jan 24 13:54 _xrz.cfs
-rw------- 1 pocscom pocscom 457K Jan 24 13:58 _xv2.cfs
-rw------- 1 pocscom pocscom 575K Jan 24 14:03 _xy5.cfs
-rw------- 1 pocscom pocscom 459K Jan 24 14:07 _y18.cfs
-rw------- 1 pocscom pocscom 82K Jan 24 14:07 _y1j.cfs
-rw------- 1 pocscom pocscom 42K Jan 24 14:08 _y1u.cfs
-rw------- 1 pocscom pocscom 42K Jan 24 14:08 _y25.cfs
-rw------- 1 pocscom pocscom 3.6K Jan 24 14:09 _y2g.cfs
-rw------- 1 pocscom pocscom 2.5K Jan 24 14:09 _y2r.cfs
-rw------- 1 pocscom pocscom 121K Jan 24 14:10 _y32.cfs
-rw------- 1 pocscom pocscom 584 Jan 24 14:10 _y33.cfs
-rw------- 1 pocscom pocscom 593 Jan 24 14:10 _y34.cfs
-rw------- 1 pocscom pocscom 94 Jan 24 14:10 _y35.fdt
-rw------- 1 pocscom pocscom 0 Jan 24 14:10 _y35.fdx
-rw------- 1 pocscom pocscom 0 Jan 23 10:40 ferret-write.lck
-rw------- 1 pocscom pocscom 79 Jan 24 14:10 fields
-rw------- 1 pocscom pocscom 195 Jan 24 14:10 segments

Hello, I have a couple of questions; I hope someone can help answer them:

  1. How do you know when the indexing is complete?
  2. How can you confirm that ALL records in the table were indexed
    (especially for really large tables with millions of records)?
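
One generic way to approach the second question is to compare the primary keys in the table with the ids stored in the index. A minimal sketch, assuming you can enumerate both sides; `missing_from_index` is a hypothetical helper and the two id lists below are made-up stand-ins, not real data.

```ruby
require "set"

# Return the database ids that have no matching document in the index.
def missing_from_index(db_ids, indexed_ids)
  indexed = Set.new(indexed_ids)   # constant-time membership checks
  db_ids.reject { |id| indexed.include?(id) }
end

db_ids      = (1..10).to_a                 # stand-in for the table's primary keys
indexed_ids = [1, 2, 3, 4, 5, 7, 8, 10]    # stand-in for ids read back from the index
missing = missing_from_index(db_ids, indexed_ids)
puts "indexed #{indexed_ids.size} of #{db_ids.size}, missing: #{missing.inspect}"
```

For very large tables, comparing the two totals first is a cheaper check, and the id comparison can be run in batches only when the totals differ.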