Newbie question: 28000+ files for 25000+ records?

Hi

Obviously my question is: is it normal to have so many files? I was
indexing 6 string fields from 25,000+ model records (all of the same
model). The index appears to be working. I guess I was expecting a few
hundred files after optimizing, not more files than records indexed.

Please understand I am brand spanking new to Lucene, Ferret, and AaF.

I was using acts_as_ferret with

:fields => ["user_id",
"answer1",
"answer2",
"answer3",
"answer4",
"answer5",
"answer6"],
:merge_factor => 1000,
:max_merge_document => 10000,
:max_memory_buffer => 0x4000000

The fields are from 15 to 500 characters long.

Also, was there any way to stop AaF from trying to create a new index
with all the existing model data? I was surprised when after creating
and updating one model object in the Rails Console, AaF took off trying
to index all 8 million rows of the underlying table!

I did search here on ‘too many files’ and “large number of files” but
came up empty. I am sure my lack of domain knowledge is most likely what
is hurting me.

Thanks
Jeff

On 10/13/06, Jeff G. [email protected] wrote:

Hi

Obviously my question is: is it normal to have so many files? I was
indexing 6 string fields from 25,000+ model records (all of the same
model). The index appears to be working. I guess I was expecting a few
hundred files after optimizing, not more files than records indexed.

Hi Jeff, this doesn’t sound right at all. Could you send a partial listing
of the directory so I can see what files are in it? Do an ls -l so I
can see their sizes too.

Please understand I am brand spanking new to Lucene, Ferret, and AaF.

No problem, we’re here to help.

        :max_merge_document => 10000,
        :max_memory_buffer => 0x4000000

The fields are from 15 to 500 characters long.

Also, was there any way to stop AaF from trying to create a new index
with all the existing model data? I was surprised when after creating
and updating one model object in the Rails Console, AaF took off trying
to index all 8 million rows of the underlying table!

I’ll leave these kinds of questions to the acts_as_ferret users.

Cheers,
Dave

On Fri, Oct 13, 2006 at 09:40:23AM +0900, David B. wrote:

On 10/13/06, Jeff G. [email protected] wrote:
[…]

Also, was there any way to stop AaF from trying to create a new
index with all the existing model data? I was surprised when after
creating and updating one model object in the Rails Console, AaF
took off trying to index all 8 million rows of the underlying table!

aaf always tries to create the index if it doesn’t exist yet. The whole
point of aaf is to keep the index in sync with your database.
Therefore it is necessary to add all existing records to a newly created
index.

Although it would be easy to add an option to suppress the indexing of
existing data, I don’t think this would be useful, because you’d end up
with an index containing only new or updated records, but not those that
already existed at index creation time. I can’t imagine this is what you
want. ;)

To keep the index creation from happening when the index is accessed the
first time from your app (could be a search, or some update/create
operation), you can build up the index from the console, i.e.

RAILS_ENV=production script/console

Model.rebuild_index

cheers,
Jens


webit! Gesellschaft für neue Medien mbH www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer [email protected]
Schnorrstraße 76 Tel +49 351 46766 0
D-01069 Dresden Fax +49 351 46766 66

Jens K. wrote:

To keep the index creation from happening when the index is accessed the
first time from your app (could be a search, or some update/create
operation), you can build up the index from the console, i.e.

RAILS_ENV=production script/console

Model.rebuild_index

cheers,
Jens

Thank you Jens. While all you say is true, the original rowset was over
8,000,000 rows and would have taken days or more. I just wanted to do
some experimentation to see if my code would work. AaF is not that well
documented (well, perhaps for those smarter than I) and therefore I
thought my best bet was to play with it in the console. Little did I
realize it would go off and index the table to start.

And while you are correct that most of the time you want a full index,
there really are use cases where you only want to index data from that
time forward. Anyway, I created a much smaller table and worked with it
to start. However, it created 28,000 files for 25,000 records. Still not
quite right. But it does work in that I can search it.

BTW: Is there a method to say only return fields from the documents that
matched and not all the fields of documents that had matches? Of course
I did my own filter.

Best wishes and thank you for the advice and counsel.
Jeff

On 10/20/06, Jeff G. [email protected] wrote:

Jens
really there are use cases where you only want to index data from that
time forward.

That may be true, but I don’t think the goal of acts_as_ferret should
be to cover all possible use cases. Its job is to make using Ferret
with ActiveRecord as easy as possible. If you need to do anything more
complicated than usual, then why not just use Ferret directly?

Anyway, I created a much smaller table and worked with it
to start. However, it created 28,000 files for 25,000 records. Still not
quite right. But it does work in that I can search it.

There is something very wrong there, but I have no idea what the
problem is. For some reason Ferret doesn’t seem to be merging the index
segments (judging by your following email).

BTW: Is there a method to say only return fields from the documents that
matched and not all the fields of documents that had matches? Of course
I did my own filter.

Ferret documents are lazy loading so only the fields that you view get
loaded. However, there is currently no way to find out which fields
matched.

Cheers,
Dave

David B. wrote:

On 10/13/06, Jeff G. [email protected] wrote:

Hi Jeff, this doesn’t sound right at all. Could send a partial listing
of the directory so I can see what files are in it? Do ls -l so I

Below is a very, very, very partial listing. My env is Windows XP Pro. The
versions of the gems are listed below as well. Basically I accessed the first
model object and said model.save! to kick off the indexing. Which it
did. BTW: this is SQLServer, if it matters. BTW: searching the index
works. Well…

I found out when I asked for highlight() that I never get anything back.
Looking at the source code and my fields, I find I must have had (or
defaulted to) :store => :no, so I have to retrieve the row, iterate myself
over the fields to find out which field matched, and then display the
results. That is not pretty but I have to admit, it’s painless. Still,
25,000 records made 28,000+ files. Can you imagine all 8.1 million
records!! Is it because one of the fields being indexed is always unique
(think User ID/Primary key)?
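For what it’s worth, the manual filter described above might look roughly like this. This is a sketch only: the answer field names come from the earlier config, and a plain substring check stands in for Ferret’s real analysis and stemming.

```ruby
# Naive stand-in for the manual workaround: scan each indexed field of a
# fetched record for the query terms and report which fields matched.
ANSWER_FIELDS = %w[answer1 answer2 answer3 answer4 answer5 answer6]

def matching_fields(record, query)
  terms = query.downcase.split
  ANSWER_FIELDS.select do |field|
    value = record[field].to_s.downcase
    terms.any? { |term| value.include?(term) }
  end
end

row = { "answer1" => "loves Ruby and Ferret", "answer2" => "prefers Lucene" }
matching_fields(row, "ferret")  # ["answer1"]
```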

I was going to try it in Lucene and see what happens. I figure if it
is different, they must be doing something odd in Ferret/AaF. Plus I can
try native Ferret to create the index and forego AaF for the initial
index creation (assuming that is a ‘fix’).

Thank you for any time and effort. I am becoming quite a
Ruby/Rails/Ferret fan for prototyping. I can’t say I am ready for Rails
on my production environment hosting 40k logged-in users a night, but
it’s wonderful for concept exploration.

Here is the partial listing (they are representational of all the other
files except for the last two of which they are the only ones). After
the listing is my gems versions

10/11/2006 08:23 PM 1,300 _z.cfs
10/11/2006 08:25 PM 1,314 _z0.cfs
10/11/2006 08:25 PM 1,705 _z1.cfs
10/11/2006 08:26 PM 3,039 _z2.cfs
10/11/2006 08:26 PM 970 _z3.cfs
10/11/2006 08:26 PM 3,015 _z4.cfs
10/11/2006 08:26 PM 14,266 _z5.cfs
10/11/2006 08:26 PM 770 _z6.cfs
10/11/2006 08:26 PM 815 _z7.cfs
10/11/2006 08:26 PM 1,150 _z8.cfs
10/11/2006 08:26 PM 1,564 _z9.cfs
10/11/2006 08:26 PM 2,283 _za.cfs
10/11/2006 08:26 PM 1,259 _zb.cfs
10/11/2006 08:26 PM 1,598 _zc.cfs
10/11/2006 08:26 PM 1,655 _zd.cfs
10/11/2006 08:26 PM 5,466 _ze.cfs
10/11/2006 08:26 PM 1,242 _zf.cfs
10/11/2006 08:26 PM 13,609 _zg.cfs
10/11/2006 08:26 PM 2,081 _zh.cfs
10/11/2006 08:26 PM 1,101 _zi.cfs
10/11/2006 08:26 PM 1,053 _zj.cfs
10/11/2006 08:26 PM 2,208 _zk.cfs
10/11/2006 08:26 PM 920 _zl.cfs
10/11/2006 08:26 PM 3,003 _zm.cfs
10/11/2006 08:26 PM 2,148 _zn.cfs
10/11/2006 08:26 PM 1,195 _zo.cfs
10/11/2006 08:26 PM 1,707 _zp.cfs
10/11/2006 08:26 PM 1,747 _zq.cfs
10/11/2006 08:26 PM 12,889 _zr.cfs
10/11/2006 08:26 PM 2,531 _zs.cfs
10/11/2006 08:26 PM 1,359 _zt.cfs
10/11/2006 08:26 PM 2,330 _zu.cfs
10/11/2006 08:26 PM 1,793 _zv.cfs
10/11/2006 08:26 PM 1,788 _zw.cfs
10/11/2006 08:26 PM 3,135 _zx.cfs
10/11/2006 08:26 PM 2,603 _zy.cfs
10/11/2006 08:26 PM 2,210 _zz.cfs
10/12/2006 08:39 AM 213 fields
10/12/2006 08:40 AM 29 segments
28381 File(s) 261,021,758 bytes
2 Dir(s) 35,192,127,488 bytes free

actionmailer (1.2.5), actionpack (1.12.5), actionwebservice (1.1.6)
activerecord (1.14.4), activesupport (1.3.1), ferret (0.10.9),
fxri (0.3.3), fxruby (1.6.2, 1.6.1, 1.6.0, 1.2.6), gem_plugin (0.2.1)
log4r (1.0.5), mongrel (0.3.13.3)
rails (1.1.6), rake (0.7.1)
sources (0.0.1), win32-clipboard (0.4.1, 0.4.0)
win32-dir (0.3.0)
win32-eventlog (0.4.2, 0.4.1)
win32-file (0.5.2)
win32-file-stat (1.2.2)
win32-process (0.5.1, 0.4.2)
win32-sapi (0.1.3)
win32-service (0.5.0)
win32-sound (0.4.0)
windows-pr (0.5.4, 0.5.1)

Thanks everyone for your help. Of course I meant no disrespect about AaF
indexing tables as soon as it discovers there is none. I am quite
appreciative of the plugin as it is. Please don’t think I don’t
appreciate it. :)

And I really thank you all for the help

C:\ruby\omi\index\development\string_answer>od -c segments
0000000 \0 \0 \0 \0 \0 \0 \0 \0 E . # · \0 \0 \0 \0
0000020 \0 \0 n ¦ 001 004 _ l w a + ¦ 001
0000035

So am I correct in saying that the file _lwa.cfs is the only file really
needed?

Thanks again. It’s great to see that it really worked and that the only
problem is ‘Windoze’ related. I would be working on Ubuntu but the
SQLServer adapter I tried there could not page through data sets.

On 10/20/06, Jeff G. [email protected] wrote:

did. BTW: this is SQLServer if it matters. BTW: The searching the index
works. well…

Ahhh. I’ve had this problem on Windows before, but I thought it was
fixed. For some reason the operating system mustn’t be allowing Ferret
to delete the index files when it is finished with them. I’m not sure
why this would be happening, though. This would give us approximately
25_000 + 2500 + 250 + 25 + 2 = 27777 files after merging. This is
still short of the 28300 files you have, though. :(
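Dave’s arithmetic here can be reproduced with a quick sketch (assuming, as he does, a merge factor of 10 and that no obsolete segment file ever gets deleted):

```ruby
# Count the segment files that would pile up if none were ever deleted:
# each record starts as its own single-document segment, and every
# merge_factor segments are merged into one new segment (integer
# division, as the 25 -> 2 step above shows).
def leftover_files(docs, merge_factor = 10)
  total = 0
  segments = docs
  loop do
    total += segments
    break if segments < merge_factor
    segments /= merge_factor
  end
  total
end

leftover_files(25_000)  # 25_000 + 2_500 + 250 + 25 + 2 = 27_777
```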

I found out when I asked to highlight() that I never get anything back.
Looking at the soruce code and my fields I find I must have had (or
defaulted) to :store=>no so I have to retrieve the row, iterate myself
over the fields to find out which field matched, and then display the
results. That is not pretty but I have to admit, it’s painless.

This is one of the reasons I want to implement a database based on
Ferret, so that operations like this will be very simple. I could add
a highlighting method for externally stored fields, but you need to
store term vectors for the highlighting to work exactly (i.e. for
stemmed terms and matching sloppy phrases exactly), so if you are
storing term_vectors, you may as well store the field as well. For
externally stored fields the highlighting method you are using is
best.

Still
25,000 records made 28,000+ files. Can you imagine all 8.1 million
records!! Is it because one of the fields being indexed is always unique
(think User ID/Primary key)?

No, I think the majority of those files are obsolete. In fact I’m not
sure if Windows would even allow you to open that many files at once
(and Ferret does open all of the files in the index directory.) If you
open up the segments file you’ll see a list of the segments that are
actually still being used by Ferret (along with a bunch of binary
data). Given that your segments file is only 29 bytes, I’m guessing
that you have optimized your index and you only have one valid index
segment. The rest is junk.

For the record I indexed 2,000,000 records the other day
(approximately 4000kb each) in 2 1/2 hours and I had at most 120 files
in my index directory.

I was going to try it in Lucene and see what happens. I figure if it
is different, they must be doing something odd in Ferret/AaF. Plus I can
try native Ferret to create the index and forego AaF for the initial
index creation (assuming that is a ‘fix’).

Lucene actually records a list of files it fails to delete and
continues to try and delete those files. It’s a bit of a hack and I
was hoping to get away with not doing that in Ferret. Looks like I was
wrong. I wonder why it works for me and not for you. I have XP Home
edition so it should be the same.
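A retry list like the one described could look roughly like this in Ruby. This is only a sketch: the class and method names are made up, and it is not Lucene’s or Ferret’s actual code.

```ruby
# Sketch of a deferred-delete list: files the OS refuses to delete are
# remembered and retried later instead of being forgotten.
class DeferredDeleter
  def initialize
    @pending = []
  end

  # Try to delete now; on a permission error, queue the file for later.
  def delete(path)
    File.delete(path)
  rescue Errno::EACCES
    @pending << path unless @pending.include?(path)
  rescue Errno::ENOENT
    # already gone, nothing to do
  end

  # Retry everything still queued; keep whatever still can't be deleted.
  # Returns the number of files still pending.
  def retry_pending
    @pending.delete_if do |path|
      begin
        File.delete(path)
        true
      rescue Errno::EACCES
        false
      rescue Errno::ENOENT
        true
      end
    end
    @pending.size
  end
end
```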

On 10/21/06, Jeff G. [email protected] wrote:

0000035

So am I correct in say that the file _lwa.cfs is the only file really
needed?

Well, sort of. Ferret does write a couple of files while it is
indexing that won’t appear in the segments file. Also, don’t delete the
fields file. Otherwise, “lwa” is a base-36 integer, so you can delete any
file labeled _lw9 and below.
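The base-36 rule is easy to check mechanically. A hedged sketch (the segment name “lwa” and the .cfs extension come from this thread; the helper name is made up):

```ruby
# Segment file names are base-36 integers: _lw9.cfs is older than
# _lwa.cfs because "lw9" is one less than "lwa" in base 36.
def obsolete_segment?(filename, current = "lwa")
  name = File.basename(filename, ".cfs").sub(/\A_/, "")
  name.to_i(36) < current.to_i(36)
end

obsolete_segment?("_lw9.cfs")  # true: lw9 < lwa in base 36
obsolete_segment?("_lwa.cfs")  # false: this is the live segment
```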

Thanks again. It’s great to see that it really worked and that the only
problem is ‘Windoze’ related. I would be working on Ubuntu but the
SQLServer adapter I tried there could not page through data sets.

Ferret definitely works on Ubuntu. That’s where it was developed, and I
think Jens may actually develop acts_as_ferret on Ubuntu too.

Cheers,
Dave

Jeff G. wrote:

Thanks everyone for your help.

Sorry for all the typos. To place an ending on this I did indeed move
all the files out except fields, segments, and _lwa.cfs and it -seems-
to be working.

If I have a moment to look at where I may have created a problem closing
files, I will. And I can easily try recreating the index on Ubuntu as it
involves no paging (which I only deal with when interacting with end
users).

Thanks Dave and Jens for the advice and counsel.