Problem with large index file


#1

Hello,

Ferret created an index file of more than 4.5GB:
$ 4534029210 2007-02-26 12:46 _el.cfs

The creation of the index went smoothly. Searching through this index
also works fine. However whenever I try to get the contents of an
indexed document I get an error when the document number is above
621108:

irb(main):080:0> searcher[621108].load
IOError: IO Error occured at <except.c>:79 in xraise
Error occured in fs_store.c:289 - fsi_seek_i
seeking pos -1206037603:

As you can see it is seeking on a negative position. I did a strace on
this with the following results:

_llseek(3, 18446744072766697140, 0xbfc555e0, SEEK_SET) = -1 EINVAL
(Invalid argument)
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
write(2, "./service.cgi:40:in `[]'", 24./service.cgi:40:in `[]') = 24
write(2, ": ", 2: ) = 2
write(2, "IO Error occured at <except.c>:7"..., 43IO Error occured at
<except.c>:79 in xraise) = 43
write(2, " (", 2 () = 2
write(2, "IOError", 7IOError) = 7
write(2, ")\n", 2)
) = 2
write(2, "Error occured in fs_store.c:289 "..., 90Error occured in
fs_store.c:289 - fsi_seek_i
seeking pos -942854476:

The lseek() on 18446744072766697140 is over the maximum of long. That’s
why lseek is probably giving this error.
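
The negative positions in the error messages are consistent with a file
offset wrapping around in a signed 32-bit integer, although nothing in
the trace confirms that this is what happens inside Ferret. A minimal
sketch of the arithmetic, assuming a 32-bit wrap (the helper name
unwrap_int32 is made up for illustration):

```ruby
INT32_MAX = 2**31 - 1
FILE_SIZE = 4_534_029_210 # size of _el.cfs from the listing above

# Assumption: the negative position is a real offset that wrapped
# around once in a signed 32-bit integer.
def unwrap_int32(negative_pos)
  negative_pos + 2**32
end

[-1_206_037_603, -942_854_476].each do |pos|
  offset = unwrap_int32(pos)
  puts "#{pos} -> #{offset}"
  puts "  beyond INT32_MAX: #{offset > INT32_MAX}"
  puts "  inside the file:  #{offset < FILE_SIZE}"
end
```

Both recovered offsets (3088929693 and 3352112820) lie above 2**31 - 1
and below the size of _el.cfs. That fits a 32-bit overflow, but it is
only a plausibility check.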

How can I fix this?


#2

On 2/26/07, Jeffrey G. removed_email_address@domain.invalid wrote:

> irb(main):080:0> searcher[621108].load
> write(2, "./service.cgi:40:in `[]'", 24./service.cgi:40:in `[]') = 24
>
> The lseek() on 18446744072766697140 is over the maximum of long. That’s
> why lseek is probably giving this error.
>
> How can I fix this?

Actually 18446744072766697140 is too big for even a 64-bit long (or a
long long on 32-bit systems), so I’d love to know where that number is
coming from. It is obviously a bug somewhere else. Unfortunately it
would be impractical for you to send me the index. If it is possible
to give me access to your server, I should be able to sort this out.
Otherwise, I’ll look into it, but I can’t promise anything.
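
For what it is worth, the strace argument can be checked with plain
integer arithmetic. The sketch below assumes the value printed by
strace is an unsigned 64-bit bit pattern and reinterprets it as two's
complement; to_signed64 is a hypothetical helper, not part of Ferret:

```ruby
STRACE_ARG = 18_446_744_072_766_697_140 # offset argument shown by strace
UINT64_MAX = 2**64 - 1
INT64_MAX  = 2**63 - 1

# Interpret an unsigned 64-bit bit pattern as a signed value.
def to_signed64(value)
  value > INT64_MAX ? value - 2**64 : value
end

puts STRACE_ARG <= UINT64_MAX # true: it fits in 64 unsigned bits
puts STRACE_ARG > INT64_MAX   # true: it does not fit in a signed 64-bit long
puts to_signed64(STRACE_ARG)  # -942854476
```

The result equals the position reported by fsi_seek_i in the same
trace, which suggests a negative offset that was sign-extended before
it reached _llseek.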

Dave


#3

David B. wrote:

> Actually 18446744072766697140 is too big for even a 64-bit long (or a
> long long on 32-bit systems), so I’d love to know where that number is
> coming from. It is obviously a bug somewhere else. Unfortunately it
> would be impractical for you to send me the index. If it is possible
> to give me access to your server, I should be able to sort this out.
> Otherwise, I’ll look into it, but I can’t promise anything.
>
> Dave

I can’t give access to the server as it’s a company server, sorry.
Is there a possibility that the index somehow got corrupted? At the
moment I am recreating the index, which takes several days. I’ll report
on the findings when it’s done.


#4

On 2/27/07, Jeffrey G. removed_email_address@domain.invalid wrote:

> I can’t give access to the server as it’s a company server, sorry.
> Is there a possibility that the index somehow got corrupted? At the
> moment I am recreating the index, which takes several days. I’ll report
> on the findings when it’s done.

It could be a corrupt index, but I doubt it. I think it is more likely
a bug somewhere else, although I have built indexes of this size before
without problems. If you could give me an idea of what type of data you
are putting in the index, I could try to rebuild a similar index here
to diagnose the problem: how many documents, how many fields, what the
field settings are (e.g. stored, untokenized, term_vectors), how large
the fields are on average, what sort of data they hold (e.g. numbers,
dates, English-language text, code), and which analyzer you are using.
This should give me enough information to build a very similar index
here and hopefully reproduce the problem.

Cheers,
Dave

PS: send it to me privately if you prefer


#5

I recreated the index with the option :max_merge_docs => 100000, and it
seems to work great.
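
A sketch of how this option might be passed when opening the index. It
assumes that Ferret::Index::Index.new forwards :max_merge_docs to the
underlying index writer; the path and the open_index helper are
placeholders, not taken from the actual setup:

```ruby
# Options for the rebuilt index. Only :max_merge_docs comes from the
# post; the path is a hypothetical placeholder.
INDEX_OPTIONS = {
  :path           => '/tmp/example_index',
  :max_merge_docs => 100_000 # upper bound on documents merged into one segment
}

# Hypothetical helper: opens the index with the options above.
# Requires the ferret gem, so it is defined here but not called.
def open_index(options = INDEX_OPTIONS)
  require 'ferret'
  Ferret::Index::Index.new(options)
end

# Usage, with ferret installed:
#   index = open_index
#   index << { :title => 'example document' }
#   index.close
```

Limiting the number of documents per merged segment keeps the
individual segment files smaller, which may be why the error no longer
shows up. That explanation is a guess.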


#6

Hi Jeffrey,

That’s great to hear. If you have a chance, could you try copying the
index (cp -r), then opening the copy and optimizing it? Then let me
know whether you still get the same problem as before. I understand if
this is too much trouble; 5GB is a lot of data to be playing around
with.
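
The copy-and-optimize test could be scripted roughly as follows. This
assumes Ferret::Index::Index#optimize and #close are available in the
installed Ferret version; both helper names and all paths are
hypothetical:

```ruby
require 'fileutils'

# Copy the index directory so the working index stays untouched
# (the Ruby equivalent of cp -r).
def copy_index(src, dest)
  FileUtils.cp_r(src, dest)
  dest
end

# Open the copied index and merge its segments into one.
# Requires the ferret gem, so it is defined here but not called.
def optimize_copy(path)
  require 'ferret'
  index = Ferret::Index::Index.new(:path => path)
  index.optimize
  index.close
end

# Usage, with ferret installed and placeholder paths:
#   optimize_copy(copy_index('/data/index', '/data/index_copy'))
```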

Cheers,
Dave


#7

After optimization the exact same problem occurs.

Greetings,
Jeffrey G.

David B. wrote:

> Hi Jeffrey,
>
> That’s great to hear. If you have a chance, could you try copying the
> index (cp -r), then opening the copy and optimizing it? Then let me
> know whether you still get the same problem as before. I understand if
> this is too much trouble; 5GB is a lot of data to be playing around
> with.
>
> Cheers,
> Dave


#8

On 3/6/07, Jeffrey G. removed_email_address@domain.invalid wrote:

> After optimization the exact same problem occurs.

Thanks Jeffrey, I’ll keep looking into this. I’m glad your index works
for the moment though.

Cheers,
Dave