Adding extra fields to an index (using RDig?)


#1

Hello everyone,

I am writing an application which collects a set of web sites and caches
them locally for offline viewing. I want to do searches on this
collection and associate extra data with each result (e.g. date
collected, reason for collection, perhaps a sequence number).

All this data exists once the harvesting is done and could be stored
in a database. I want to use RDig to index my collection of sites, and I
also want to associate the index results with my extra data and display
them along with the search results.

The index is built once and searched many times so I want searching to
be as efficient as possible.

The simplest way is to use, e.g., the local URL as a key into my database
(easy, but the lookup has to be done on every search and could slow
things down).
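
For illustration, that lookup could be sketched roughly as below. This is
only a sketch under assumptions: the METADATA table, the annotate helper
and the field names are made up, and a plain Hash stands in for the
database.

```ruby
# Hypothetical metadata table keyed by local URL. In a real application
# this would be a database query rather than an in-memory Hash.
METADATA = {
  "file:///cache/example.org/index.html" => {
    :collected_on => "2007-02-10",
    :reason       => "reference",
    :sequence     => 1
  }
}

# Merge the stored metadata into each search hit, using the hit's URL as
# the key. Hits without a metadata entry are returned unchanged.
def annotate(hits, metadata)
  hits.map { |hit| hit.merge(metadata.fetch(hit[:url], {})) }
end

# Stand-in for the results that would come back from the index.
hits = [
  { :url => "file:///cache/example.org/index.html", :title => "Example" },
  { :url => "file:///cache/example.org/other.html", :title => "Other" }
]

annotated = annotate(hits, METADATA)
annotated.each { |hit| puts hit.inspect }
```

The cost I am worried about is the extra lookup per hit on every search.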

Is it possible to add extra fields to Ferret index entries?

If so, can this be done at create time or must it be done afterwards? If
it can be done at create time is there a way to get RDig to insert these
extra fields?

Thanks for any help with this.

Ed


#2

Hi!

On Sat, Feb 10, 2007 at 12:29:27PM +0100, Ed Ed wrote:

> along with search results.
>
> The index is built once and searched many times so I want searching to
> be as efficient as possible.
>
> The simplest way is to use e.g. the local URL as a key into my database
> (easy but needs to be done each time and could slow things down)
>
> Is it possible to add extra fields to ferret index entries?

Of course that is possible; RDig itself uses three different fields:
:url, :title and :data.

> If so, can this be done at create time or must it be done afterwards? If
> it can be done at create time is there a way to get RDig to insert these
> extra fields?

Ferret documents cannot be modified after they have been created, so any
custom fields you want to add have to be added when the index is
created.
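
As a rough sketch of what that means in practice: the complete document,
custom fields included, has to be assembled before it goes into the
index. The build_document helper, the sample page and the extra field
names below are hypothetical; only the three field names :url, :title
and :data come from RDig.

```ruby
# The three fields RDig stores.
RDIG_FIELDS = [:url, :title, :data]

# Assemble the whole document up front, custom fields included, since
# nothing can be added to it once it has been indexed.
def build_document(page, extra = {})
  doc = {}
  RDIG_FIELDS.each { |field| doc[field] = page[field] }
  doc.merge(extra)
end

# Hypothetical crawled page.
page = {
  :url   => "file:///cache/example.org/index.html",
  :title => "Example",
  :data  => "page text goes here"
}

doc = build_document(page, :collected_on => "2007-02-10", :reason => "reference")

# With Ferret loaded, a Hash like this can be added to an index with
# `index << doc`; that call is left out here so the sketch runs without it.
puts doc.inspect
```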

At the moment RDig doesn’t support custom fields; however, I’d be happy
to apply a patch adding this capability ;-)

cheers,
Jens


webit! Gesellschaft für neue Medien mbH www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer removed_email_address@domain.invalid
Schnorrstraße 76 Tel +49 351 46766 0
D-01069 Dresden Fax +49 351 46766 66


#3

Hi,

To summarise, I can add custom fields at create time but not afterwards.
Furthermore RDig does not presently support the addition of custom
fields.

Could you please post your patch to enable RDig to support custom
fields?

Thanks

Ed



#4

On Sun, Feb 11, 2007 at 07:17:51PM +0100, Ed Ed wrote:

> Hi,
>
> To summarise, I can add custom fields at create time but not afterwards.
> Furthermore RDig does not presently support the addition of custom
> fields.

Right.

> Please could you post your patch to enable RDig to support custom
> fields.

Oh, what I meant was that if you built such a feature into RDig, I’d be
happy to integrate it. Sorry if I was unclear.

Jens



#5

On Mon, Feb 12, 2007 at 12:55:54PM +0100, Ed Ed wrote:
[…]

> OK, I’ll have a look at the code and see what might be simplest. Seems
> to me that adding an extra optional directive to the configuration file
> is easiest. This could name a file containing a user-supplied hook which
> rdig/indexer.rb could try to include. Or just define the hook procedure
> in the config file?

Defining the hook method in the config sounds good.

> Then if the hook procedure existed the indexer could pass it the
> document and doc data structure and the hook procedure could augment the
> doc structure as required.

Exactly.

> I guess the only Ferret requirement here is that the hook must add the
> same set of extra fields to each document (even if values NULL)

Not even that: different Ferret documents in the same index can have
different sets of fields.
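
A minimal sketch of how such a hook might look, just to make the idea
concrete. Nothing here is RDig's actual API: HookConfig, on_document and
document_for are hypothetical names, and plain Hashes stand in for the
crawled page and the Ferret document.

```ruby
# Hypothetical stand-in for RDig's configuration object.
class HookConfig
  attr_reader :document_hook

  # Register a block that receives the crawled page and the document Hash.
  def on_document(&block)
    @document_hook = block
  end
end

# Hypothetical indexer step: build the standard fields, then let the hook
# (if one was configured) augment the document before it is indexed.
def document_for(page, config)
  doc = { :url => page[:url], :title => page[:title], :data => page[:body] }
  config.document_hook.call(page, doc) if config.document_hook
  doc
end

config = HookConfig.new
config.on_document do |page, doc|
  doc[:collected_on] = "2007-02-10"
  # Only some pages get a :reason field, so the field sets differ
  # from one document to the next.
  doc[:reason] = "reference" if page[:url].include?("docs")
end

pages = [
  { :url => "http://example.org/docs/a.html", :title => "A", :body => "text a" },
  { :url => "http://example.org/news/b.html", :title => "B", :body => "text b" }
]

docs = pages.map { |page| document_for(page, config) }
docs.each { |doc| puts doc.keys.inspect }
```

Without a configured hook, document_for returns just the three standard
fields, so existing configurations would keep working unchanged.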

Jens




#6

Jens K. wrote:

> oh, what I wanted to say is that if you built such a feature into
> RDig, I’d be happy to integrate it. Sorry if I’ve been unclear here.

:-(

OK, I’ll have a look at the code and see what might be simplest. Seems
to me that adding an extra optional directive to the configuration file
is easiest. This could name a file containing a user-supplied hook which
rdig/indexer.rb could try to include. Or just define the hook procedure
in the config file?

Then, if the hook procedure exists, the indexer could pass it the
document and the doc data structure, and the hook procedure could
augment the doc structure as required.

I guess the only Ferret requirement here is that the hook must add the
same set of extra fields to each document (even if the values are NULL).

Ed