How can I do my own search limits?

I’m trying to add a way to query across associations for a model in
acts_as_ferret. Say I have a model A that has a relationship with
model B; for example, a Book has many Pages. I want to search across the
pages of the Book and produce a list of unique books whose pages match
the terms. So if a page hits, I add its book to my list of results.
Right now multi_search returns all pages and books that match the query.

This gets difficult with pagination because you can’t just keep
track of it yourself. Also, the total hits will include all hits on
pages and books. The way pagination works today with ferret is that you
hand it :offset and :limit params, but these are fixed-width params.
I could end up with hundreds of pages that all belong to the same book,
so I have to skip all of those.

This seems like a different kind of search. Not a multi_search
or find_by_contents, but a find_by_association, where a hit on the
association returns an object of the associated type.

Is there something in ferret that allows me to scroll through the
results one by one and stop when I’ve reached my limit?

Charlie

On 10/15/06, Charlie H. [email protected] wrote:

> This gets difficult with pagination because you can’t just keep
> track of it yourself. Also, the total hits will include all hits on
> pages and books. The way pagination works today with ferret is that you
> hand it :offset and :limit params, but these are fixed-width params.
> I could end up with hundreds of pages that all belong to the same book,
> so I have to skip all of those.
>
> This seems like a different kind of search. Not a multi_search
> or find_by_contents, but a find_by_association, where a hit on the
> association returns an object of the associated type.

If I manage to implement the Ferret object database[1] this will be
simple. Currently, though, there are two ways to do this. You can index
all of the Page data in the Book document, presumably in a :page
field. Or you can store the Book ids in the Pages and build a set of
Book ids by scanning through all matching pages.

[1] Ferret is now accepting donations - Ferret - Ruby-Forum

> Is there something in ferret that allows me to scroll through the
> results one by one and stop when I’ve reached my limit?

Sure. Set :limit => :all and call search_each, then break when you
reach your limit.
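Roughly, that loop looks like this; the ferret calls are stood in for by a plain array of hypothetical [book_id, score] page hits (how you map a page doc back to its book id depends on your index), so only the dedup-and-break logic is shown:

```ruby
# Collect up to `limit` unique book ids from a stream of page hits.
# With ferret you would drive this from
#   index.search_each(query, :limit => :all) { |doc_id, score| ... }
# here `page_hits` is a plain array standing in for that stream.
def unique_book_ids(page_hits, limit)
  seen  = {}
  books = []
  page_hits.each do |book_id, _score|
    next if seen[book_id]          # skip pages of a book we already have
    seen[book_id] = true
    books << book_id
    break if books.size >= limit   # stop once we have enough unique books
  end
  books
end

hits = [[1, 0.9], [1, 0.8], [2, 0.7], [3, 0.5], [2, 0.4]]
p unique_book_ids(hits, 2)   # => [1, 2]
```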

David B. wrote:

> If I manage to implement the Ferret object database[1] this will be
> simple. Currently, though, there are two ways to do this. You can index
> all of the Page data in the Book document, presumably in a :page
> field. Or you can store the Book ids in the Pages and build a set of
> Book ids by scanning through all matching pages.
>
> [1] Ferret is now accepting donations - Ferret - Ruby-Forum

The first option has problems because a book’s content will be too large
for a single field; it would overrun ferret’s maximum field length.
I’m pretty much doing the second option now, but its drawback is that
pagination gets tough. I’m not sure how having the ferret object
database would actually solve this problem. How would your
queries express what the user intends? How would it know I want to
include all the Page objects as part of a search on Books? It seems like
you’d have to specify that sort of thing as options to the search, like
we have to specify eager loading with the :include option to find.

>> Is there something in ferret that allows me to scroll through the
>> results one by one and stop when I’ve reached my limit?
>
> Sure. Set :limit => :all and call search_each, then break when you
> reach your limit.

That will work for creating a list of Books and ensuring I show, say, 10
unique books per page. But I won’t be able to tell what the total
number of hits was. Any ideas?

It also gets hard to do pagination because you can’t compute where the
next window starts and ends. How do you know what the offset
parameter is for the previous pages, or what the offset for the 9th page
is?

Charlie

David B. wrote:

> Well the user would just type their query as usual, but you’d write the
> query something like:
>
> Books.find("pages match '#{query}'", :limit => 10)
>
> Or something like that. I haven’t worked out the details yet. And you
> would be able to specify whether you wanted lazy or eager loading too.

That’s what I guessed you’d have to do: change the query language to
support this concept. I was actually working on adding a new method to
acts_as_ferret where you could pass these association matches in like:

Book.find_by_association(query, [:pages], { :limit => 20 })

I can’t change the query language, but I can express the same sort of
behavior this way. This would result in a multi_index query across the
Book and Page indexes. But tracking total_hits and paging just don’t
work with this approach; the only option you have is to iterate over all
the matches.

When we do ferret queries, does ferret actually go over the entire search
space to calculate all the possible documents that match the query, and
then just return the ones within the offset and limit?

If that’s the case then it’s doable to create this type of search, but
it would make more sense to modify ferret to support this type of query.

I’m interested in your database approach. It could help simplify this
problem. It seems doable to add this to acts_as_ferret without needing
a separate project. Not to mention it’s really needed in Rails apps as
well.

Charlie

On 10/16/06, Charlie H. [email protected] wrote:

> That’s what I guessed you’d have to do: change the query language to
> support this concept.
>
> When we do ferret queries, does ferret actually go over the entire search
> space to calculate all the possible documents that match the query, and
> then just return the ones within the offset and limit?

Yes, that’s exactly how it works.

> If that’s the case then it’s doable to create this type of search, but
> it would make more sense to modify ferret to support this type of query.

I don’t see a way to add this feature cleanly. It is just as easy for
you to iterate through all the results yourself. Besides, you still
haven’t explained why you can’t add all Pages to each Book document.
As I said, the field length limit isn’t an issue. This would be the
best way to solve this problem.

> I’m interested in your database approach. It could help simplify this
> problem. It seems doable to add this to acts_as_ferret without needing
> a separate project. Not to mention it’s really needed in Rails apps as
> well.

In my suggested database approach the search would be the equivalent
of a simple SQL join query. By adding a feature like this to
acts_as_ferret you’d need to pull all the matching page ids out of
the index and perform a much slower SQL query for all books that
include those page ids. I’m not sure it is feasible, but I’ll leave
that decision to the acts_as_ferret developers. The best solution is
definitely to index all the pages with the book document, even if it
means indexing each page twice.
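Concretely, the acts_as_ferret route ends up doing something like this (the ids, table, and column names here are assumptions, and the ActiveRecord line is a hypothetical Rails 1.x-era sketch):

```ruby
# Once the matching Page docs have been scanned, the book lookup is a
# single IN query over the collected book ids.
book_ids = [3, 7, 42]   # hypothetical ids pulled from matching Pages
sql = "SELECT * FROM books WHERE id IN (#{book_ids.join(', ')})"
puts sql   # => SELECT * FROM books WHERE id IN (3, 7, 42)
# With ActiveRecord this would be roughly:
#   Book.find(:all, :conditions => ["books.id IN (?)", book_ids])
```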

Cheers,
Dave

On 10/15/06, Charlie H. [email protected] wrote:

> The first option has problems because a book’s content will be too large
> for a single field; it would overrun ferret’s maximum field length.

Then change the maximum field length. IndexWriter has a
:max_field_length parameter.
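A sketch of what that might look like; the constant below is a plain-Ruby stand-in for ferret’s actual maximum, and the :path is made up, so treat the exact wiring as an assumption:

```ruby
FIX_INT_MAX = 2**31 - 1   # stand-in for Ferret::FIX_INT_MAX

# Options you would pass when creating the index; with the ferret gem
# the last step would be roughly:
#   index = Ferret::Index::Index.new(index_options)
index_options = {
  :path             => '/tmp/book_index',  # hypothetical index location
  :max_field_length => FIX_INT_MAX         # effectively no length cap
}
p index_options[:max_field_length]   # => 2147483647
```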

> I’m pretty much doing the second option now, but its drawback is that
> pagination gets tough. I’m not sure how having the ferret object
> database would actually solve this problem. How would your
> queries express what the user intends? How would it know I want to
> include all the Page objects as part of a search on Books? It seems like
> you’d have to specify that sort of thing as options to the search, like
> we have to specify eager loading with the :include option to find.

Well the user would just type their query as usual, but you’d write the
query something like:

Books.find("pages match '#{query}'", :limit => 10)

Or something like that. I haven’t worked out the details yet. And you
would be able to specify whether you wanted lazy or eager loading too.

> It also gets hard to do pagination because you can’t compute where the
> next window starts and ends. How do you know what the offset
> parameter is for the previous pages, or what the offset for the 9th page
> is?

Scroll through all matches or use option 1.

Cheers,
Dave

On 10/16/06, Jens K. [email protected] wrote:

> An interesting question around this, imho, is how much the size of the
> value for that pages field (containing all pages of a book) would really
> influence the total index size (when not storing the contents and not
> storing term vectors). I.e., will the index size grow in a linear way, or
> will it grow more slowly over time, since with a bigger field value more
> terms occur more than once?
>
> Jens

That is an interesting question. I haven’t done any tests to back this
up, but I would guess you are correct. Indexing the content as a single
field in Book will take up a lot less space than it does separated
into multiple documents as pages. So indexing the field twice as I
suggested shouldn’t double the size of your index. In fact, if you
give the fields the same name (i.e. :content for both Page and Book)
then the increase in index size will be negligible. There will, however,
be a noticeable difference in indexing time, but again, it shouldn’t be
double. As far as search goes, this solution will probably be orders of
magnitude better.
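A toy illustration of that guess: repeat a small vocabulary to simulate a long field, and the unique-term count stops growing long before the token count does (the word list is obviously made up):

```ruby
# Repeating a small vocabulary simulates a long field: the token count
# grows linearly, but the term dictionary stops growing once every
# distinct word has been seen at least once.
words = %w[the cat sat on the mat and the dog sat on the rug]
text  = words * 50                 # 650 tokens in total
half  = text.first(text.size / 2)  # a field half as long

p text.size        # => 650
p text.uniq.size   # => 8 unique terms in the full text
p half.uniq.size   # => 8 -- the half-sized field has the same vocabulary
```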

Dave

Hi!

On Mon, Oct 16, 2006 at 04:23:28PM +0900, David B. wrote:

> On 10/16/06, Charlie H. [email protected] wrote:
> […]
> include those page ids. I’m not sure it is feasible, but I’ll leave
> that decision to the acts_as_ferret developers. The best solution is
> definitely to index all the pages with the book document, even if it
> means indexing each page twice.

I’d suggest going that route, too.

An interesting question around this, imho, is how much the size of the
value for that pages field (containing all pages of a book) would really
influence the total index size (when not storing the contents and not
storing term vectors). I.e., will the index size grow in a linear way, or
will it grow more slowly over time, since with a bigger field value more
terms occur more than once?

Jens


webit! Gesellschaft für neue Medien mbH www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer [email protected]
Schnorrstraße 76 Tel +49 351 46766 0
D-01069 Dresden Fax +49 351 46766 66

On 10/16/06, Charlie H. [email protected] wrote:

> There is no reason why I couldn’t. I was just trying to figure out a
> way to avoid it. The big drawback to indexing all the pages into a
> single field in Book is that I’d have to pick a maximum size for the
> field up front. I don’t have a lot of data yet, but I tried running
> some tests: a 94-chapter book came out somewhere around 100,000, and
> that’s a smaller book. It’s just something you have to watch closely,
> which I was trying to avoid is all. Right now you’re right that the
> best approach is to store it twice.

Set it to Ferret::FIX_INT_MAX. This is the largest number that you can
set any of the properties to, and it effectively sets no limit on the
field length. I’ll add :all as an option at some point.

> I was thinking it would be more like a SQL union. In other words the
> query didn’t have to match the Book document in order to be included;
> it just had to match the Page object. For example, say I have a book
> titled Lucene in Action; you’d expect a query for “java” to pull that
> one back, since Java is probably mentioned in the text of that book.
> I sort of saw it as a multi_index query, since aaf maps the objects
> that way, where you’d first query the Book documents, then the Page
> documents. Instead of adding those Page documents to the resulting
> array, they would only add a new entry if that Book wasn’t already
> there. I suppose I could do that in Ruby, but it just seems like it
> might be more optimized if ferret understood this type of relationship,
> since it is iterating over all of this already.

Trust me, Ferret is complex enough as it is without having to
understand relationships between different documents. I need to draw
the line somewhere. If I want to add features like this I need to
design Ferret from the ground up to be more like a database, which is
exactly what I intend to do with the Ferret object database. I hope
that makes sense.

Dave

David B. wrote:

>> If that’s the case then it’s doable to create this type of search, but
>> it would make more sense to modify ferret to support this type of query.
>
> I don’t see a way to add this feature cleanly. It is just as easy for
> you to iterate through all the results yourself. Besides, you still
> haven’t explained why you can’t add all Pages to each Book document.
> As I said, the field length limit isn’t an issue. This would be the
> best way to solve this problem.

There is no reason why I couldn’t. I was just trying to figure out a
way to avoid it. The big drawback to indexing all the pages into a
single field in Book is that I’d have to pick a maximum size for the
field up front. I don’t have a lot of data yet, but I tried running
some tests: a 94-chapter book came out somewhere around 100,000, and
that’s a smaller book. It’s just something you have to watch closely,
which I was trying to avoid is all. Right now you’re right that the
best approach is to store it twice.

> In my suggested database approach the search would be the equivalent
> of a simple SQL join query. By adding a feature like this to
> acts_as_ferret you’d need to pull all the matching page ids out of
> the index and perform a much slower SQL query for all books that
> include those page ids. I’m not sure it is feasible, but I’ll leave
> that decision to the acts_as_ferret developers. The best solution is
> definitely to index all the pages with the book document, even if it
> means indexing each page twice.

I was thinking it would be more like a SQL union. In other words the
query didn’t have to match the Book document in order to be included;
it just had to match the Page object. For example, say I have a book
titled Lucene in Action; you’d expect a query for “java” to pull that
one back, since Java is probably mentioned in the text of that book.
I sort of saw it as a multi_index query, since aaf maps the objects
that way, where you’d first query the Book documents, then the Page
documents. Instead of adding those Page documents to the resulting
array, they would only add a new entry if that Book wasn’t already
there. I suppose I could do that in Ruby, but it just seems like it
might be more optimized if ferret understood this type of relationship,
since it is iterating over all of this already.
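In plain Ruby, the union step I’m describing would be something like this; book_hits and page_hits are stand-ins for the results of the two index queries (with each page hit carrying its book’s id), so only the merge itself is real here:

```ruby
require 'set'

# Union the direct Book hits with Books reached via Page hits, adding a
# book from a page hit only when it isn't already in the result list.
def union_results(book_hits, page_hits)
  results = book_hits.dup
  seen    = Set.new(book_hits)
  page_hits.each do |page|
    book_id = page[:book_id]
    next if seen.include?(book_id)   # this Book is already a result
    seen << book_id
    results << book_id
  end
  results
end

books = [10, 20]                                  # direct Book matches
pages = [{ :book_id => 20 }, { :book_id => 30 }]  # Page matches
p union_results(books, pages)   # => [10, 20, 30]
```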