Hi,

I have some XML that represents a document. I parse the XML and place specific parts (like the title) into the appropriate fields in my document. The XML contains the normal document elements like a title, body etc. It also contains illustrations, of which there may be 0 or many for a given document. Each illustration also has a title and caption text.

I'm struggling to figure out how to index this data, since there are many documents in my XML dataset and each document may have an arbitrary number of illustrations. Therefore, I can't just add several fields to my index like illustration1, illustration2, etc.

Instead, the only way I can think to do it is to grab all of the illustration / caption text for a given document and glob it together into one field, :illustration.

This will work fine: searches will match terms in that field. The problem comes when I want to distinguish which illustration the term belonged to.

I have been poring over the Ferret API docs to find a way to get the positions and offsets for matches from a search, thinking I could then figure out where in my text field the match occurred and, consequently, which illustration it belonged to. However, I cannot see how to do this, as all of the find / search methods return either a document id or (in AAF) the model. I don't see a way to retrieve the list of terms that the search matches and then those terms' corresponding positions in a given field.

Am I missing something? Is this a horrible way to solve my problem? Does anyone know how to retrieve the list of terms a given query matches? I think from there I can use termEnum to get the positions in the field.

Thanks, and thanks to everyone who's helped with Ferret / Acts_as_ferret

-km
on 2008-10-22 00:56
on 2008-10-28 20:03
Hi,

first of all, please don't use the web forum to ask questions, but use the mailing list (ferret-talk@rubyforge.org). Unfortunately it seems that not every message posted here makes it to the mailing list, and I don't check the forum here very often... The other way around (messages posted via email) works reliably, so in the end you'll reach more people...

Karl Meisterheim wrote:
> Hi,
>
> I have some xml that represents a document. I parse the xml and place
> specific parts (like the title) into the appropriate fields in my
> document. The xml contains the normal document elements like a title,
> body etc. It also contains illustrations, of which there may be 0 or
> many for a given document. Each illustration also has a title and
> caption text.
>
> I'm struggling to figure out how to index this data, since there are
> many documents in my xml dataset and each document may have a random
> number of illustrations. Therefore, I can't just add several fields to
> my index like illustration1, illustration2, etc.
>
> Instead, the only way I can think to do it is grab all of the
> illustration / caption text for a given document and glob it together
> into one field, :illustration.
>
> This will work fine, searches will match terms in that field. The
> problem comes when wanting to distinguish which illustration the term
> belonged to.

The answer is simple: whatever is the smallest unit you want to get as a search result is what you have to index. So if you want to find out which illustration a query matches, you'll have to index each illustration as a separate document (in the Ferret sense of the word). You should then index the document's id along with each illustration, and maybe even shared information like the document title.

Or build a separate index for global document data to avoid that redundancy. However, you would then have to run each query twice: against the document index, and against the illustrations index. It's a trade-off between indexing speed (2 indexes and therefore no indexing of redundant information means faster indexing) and search speed (searching once vs. searching twice for each user query)...

Does that sound like it might work?

Cheers,
Jens
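
To make the "one index entry per illustration" idea concrete, here is a minimal plain-Ruby sketch. It does not call Ferret at all: the field names (:doc_id, :doc_title, :position), the helper names illustration_records and matching_records, and the sample data are all made up for illustration, and the substring matcher is only a toy stand-in for a real index query. With Ferret, each of these hashes would presumably be added to the illustrations index as its own document.

```ruby
# Hypothetical parsed document; the field names are assumptions for illustration.
doc = {
  :id    => 42,
  :title => "Bridge design",
  :body  => "Text about suspension bridges.",
  :illustrations => [
    { :title => "Figure 1", :caption => "Cable anchor detail" },
    { :title => "Figure 2", :caption => "Tower cross section" }
  ]
}

# Build one index record per illustration. Each record carries the parent
# document's id (and a redundant copy of its title) so that a hit can be
# traced back to both the illustration and the document it belongs to.
def illustration_records(doc)
  doc[:illustrations].each_with_index.map do |ill, i|
    {
      :doc_id    => doc[:id],
      :doc_title => doc[:title],
      :position  => i,
      :title     => ill[:title],
      :caption   => ill[:caption]
    }
  end
end

# Toy stand-in for an index query: case-insensitive substring match
# against each illustration's title and caption.
def matching_records(records, term)
  needle = term.downcase
  records.select do |r|
    [r[:title], r[:caption]].any? { |text| text.downcase.include?(needle) }
  end
end

records = illustration_records(doc)
matching_records(records, "tower").each do |r|
  puts "doc #{r[:doc_id]}, illustration #{r[:position]}: #{r[:title]}"
end
```

Because every hit is an illustration record, the question "which illustration matched?" is answered by the hit itself, with no need to look at term positions. Dropping :doc_title from the records and keeping document-level data in a second index would correspond to the two-index variant described above.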