Ruby Forum Ferret > Indexing an XML/HTML File

Posted by S D (Guest)
on 12.04.2008 07:04
(Received via mailing list)
I'm planning on indexing XML/HTML files. I only want to index the text
contained in the files, not the elements or tags. I just finished
reading Chapter 6 of "Ferret" (Balmain/O'Reilly), which presents a
solution for this issue. The essence of the solution is to parse the
XML/HTML and extract the text content using a parser such as Hpricot.
My concern is that this approach will not support highlighting of the
results (correct me if I'm wrong here), since the corresponding indexed
field will only contain text, without the elements and tags that are
necessary to indicate the position of the text in the original file.

Question: wouldn't a better approach be to implement a tokenizer that
ignores XML/HTML tags and preserves the positions of the indexed items
in the original markup? If this is indeed an ideal approach, does such
a solution exist or, alternatively, how can I contribute when I
implement it?
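
To make the tokenizer idea concrete, here is a minimal sketch in plain
Ruby. It is not Ferret's API: MarkupToken and tokenize_markup are
hypothetical names used only for illustration. It also assumes that
tags are well-formed "<...>" runs and does not handle entities,
comments, or CDATA sections. The point is only that each token keeps
its character offsets into the original markup, so a highlighter could
mark up the original file.

```ruby
require 'strscan'

# Hypothetical token type: the word plus its character offsets
# (start inclusive, end exclusive) in the ORIGINAL markup string.
MarkupToken = Struct.new(:text, :start_offset, :end_offset)

# Returns word tokens from +markup+, skipping everything inside <...>.
# Assumption: no entity decoding, no comment/CDATA handling.
def tokenize_markup(markup)
  tokens  = []
  scanner = StringScanner.new(markup)

  until scanner.eos?
    if scanner.skip(/<[^>]*>/)
      # A tag was consumed; it produces no token.
      next
    elsif (word = scanner.scan(/\w+/))
      finish = scanner.charpos
      tokens << MarkupToken.new(word, finish - word.length, finish)
    else
      # Whitespace, punctuation, or a stray '<': skip one character.
      scanner.getch
    end
  end

  tokens
end

markup = "<p>Hello <b>world</b></p>"
tokenize_markup(markup).each do |t|
  # The offsets index back into the untouched markup.
  puts "#{t.text}: #{markup[t.start_offset...t.end_offset]}"
end
```

Whether a real analyzer could hand these offsets to the index is a
separate question; the sketch only shows that skipping tags and keeping
original positions are compatible.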

Regards,
John
aka sd.codewarrior