Segmentation fault on large index

I’m getting a segmentation fault on a large index (15 GB). I’m running
Ferret 0.11.4 on OpenSuSE 10.2 with Ruby 1.8.6. The segmentation fault
appeared after I optimized the index; see further below for the error
message I got before that. Ferret works perfectly on other (smaller)
indexes.

Is this a known issue, and if so, is there a workaround?

--------------------- after optimizing the index -----------------------

$ irb
irb(main):001:0> require 'rubygems'
=> true

irb(main):002:0> require 'ferret'
=> true

irb(main):003:0> index = Ferret::Index::Index.new(:path => "/tmp/myindex")
=> #<Ferret::Index::Index:0xb7b23330 @writer=nil,
@mon_entering_queue=[], @default_input_field=:id, @mon_count=0,
@qp=nil, @default_field=:*,
@options={:dir=>#<Ferret::Store::FSDirectory:0xb7b23308>,
:path=>"/tmp/myindex", :lock_retry_time=>2,
:analyzer=>#<Ferret::Analysis::StandardAnalyzer:0xb7b23268>,
:default_field=>:*}, @mon_owner=nil, @auto_flush=false, @open=true,
@dir=#<Ferret::Store::FSDirectory:0xb7b23308>, @id_field=:id,
@searcher=nil, @mon_waiting_queue=[], @reader=nil, @key=nil,
@close_dir=true>

irb(main):004:0> index.search_each("*:foo") {|id, score| doc = index[id].load; puts doc.inspect}
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.11.4/lib/ferret/index.rb:411:
[BUG] Segmentation fault
ruby 1.8.6 (2007-03-13) [i686-linux]

Aborted

---------------------- before optimizing the index ---------------------

IOError (IO Error occured at <except.c>:93 in xraise
Error occured in fs_store.c:293 - fsi_seek_i
seeking pos -1175113459:

):
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.11.4/lib/ferret/index.rb:411:in `[]'
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.11.4/lib/ferret/index.rb:411:in `[]'
/usr/local/lib/ruby/1.8/monitor.rb:238:in `synchronize'
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.11.4/lib/ferret/index.rb:403:in `[]'
/app/controllers/search_controller.rb:133:in `do_search'
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.11.4/lib/ferret/index.rb:385:in `search_each'
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.11.4/lib/ferret/index.rb:384:in `search_each'
/usr/local/lib/ruby/1.8/monitor.rb:238:in `synchronize'
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.11.4/lib/ferret/index.rb:380:in `search_each'
/app/controllers/search_controller.rb:131:in `do_search'
/app/controllers/search_controller.rb:54:in `index'
/usr/local/lib/ruby/1.8/benchmark.rb:293:in `measure'
/app/controllers/search_controller.rb:53:in `index'
/usr/local/lib/ruby/1.8/benchmark.rb:293:in `measure'
/app/controllers/search_controller.rb:19:in `index'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/base.rb:1095:in `send'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/base.rb:1095:in `perform_action_without_filters'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/filters.rb:632:in `call_filter'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/filters.rb:619:in `perform_action_without_benchmark'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/benchmarking.rb:66:in `perform_action_without_rescue'
/usr/local/lib/ruby/1.8/benchmark.rb:293:in `measure'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/benchmarking.rb:66:in `perform_action_without_rescue'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/rescue.rb:83:in `perform_action'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/base.rb:430:in `send'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/base.rb:430:in `process_without_filters'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/filters.rb:624:in `process_without_session_management_support'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/session_management.rb:114:in `process'
/usr/local/lib/ruby/gems/1.8/gems/actionpack-1.13.3/lib/action_controller/base.rb:330:in `process'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/dispatcher.rb:41:in `dispatch'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/fcgi_handler.rb:168:in `process_request'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/fcgi_handler.rb:143:in `process_each_request!'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/fcgi_handler.rb:109:in `with_signal_handler'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/fcgi_handler.rb:142:in `process_each_request!'
/usr/local/lib/ruby/gems/1.8/gems/fcgi-0.8.7/lib/fcgi.rb:612:in `each_cgi'
/usr/local/lib/ruby/gems/1.8/gems/fcgi-0.8.7/lib/fcgi.rb:609:in `each'
/usr/local/lib/ruby/gems/1.8/gems/fcgi-0.8.7/lib/fcgi.rb:609:in `each_cgi'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/fcgi_handler.rb:141:in `process_each_request!'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/fcgi_handler.rb:55:in `process!'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.3/lib/fcgi_handler.rb:25:in `process!'
/ma/www/virtual/ferret.marketaudit.no/Site/public/dispatch.fcgi:24

Hi all,

I’m looking at using Ferret for categorizing documents. Essentially
what I have are thousands of query rules: if a document matches a
rule, it belongs to the category associated with that rule. Normally
what we all do is index documents and then run a query against the
index to get back the documents that match the query.

What I want to do is the inverse. I have thousands of queries and I
want to run all of them against one document at a time. The queries
that match the document essentially categorize the document into the
associated category.

Yes, I am aware that this may not be the best way to approach a
categorization problem, but it is a part of how our current system
works, and I want to investigate ways to replace it and move on to a
better mechanism for categorization.

I’m considering using our current query language as a DSL that
generates Ruby code.

My first whack at using Ferret for this was essentially the
following:

require 'rubygems'
require 'ferret'
require 'fastercsv'

# OPTIONS and `headers` come from the surrounding script.
doc = File.read(OPTIONS.input_file)
Ferret::I.new do |index|
  index << doc
  FasterCSV.foreach(OPTIONS.category_csv, :headers => headers) do |row|
    next unless row[:boolean]
    top_docs = index.search(row[:boolean])
    if top_docs.hits.size > 0 then
      puts "Matches : #{row[:name]}"
    end
  end
end

Short and sweet, eh? Basically I’m looking for suggestions on better
ways to run thousands of Ferret queries (as FQL) against a single
document. Are there other approaches that would be better? API calls
that would do this more efficiently? A means to serialize FQL so that
it doesn’t have to be parsed each time?

Thoughts, comments, rants, raves, brainstorms?

enjoy,

-jeremy

Jeremy H. [email protected]

Hi Jeremy,

Interesting approach. You could build your Query objects once by
calling QueryParser#parse, serialize those Query objects, and reuse
them.

IMHO your problem wouldn’t be query parsing but the number of queries
you are issuing for each document. On the other hand, Ferret is quite
fast, and it may work out if your process is not that time-critical.
Have you considered combining queries? Ferret’s query language is
quite powerful, and you might bring down the number of queries by
combining the ones that feed only a single categorization anyway.
Check out the QueryParser API with this approach in mind.
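
Something like this minimal sketch of the parse-once idea (hedged: the
rules hash and the :content field are made-up stand-ins, and since
Ferret’s Query objects are C-backed I’d cache them in memory rather
than assume Marshal works on them):

require 'rubygems'
require 'ferret'

# Hypothetical rules table: category name => FQL string.
rules = { 'sports' => 'content:(football OR soccer)',
          'tech'   => 'content:(ruby AND ferret)' }

# Parse each rule once up front and keep the Query objects around,
# so per-document matching never touches the query parser again.
parser  = Ferret::QueryParser.new(:fields => [:content])
queries = rules.map { |name, fql| [name, parser.parse(fql)] }

# Per document: index it and run the pre-parsed queries against it.
index = Ferret::I.new
index << {:content => File.read('some_document.txt')}
queries.each do |name, query|
  puts "Matches : #{name}" if index.search(query).total_hits > 0
end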

At least the lines

top_docs = index.search(row[:boolean])
if top_docs.hits.size > 0 then

should read "if index.search(row[:boolean]).total_hits > 0" so that you
don’t need to read in the hits array just to get its size.

As a last tip, you might be interested in the underlying code of the
more_like_this method of aaf (acts_as_ferret) to get the most-used
terms in your documents. That might let your categorizations “learn”
as documents get categorized.
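
Not aaf’s actual code, but a rough stand-in for the "most-used terms"
idea in plain Ruby (a real version would read term frequencies out of
the index’s term vectors instead of re-tokenizing):

# Naive term-frequency ranking over a simple word tokenization.
def top_terms(text, n = 10)
  freq = Hash.new(0)
  text.downcase.scan(/\w+/) { |word| freq[word] += 1 }
  freq.sort_by { |_word, count| -count }.first(n)
end

top_terms("the quick brown fox jumped over the lazy dog")
# => [["the", 2], ["quick", 1], ...] (order of ties may vary)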

Cheers,
Jan


Jan P.
Rechtsanwalt

Grünebergstraße 38
22763 Hamburg
Tel +49 (0)40 41265809 Fax +49 (0)40 380178-73022
Mobil +49 (0)171 3516667
http://www.inviado.de

Jeremy H. wrote:

> The queries that match the document essentially categorize the
> document into the associated category.
>
> Thoughts, comments, rants, raves, brainstorms?

Random thought that might or might not work, depending on whether your
queries are simple enough and how much data you want back: just invert
the problem. Store the queries in Ferret, and treat your document as
the query. Random example:

irb(main):015:0> index = Index::Index.new
irb(main):016:0> index << "hat"
irb(main):017:0> index << "fox"
irb(main):018:0> doc = "the quick brown fox jumped over the lazy dog"
irb(main):019:0> index.search_each(doc) { |id, score| puts
index[id].load.to_yaml + score.to_s }
--- !map:Ferret::Index::LazyDoc
:id: fox
0.0425622686743736
=> 1

I’ve got absolutely no idea how well the query parser will handle larger
documents, but it’s worth a try…

On Mon, May 14, 2007 at 11:11:50AM +0100, Alex Y. wrote:

I did give some thought to this, but we have some fairly complex
categorization queries, some of which are the equivalent of
SpanTermQuery. Since there is no FQL for those types of queries yet, I
don’t think your approach will work for me. But it is a good idea.

enjoy,

-jeremy

Jeremy H. [email protected]

On Mon, May 14, 2007 at 08:00:02AM +0000, Jan P. wrote:

> Interesting approach. You could build your Query objects once by
> calling QueryParser#parse, serialize those Query objects, and reuse
> them.

Yup, that’s one item I need to look into. One of the issues is that
the query language we’re using right now has ‘NEAR’ keywords, so we’ll
need to convert those into SpanTermQuery objects. I’m thinking of
having the DSL generate Ruby code, then serializing those Query
objects, or maybe just running them as code.
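
For what it’s worth, my current thinking is that a NEAR rule would map
onto Ferret’s span API roughly like this (just a sketch; the :content
field, the terms, and the slop of 3 are made up):

include Ferret::Search::Spans

# Hypothetical translation of `foo NEAR bar`: match "foo" and "bar"
# within 3 intervening positions of each other, in either order.
near = SpanNearQuery.new(:slop => 3, :in_order => false,
                         :clauses => [SpanTermQuery.new(:content, "foo"),
                                      SpanTermQuery.new(:content, "bar")])
puts "NEAR rule matches" if index.search(near).total_hits > 0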

> IMHO your problem wouldn’t be query parsing but the number of queries
> you are issuing for each document. On the other hand, Ferret is quite
> fast, and it may work out if your process is not that time-critical.
> Have you considered combining queries? Ferret’s query language is
> quite powerful, and you might bring down the number of queries by
> combining the ones that feed only a single categorization anyway.
> Check out the QueryParser API with this approach in mind.

I will investigate the API more. Currently we don’t have multiple
queries that map to a single category; it’s a one-to-one relationship
between category and query. The speed of my initial experiments is
within our tolerances, but may not be good enough for serial
execution. Of course, since all of this runs in a single in-memory
index per document, it could be parallelized.
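
Roughly what I mean (hypothetical docs and queries; each document gets
its own throwaway RAM index, so the loop body could be farmed out to
separate worker processes):

docs.each do |doc_text|
  index = Ferret::I.new           # no :path given, so Ferret uses RAM
  index << {:content => doc_text}
  queries.each do |name, query|   # the pre-parsed Query objects
    puts "Matches : #{name}" if index.search(query).total_hits > 0
  end
end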

> At least the lines
>
> top_docs = index.search(row[:boolean])
> if top_docs.hits.size > 0 then
>
> should read "if index.search(row[:boolean]).total_hits > 0" so that
> you don’t need to read in the hits array just to get its size.

Good tip, thanks.

> As a last tip, you might be interested in the underlying code of the
> more_like_this method of aaf (acts_as_ferret) to get the most-used
> terms in your documents. That might let your categorizations “learn”
> as documents get categorized.

I will definitely check more into that. Who knows, maybe a
categorization engine based on Ferret will fall out of this :)

enjoy,

-jeremy

Jeremy H. [email protected]