Ferret with IMAP dirs

I’d like to use ferret to build an imap indexer and search utility, but
want to check first to see if anyone else is working on this and offer
my help. Anyone?

Also, if you could provide any helpful pointers on indexing directories
via Ferret, it would be very much appreciated. I’m a Lucene nuby.

Thanks!
John

John W. wrote:

I’d like to use ferret to build an imap indexer and search utility, but
want to check first to see if anyone else is working on this and offer
my help. Anyone?

This could be really challenging if you want it to work for multiple
IMAP servers. If you target a specific one, though, you might have
better luck. The biggest issue I see is the UID of messages: although
the IMAP RFC implies it always stays the same, my understanding is that
not all implementations keep it stable. Also, it may be
tough to keep track of all changes to a user’s inbox. If there’s a way
to communicate with the IMAP server via an API specific to that server,
especially if there’s a hook that can be called on updates to the
message store, that would be ideal.
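
One way to guard against the UID problem described above is to key each indexed message on the mailbox name, the mailbox's UIDVALIDITY value, and the UID together, since RFC 3501 only promises that a UID stays meaningful while UIDVALIDITY is unchanged. A minimal sketch, assuming the indexer remembers the last UIDVALIDITY it saw for each mailbox; `message_key` and `needs_reindex?` are hypothetical helper names, not part of net/imap or Ferret:

```ruby
# Build an index key that is only as stable as the server's UIDs are.
def message_key(mailbox, uidvalidity, uid)
  "#{mailbox}/#{uidvalidity}/#{uid}"
end

# True when the mailbox was never indexed, or when the server has
# changed UIDVALIDITY and the stored UIDs can no longer be trusted.
def needs_reindex?(seen, mailbox, current_uidvalidity)
  seen[mailbox] != current_uidvalidity
end

# Example values are made up for illustration.
seen = { "INBOX" => 1136990000 }
puts message_key("INBOX", 1136990000, 42)
puts needs_reindex?(seen, "INBOX", 1137000000)
```

With net/imap, the current value should be readable via `imap.status("INBOX", ["UIDVALIDITY"])["UIDVALIDITY"]`. This does not solve change tracking, it only tells you when the stored UIDs have to be thrown away.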

Good luck!

Jen

jennyw jennyw wrote:

This could be really challenging if you want it to work for multiple
IMAP servers. If you target a specific one, though, you might have
better luck. The biggest issue I see is the UID of messages: although
the IMAP RFC implies it always stays the same, my understanding is that
not all implementations keep it stable. Also, it may be
tough to keep track of all changes to a user’s inbox. If there’s a way
to communicate with the IMAP server via an API specific to that server,
especially if there’s a hook that can be called on updates to the
message store, that would be ideal.

Thanks Jen. I know Zoe (http://www.zoe.nu) uses Lucene to index IMAP
dirs, but I’m uncertain how it goes about it…that might be a place to
start. Thanks!

Erik H. wrote:

ZOE uses the IMAP (and POP, and others) networking protocols to read
e-mail and then to index it in all sorts of intense and sophisticated
ways. I’m not sure what Java library ZOE uses for this, but knowing
the creator of it (we met once a couple of years ago) he probably
built his own IMAP API from scratch using sockets.

I’m pretty sure ZOE downloads all e-mail from the server and into its
own message store. You then point your e-mail client to ZOE as your
server. Last I checked, ZOE only supported POP clients, though.

Jen

On Jan 12, 2006, at 12:49 PM, jennyw wrote:

own message store. You then point your e-mail client to ZOE as your
server. Last I checked, ZOE only supported POP clients, though.

I guess it’s a bit confusing which aspect we’re talking about here.
ZOE is both a client and a server. ZOE is both a POP and IMAP
client, but also a POP server as well as an SMTP server. I think
it also serves as an IMAP server, though I’m not entirely sure.

<http://guests.evectors.it/zoe/>

Pretty snazzy, and its use of Lucene is uncanny.

The main point here is that ZOE does speak IMAP and can grab mails
from it.

Erik

On Jan 11, 2006, at 7:17 AM, John W. wrote:

to communicate with the IMAP server via an API specific to that server,
especially if there’s a hook that can be called on updates to the
message store, that would be ideal.

Thanks Jen. I know Zoe (http://www.zoe.nu) uses Lucene to index IMAP
dirs, but I’m uncertain how it goes about it…that might be a place to
start. Thanks!

ZOE uses the IMAP (and POP, and others) networking protocols to read
e-mail and then to index it in all sorts of intense and sophisticated
ways. I’m not sure what Java library ZOE uses for this, but knowing
the creator of it (we met once a couple of years ago) he probably
built his own IMAP API from scratch using sockets.

net/imap is built into Ruby itself, and is probably the way to start
what you’re doing.
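
As a rough illustration of what starting with net/imap might look like, here is a minimal sketch that walks one mailbox read-only and yields each message's UID and subject. `each_subject` is a hypothetical helper name; it only assumes that `imap` behaves like a logged-in `Net::IMAP` object (it uses `examine`, `uid_search` and `uid_fetch`). The host and credentials in the usage comment are placeholders.

```ruby
# Yield [uid, subject] for every message in a mailbox, read-only.
# `imap` is expected to behave like a logged-in Net::IMAP object.
def each_subject(imap, mailbox)
  imap.examine(mailbox)                    # open without write access
  imap.uid_search(["ALL"]).each do |uid|
    data = imap.uid_fetch(uid, "ENVELOPE")
    next if data.nil? || data.empty?       # message vanished meanwhile
    yield uid, data[0].attr["ENVELOPE"].subject
  end
end

# Usage (not run here, needs a real server):
#   require 'net/imap'
#   imap = Net::IMAP.new('imap.example.org', 143, false)
#   imap.login('user', 'secret')
#   each_subject(imap, "INBOX") { |uid, subject| puts "#{uid}: #{subject}" }
```

Fetching one message at a time is slow but keeps the sketch simple; `uid_fetch` also accepts a range or an array of UIDs if you want to batch.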

Erik

Erik H. wrote:

The main point here is that ZOE does speak IMAP and can grab mails
from it.

Yep, and using net/imap in combination with Ferret is working very well
so far.

What a great project…thanks!

John

John Wells wrote:

Yep, and using net/imap in combination with Ferret is working very well
so far.

Correction…was working fine. It seems to freeze up when the index
directory size hits around 178 megs (I’m indexing a 2.2 G mail account).

Has anyone else experienced any problems with large indexes? Running
strace on the process shows no activity at all, yet CPU utilization by
the process is at 97.6%.

Any ideas?

Btw, the index it was able to create works great…I can’t wait to have
the whole 2 GB indexed!

Thanks,
John

I removed the message it was hanging on, but it’s still stopping at 178
meg, no matter what I do. Any ideas what might be causing this? I have
plenty of disk space…

Thanks,
John

Here’s the stack trace when I control+c out of it:
/usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/analysis/tokenizers.rb:49:in `scan_until': Interrupt
from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/analysis/tokenizers.rb:49:in `next'
from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/analysis/token_filters.rb:21:in `next'
from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/analysis/token_filters.rb:52:in `next'
from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/document_writer.rb:122:in `invert_document'
from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/document_writer.rb:88:in `each'
from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/document_writer.rb:88:in `invert_document'
from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/document_writer.rb:58:in `add_document'
from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index_writer.rb:158:in `add_document'
from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index.rb:270:in `<<'
from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index.rb:238:in `synchronize'
from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index.rb:238:in `<<'
from /home/jb/ruby/fermail.rb:43:in `index_it'
from /home/jb/ruby/fermail.rb:18:in `each'
from /home/jb/ruby/fermail.rb:18:in `index_it'
from /home/jb/ruby/fermail.rb:70
from /home/jb/ruby/fermail.rb:64:in `each'
from /home/jb/ruby/fermail.rb:64

Hi John,

I’m not exactly sure what is causing your problem. It may just be that
the 178 MB mark is the point where you have 10,000 documents being
merged or something. Do you know how many documents are in the index at
that point? Anyway, I don’t really have time to look into it right now
as I think most of these types of problems will be sorted out when I
finally release the new version of Ferret backed by cFerret. I can’t
say when that will be but hopefully it won’t be too far away.

Sorry to keep everyone waiting.

Cheers,
Dave

David B. wrote:

I’m not exactly sure what is causing your problem. It may just be that
the 178 MB mark is the point where you have 10,000 documents being
merged or something. Do you know how many documents are in the index at
that point? Anyway, I don’t really have time to look into it right now
as I think most of these types of problems will be sorted out when I
finally release the new version of Ferret backed by cFerret. I can’t
say when that will be but hopefully it won’t be too far away.

Hello Dave,

It stops consistently at 2902 documents, but when I disabled fetching of
the email body it went beyond this. Strange error indeed. I’m going to
continue trying to figure out what’s going on.
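
Since the run gets further once the body is skipped, one cheap experiment is to cap how much body text reaches the tokenizer. `BODY[TEXT]` returns everything after the headers, which for a message with attachments includes the encoded attachment data, so a single body can be very large. The sketch below is only a guess at a workaround, not a diagnosis; `capped_body` and the `MAX_BODY_CHARS` limit are hypothetical and the number is arbitrary.

```ruby
MAX_BODY_CHARS = 100_000  # arbitrary cap, an assumption to tune

# Return a body that is safe to hand to the indexer:
# never nil, and never longer than the cap.
def capped_body(body, limit = MAX_BODY_CHARS)
  body.to_s[0, limit]
end
```

In the indexing loop this would replace the raw body, for example `capped_body(msg.attr["BODY[TEXT]"])`. If indexing then runs past document 2902, the size or content of that one body is the likely trigger.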

Any chance you could reenable the cFerret svn repository on your server?
Tried to download per instructions but received connection refused.

Thanks for your help!

John

Any chance you could reenable the cFerret svn repository on your server?
Tried to download per instructions but received connection refused.

done

John, I am very interested in the Ruby-Ferret IMAP search tool. Did you
already manage to index 2Gb of emails? Are you willing to share your
code so I can also search thru my email? It’s not yet 2Gb but keeps on
growing :)

Joost

David B. wrote:

Any chance you could reenable the cFerret svn repository on your server?
Tried to download per instructions but received connection refused.

David,

Thanks.

Btw, I’m very interested in still understanding what’s causing my
current problem. I’d like to take a stab at it myself, but would ask for
a pointer on getting started. What approach would you take in tracking
this problem down? I thought about running the script in the debugger
but man, the added overhead would’ve caused it to run forever.

Any debug logging I can enable in ferret? Anything else you could
suggest?
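
Short of a debugger, one low-overhead approach is to log each message just before it is indexed and put a time limit around the indexing call, so the offending message identifies itself. A minimal sketch using Ruby's standard `timeout` library; `index_with_watchdog` is a hypothetical helper, and whether the timeout can interrupt the tokenizer on a given Ruby version is an assumption to verify.

```ruby
require 'timeout'

# Run the block for one message. Logs how long it took, or logs a
# timeout instead of hanging forever. Returns :ok or :timeout.
def index_with_watchdog(label, seconds, log = $stderr)
  started = Time.now
  Timeout.timeout(seconds) { yield }
  log.puts "#{label}: indexed in #{'%.2f' % (Time.now - started)}s"
  :ok
rescue Timeout::Error
  log.puts "#{label}: gave up after #{seconds}s"
  :timeout
end
```

In the indexing loop the call would look something like `index_with_watchdog("uid #{uid}, #{body.to_s.size} bytes", 60) { index << doc }`. Logging the body size alongside the UID shows quickly whether the slow messages are simply the big ones.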

Thanks for the great work and the help!

John

Joost wrote:

John, I am very interested in the Ruby-Ferret IMAP search tool. Did you
already manage to index 2Gb of emails? Are you willing to share your
code so I can also search thru my email? It’s not yet 2Gb but keeps on
growing :)

Hi Joost,

Well, it’s certainly not perfect code…more of a dirty hack to try it
out. And, as noted, if I try to index the body it doesn’t fare very
well.

That said, I’d be happy to share it. I’ll post it later tonight when I
have access to it.

Thanks,
John

Ok…it’s neither pretty nor clean nor idiomatic Ruby (I’m a nuby ;),
but as a dirty hack it works (unless you fetch the body…that is).

Let me know if you have any questions:

#!/usr/bin/env ruby

require 'rubygems'
require 'ferret'
require 'net/imap'
include Ferret
include Ferret::Document

MAX_RETRIES = 20

index = Index::Index.new(:path => "/path/to/index/goes/here")
$count = 0
$retry = 0  # must start at 0, otherwise incrementing it in the rescue raises
$imap = Net::IMAP.new('server_ip_address_goes_here', 143, false)

$imap.login('username_goes_here', 'password_goes_here')

# Build a stored, tokenized field. to_s guards against nil and
# non-String values such as integers or address lists.
def stored(name, value)
  Field.new(name, value.to_s, Field::Store::YES, Field::Index::TOKENIZED)
end

# Index every message of the currently selected mailbox.
def index_it(imapobj, index, box)
  imapobj.search(["ALL"]).each do |message_id|
    begin
      msg = imapobj.fetch(message_id,
                          "(UID RFC822.SIZE ENVELOPE BODY[TEXT])")[0]
      envelope = msg.attr["ENVELOPE"]
      from = nil
      if envelope.from != nil and envelope.from.size > 0
        from = envelope.from[0].name
      end

      doc = Document.new
      doc << stored("id", message_id)
      # The body is indexed but not stored
      doc << Field.new("body", msg.attr["BODY[TEXT]"].to_s,
                       Field::Store::NO, Field::Index::TOKENIZED)
      doc << stored("from", from)
      doc << stored("subject", envelope.subject)
      doc << stored("date", envelope.date)
      doc << stored("uid", msg.attr["UID"])
      doc << stored("size", msg.attr["RFC822.SIZE"])
      doc << stored("sender", envelope.sender)
      doc << stored("in_reply_to", envelope.in_reply_to)
      doc << Field.new("mailbox", box, Field::Store::YES,
                       Field::Index::UNTOKENIZED)

      index << doc
      $count += 1
      print "#{$count} : #{from} <==> #{envelope.subject}\n"
      $retry = 0
    rescue => detail
      print detail, "\n"
      print detail.backtrace.join("\n"), "\n"
      $retry += 1
      if $retry < MAX_RETRIES
        print "Retrying\n"
        retry
      else
        print "Retry threshold reached. Exiting...\n"
        exit!(99)
      end
    end
  end
end

# Walk every mailbox on the server, skipping the customflags folder.
$imap.list("", "*").each do |box|
  name = box.name
  print "NAME: #{name}:#{box.class}\n"
  if name and name != "" and name !~ /customflags/
    begin
      $imap.select(name)
      index_it($imap, index, name)
    rescue => detail
      print "ERROR: " + detail.message + "\n"
    end
  end
end

Joost wrote:

Hi John,

Thanks for the quick reaction. I’m a nuby too :) At the moment I haven’t
got the time to look at the code… when I do, I certainly will. I hope
there is a new version of Ferret out by then…so it’ll work completely &
fast.

Ok… ;)

Btw, that code only creates the index. You’ll then have to implement
code to search it, and you’ll probably want it to dig out the UID for
you. Here’s a sample of a search:
############################################
#!/usr/bin/env ruby
require 'rubygems'
require 'ferret'
require 'net/imap'
include Ferret

abort "usage: #{$0} PHRASE" if ARGV.empty?

puts "-" * 50

index = Index::Index.new(:path => "/path/to/index/goes/here")

# Phrase query against the body field
index.search_each('body:"' + ARGV[0] + '"') do |doc, score|
  puts "Document #{doc} found with a score of #{score}"
  hit = index[doc]
  puts "#{hit['from']} <--> #{hit['subject']} #{hit['uid']}"
end

puts "-" * 50
############################################
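
To go from a search hit back to the actual mail, the stored "mailbox" and "uid" fields can be handed to net/imap. A minimal sketch, assuming `imap` behaves like a logged-in `Net::IMAP` object and that the UIDs in the index are still valid on the server; `fetch_hit` is a hypothetical helper name.

```ruby
# Fetch the full RFC 822 text for one search hit, or nil if the
# message is no longer there. `mailbox` and `uid` are the values
# stored in the index (the uid may come back as a String).
def fetch_hit(imap, mailbox, uid)
  imap.examine(mailbox)                      # read-only access is enough
  data = imap.uid_fetch(uid.to_i, "RFC822")
  return nil if data.nil? || data.empty?
  data[0].attr["RFC822"]
end

# Usage inside the search loop (not run here, needs a real server):
#   hit = index[doc]
#   puts fetch_hit($imap, hit["mailbox"], hit["uid"])
```

Note that this uses `uid_fetch` rather than `fetch`, because plain message sequence numbers shift whenever mail is deleted.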

Hi John,

Thanks for the quick reaction. I’m a nuby too :) At the moment I haven’t
got the time to look at the code… when I do, I certainly will. I hope
there is a new version of Ferret out by then…so it’ll work completely &
fast.

Thanks, Joost