Forum: Ferret
Ferret with IMAP dirs

John Wells (Guest)
on 2006-01-10 15:55
I'd like to use ferret to build an imap indexer and search utility, but
want to check first to see if anyone else is working on this and offer
my help. Anyone?

Also, if you could provide any helpful pointers on indexing directories
via ferret, it'll be very much appreciated. I'm a lucene nuby.

Thanks!
John
Jennyw J. (Guest)
on 2006-01-10 22:29
(Received via mailing list)
John W. wrote:

>I'd like to use ferret to build an imap indexer and search utility, but
>want to check first to see if anyone else is working on this and offer
>my help. Anyone?
>
>
This could be really challenging if you want it to work for multiple
IMAP servers. If you target a specific one, though, you might have
better luck. The biggest issue I see is that, although the IMAP RFC
implies a message's UID always stays the same, my understanding is that
this doesn't hold on all implementations. Also, it may be
tough to keep track of all changes to a user's inbox.  If there's a way
to communicate with the IMAP server via an API specific to that server,
especially if there's a hook that can be called on updates to the
message store, that would be ideal.
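
For what it's worth, RFC 3501 pairs each mailbox's UIDs with a UIDVALIDITY value: a UID is only guaranteed to keep pointing at the same message while UIDVALIDITY stays unchanged. Below is a minimal Ruby sketch of how an indexer might honor that rule. The cache layout and both method names are hypothetical, and reading the current UIDVALIDITY from the server (it arrives as an untagged response to SELECT/EXAMINE) is left out because how to get at it depends on the net/imap version.

```ruby
# Hypothetical cache layout, one entry per mailbox:
#   { "INBOX" => { :uidvalidity => 1136900000, :uids => [1, 2, 3] } }

# True when the UIDs remembered for +mailbox+ can still be trusted.
def cached_uids_usable?(cache, mailbox, current_uidvalidity)
  entry = cache[mailbox]
  !entry.nil? && entry[:uidvalidity] == current_uidvalidity
end

# Returns the UIDs that still need to be indexed.
def uids_to_index(cache, mailbox, current_uidvalidity, server_uids)
  if cached_uids_usable?(cache, mailbox, current_uidvalidity)
    server_uids - cache[mailbox][:uids]   # only messages not indexed before
  else
    server_uids                           # unknown mailbox or UIDVALIDITY changed: start over
  end
end
```

When UIDVALIDITY changes, the old index entries for that mailbox would also have to be thrown away, which this sketch does not show.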

Good luck!

Jen
John W. (Guest)
on 2006-01-11 14:17
jennyw jennyw wrote:
> This could be really challenging if you want it to work for multiple
> IMAP servers. If you target a specific one, though, you might have
> better luck. The biggest issue I see is that, although the IMAP RFC
> implies a message's UID always stays the same, my understanding is that
> this doesn't hold on all implementations. Also, it may be
> tough to keep track of all changes to a user's inbox.  If there's a way
> to communicate with the IMAP server via an API specific to that server,
> especially if there's a hook that can be called on updates to the
> message store, that would be ideal.

Thanks Jen. I know Zoe (http://www.zoe.nu) uses Lucene to index IMAP
dirs, but I'm uncertain how it goes about it...that might be a place to
start. Thanks!
Erik H. (Guest)
on 2006-01-12 15:42
(Received via mailing list)
On Jan 11, 2006, at 7:17 AM, John W. wrote:
>> to communicate with the IMAP server via an API specific to that
>> server,
>> especially if there's a hook that can be called on updates to the
>> message store, that would be ideal.
>
> Thanks Jen. I know Zoe (http://www.zoe.nu) uses Lucene to index IMAP
> dirs, but I'm uncertain how it goes about it...that might be a
> place to
> start. Thanks!

ZOE uses the IMAP (and POP, and others) networking protocols to read
e-mail and then to index it in all sorts of intense and sophisticated
ways.  I'm not sure what Java library ZOE uses for this, but knowing
the creator of it (we met once a couple of years ago) he probably
built his own IMAP API from scratch using sockets.

net/imap is built into Ruby itself, and is probably the way to start
what you're doing.
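
As a rough illustration of that starting point, here is a small sketch that walks one mailbox by UID and hands each message's envelope to a block. The method name and the batch size are made up for this example; `imap` is assumed to be an already logged-in Net::IMAP connection (from `require 'net/imap'`), and only `examine`, `uid_search` and `uid_fetch` are called on it.

```ruby
# Yields [uid, envelope] for every message in +mailbox+.
# +imap+ is assumed to be a logged-in Net::IMAP connection; any object that
# answers examine, uid_search and uid_fetch the same way works too.
def each_envelope(imap, mailbox, batch_size = 100)
  imap.examine(mailbox)                      # read-only, unlike select
  uids = imap.uid_search(["ALL"])
  uids.each_slice(batch_size) do |batch|
    # uid_fetch may return nil when nothing matched, hence Array()
    Array(imap.uid_fetch(batch, "ENVELOPE")).each do |msg|
      yield msg.attr["UID"], msg.attr["ENVELOPE"]
    end
  end
end
```

Something like `each_envelope(imap, "INBOX") { |uid, env| puts "#{uid}: #{env.subject}" }` would then list subjects. Fetching in batches rather than one message at a time is a design choice to cut down on round trips, not a requirement.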

	Erik
Jennyw J. (Guest)
on 2006-01-12 19:53
(Received via mailing list)
Erik H. wrote:

>ZOE uses the IMAP (and POP, and others) networking protocols to read
>e-mail and then to index it in all sorts of intense and sophisticated
>ways.  I'm not sure what Java library ZOE uses for this, but knowing
>the creator of it (we met once a couple of years ago) he probably
>built his own IMAP API from scratch using sockets.
>
>
I'm pretty sure ZOE downloads all e-mail from the server and into  its
own message store.  You then point your e-mail client to ZOE as your
server. Last I checked, ZOE only supported POP clients, though.

Jen
Erik H. (Guest)
on 2006-01-12 21:11
(Received via mailing list)
On Jan 12, 2006, at 12:49 PM, jennyw wrote:
> own message store.  You then point your e-mail client to ZOE as your
> server. Last I checked, ZOE only supported POP clients, though.

I guess it's a bit confusing which aspect we're talking about here.
ZOE is both a client and a server.  ZOE is both a POP and IMAP
_client_, but also a POP server as well as an SMTP server.  I think
it also serves as an IMAP server, though I'm not entirely sure.

	<http://guests.evectors.it/zoe/>

Pretty snazzy, and its use of Lucene is uncanny.

The main point here is that ZOE does speak IMAP and can grab mails
from it.

	Erik
John Wells (Guest)
on 2006-01-13 16:31
Erik H. wrote:
> The main point here is that ZOE does speak IMAP and can grab mails
> from it.

Yep, and using net/imap in combination with Ferret is working very well
so far.

What a great project...thanks!

John
John W. (Guest)
on 2006-01-14 06:11
John  Wells wrote:
> Yep, and using net/imap in combination with Ferret is working very well
> so far.

Correction...was working fine. It seems to freeze up when the index
directory size hits around 178 MB (I'm indexing a 2.2 GB mail account).

Has anyone else experienced any problems with large indexes? Strace'ing
the process shows no activity at all, yet CPU utilization by the
process is at 97.6%.

Any ideas?

Btw, the index it was able to create works great...I can't wait to have
the whole 2 GB indexed!

Thanks,
John
John W. (Guest)
on 2006-01-14 06:41
Here's the stack trace when I control+c out of it:
/usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/analysis/tokenizers.rb:49:in `scan_until': Interrupt
        from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/analysis/tokenizers.rb:49:in `next'
        from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/analysis/token_filters.rb:21:in `next'
        from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/analysis/token_filters.rb:52:in `next'
        from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/document_writer.rb:122:in `invert_document'
        from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/document_writer.rb:88:in `each'
        from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/document_writer.rb:88:in `invert_document'
        from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/document_writer.rb:58:in `add_document'
        from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index_writer.rb:158:in `add_document'
        from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index.rb:270:in `<<'
        from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index.rb:238:in `synchronize'
        from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index.rb:238:in `<<'
        from /home/jb/ruby/fermail.rb:43:in `index_it'
        from /home/jb/ruby/fermail.rb:18:in `each'
        from /home/jb/ruby/fermail.rb:18:in `index_it'
        from /home/jb/ruby/fermail.rb:70
        from /home/jb/ruby/fermail.rb:64:in `each'
        from /home/jb/ruby/fermail.rb:64
John W. (Guest)
on 2006-01-14 16:56
I removed the message it was hanging on, but it's still stopping at 178
meg, no matter what I do. Any ideas what might be causing this? I have
plenty of disk space...

Thanks,
John
David B. (Guest)
on 2006-01-15 01:28
(Received via mailing list)
Hi John,

I'm not exactly sure what is causing your problem. It may just be that
the 178 MB mark is the point where you have 10,000 documents being
merged or something. Do you know how many documents are in the index at
that point? Anyway, I don't really have time to look into it right now
as I think most of these types of problems will be sorted out when I
finally release the new version of Ferret backed by cFerret. I can't
say when that will be but hopefully it won't be too far away.

Sorry to keep everyone waiting.

Cheers,
Dave
John W. (Guest)
on 2006-01-15 07:57
David B. wrote:
> I'm not exactly sure what is causing your problem. It may just be that
> the 178 MB mark is the point where you have 10,000 documents being
> merged or something. Do you know how many documents are in the index at
> that point? Anyway, I don't really have time to look into it right now
> as I think most of these types of problems will be sorted out when I
> finally release the new version of Ferret backed by cFerret. I can't
> say when that will be but hopefully it won't be too far away.

Hello Dave,

It stops consistently at 2902 documents, but when I disabled fetching of
the email body it went beyond this. Strange error indeed. I'm going to
continue trying to figure out what's going on.
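
Since leaving out the body makes the problem go away, one thing that might be worth trying is fetching only the first part of each body, so a huge attachment never reaches the tokenizer. The sketch below uses the partial-fetch syntax from RFC 3501 (`<start.count>` after the section). Both helper names and the 64 KB figure are assumptions for illustration, and whether this actually avoids the hang is untested.

```ruby
# Builds the FETCH attribute asking for at most +max_bytes+ of the body text.
# BODY.PEEK avoids setting the \Seen flag; <0.N> is RFC 3501 partial-fetch syntax.
def partial_body_attr(max_bytes)
  "(UID RFC822.SIZE ENVELOPE BODY.PEEK[TEXT]<0.#{max_bytes}>)"
end

# Picks the body text out of a fetch result's attr hash. A partial result is
# labelled "BODY[TEXT]<0>" by the server, so match on the prefix rather than
# relying on one exact key.
def body_text(attr)
  pair = attr.find { |key, _value| key.start_with?("BODY[TEXT]") }
  pair ? pair[1] : nil
end
```

With a net/imap connection this would be used roughly as `msg = imap.fetch(message_id, partial_body_attr(64 * 1024))[0]` followed by `body = body_text(msg.attr)`.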

Any chance you could reenable the cFerret svn repository on your server?
Tried to download per instructions but received connection refused.

Thanks for your help!

John
David B. (Guest)
on 2006-01-15 13:31
(Received via mailing list)
> Any chance you could reenable the cFerret svn repository on your server?
> Tried to download per instructions but received connection refused.

done
John W. (Guest)
on 2006-01-16 19:01
David B. wrote:
>> Any chance you could reenable the cFerret svn repository on your server?
>> Tried to download per instructions but received connection refused.

David,

Thanks.

Btw, I'm very interested in still understanding what's causing my
current problem. I'd like to take a stab at it myself, but would ask for
a pointer on getting started. What approach would you take in tracking
this problem down? I thought about running the script in the debugger
but man, the added overhead would've caused it to run forever.

Any debug logging I can enable in ferret? Anything else you could
suggest?
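
One low-overhead alternative to the debugger, sketched here as an assumption rather than anything Ferret provides: wrap the indexing call so that each message's UID, body size and elapsed time get logged, and give up on a message after a time limit. The method name and the 30-second default are made up for this example. The stack trace above shows the time going into the tokenizer's `scan_until`, so the log should point at the message (or body size) that triggers it. Whether a timeout can actually interrupt the tokenizer depends on the Ruby version.

```ruby
require 'timeout'

# Runs the given block (the actual indexing call) for one message and logs how
# long it took. Returns :ok, or :timeout if the block ran past the limit.
def timed_index(uid, body, limit_seconds = 30, log = $stderr)
  size = body.to_s.size
  started = Time.now
  Timeout.timeout(limit_seconds) { yield }
  elapsed = Time.now - started
  log.puts "uid=#{uid} size=#{size} took #{format('%.2f', elapsed)}s"
  :ok
rescue Timeout::Error
  log.puts "uid=#{uid} size=#{size} gave up after #{limit_seconds}s"
  :timeout
end
```

In an indexing loop this would be called as `timed_index(uid, body) { index << doc }`.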

Thanks for the great work and the help!

John
Joost (Guest)
on 2006-01-25 12:05
John, I am very interested in the Ruby-Ferret IMAP search tool. Did you
already manage to index 2 GB of emails? Are you willing to share your
code so I can also search thru my email? It's not yet 2 GB but keeps on
growing :)

Joost
John W. (Guest)
on 2006-01-25 23:07
Joost wrote:
> John, I am very interested in the Ruby-Ferret IMAP search tool. Did you
> already manage to index 2 GB of emails? Are you willing to share your
> code so I can also search thru my email? It's not yet 2 GB but keeps on
> growing :)

Hi Joost,

Well, it's certainly not perfect code...more of a dirty hack to try it
out. And, as noted, if I try to index the body it doesn't fare very
well.

That said, I'd be happy to share it. I'll post it later tonight when I
have access to it.

Thanks,
John
John W. (Guest)
on 2006-01-26 04:14
Ok...it's neither pretty nor clean nor idiomatic Ruby (I'm a nuby ;),
but as a dirty hack it works (unless you fetch the body...that is).

Let me know if you have any questions:

#!/usr/bin/env ruby
# Walks every IMAP mailbox and adds each message to a Ferret index.

require 'rubygems'
require 'ferret'
include Ferret
include Ferret::Document
require 'net/imap'

index = Index::Index.new(:path=>"/path/to/index/goes/here")
$count = 0
$retry = 0 # consecutive failures; must start at 0 or the rescue below fails on nil
$imap = Net::IMAP.new('server_ip_address_goes_here', 143, false)

$imap.login('username_goes_here', 'password_goes_here')

print $imap.examine("INBOX")

# Indexes every message in the currently selected mailbox.
def index_it(imapobj, index, box)
  # search returns message sequence numbers; the UID is fetched separately below
  imapobj.search(["ALL"]).each do |message_id|
    begin
      msg = imapobj.fetch(message_id, "(UID RFC822.SIZE ENVELOPE BODY[TEXT])")[0]
      envelope = msg.attr["ENVELOPE"]
      body = msg.attr["BODY[TEXT]"]
      uid = msg.attr["UID"]
      size = msg.attr["RFC822.SIZE"]
      date = envelope.date
      subject = envelope.subject
      if envelope.from != nil and envelope.from.size > 0
        from = envelope.from[0].name
      end
      sender = envelope.sender
      to = envelope.to
      in_reply_to = envelope.in_reply_to

      # The body is indexed but not stored; everything else is stored as well.
      doc = Document.new
      doc << Field.new("id", message_id, Field::Store::YES, Field::Index::TOKENIZED)
      doc << Field.new("body", body, Field::Store::NO, Field::Index::TOKENIZED)
      doc << Field.new("from", from, Field::Store::YES, Field::Index::TOKENIZED)
      doc << Field.new("subject", subject, Field::Store::YES, Field::Index::TOKENIZED)
      doc << Field.new("date", date, Field::Store::YES, Field::Index::TOKENIZED)
      doc << Field.new("uid", uid, Field::Store::YES, Field::Index::TOKENIZED)
      doc << Field.new("size", size, Field::Store::YES, Field::Index::TOKENIZED)
      doc << Field.new("sender", sender, Field::Store::YES, Field::Index::TOKENIZED)
      doc << Field.new("in_reply_to", in_reply_to, Field::Store::YES, Field::Index::TOKENIZED)
      doc << Field.new("mailbox", box, Field::Store::YES, Field::Index::UNTOKENIZED)

      index << doc
      $count = $count + 1
      print "#{$count} : #{from} <==> #{subject}\n"
      $retry = 0
    rescue => detail
      print detail
      print detail.backtrace.join("\n")
      print "Retrying"
      $retry = 1 + $retry
      if $retry < 20
        retry # re-run the begin block for the same message
      else
        print "Retry threshold reached. Exiting..."
        exit!(99)
      end
    end
  end
end

$imap.examine("INBOX")

# Visit every mailbox except the customflags ones.
$imap.list("", "*").each do |box|
  name = box.name
  print "NAME: #{name}:#{box.class}\n"
  if name and name != "" and name !~ /customflags/
    begin
      $imap.select(name)
      index_it($imap, index, name)
    rescue => detail
      print "ERROR: " + detail.message + "\n"
    end
  end
end
Joost (Guest)
on 2006-01-26 15:38
Hi John,

Thanks for the quick reaction. I'm a nuby too :) At the moment I haven't
got the time to look at the code.. when I do, I certainly will. I hope
there is a new version of Ferret out by then..so it'll work completely &
fast.

Thanks, Joost
John W. (Guest)
on 2006-01-26 15:53
Joost wrote:
> Hi John,
>
> Thanks for the quick reaction. I'm a nuby too :) At the moment I haven't
> got the time to look at the code.. when I do, I certainly will. I hope
> there is a new version of Ferret out by then..so it'll work completely &
> fast.

Ok... ;)

Btw, that code only creates the index. You'll then have to implement
code to search it, and you'll probably want it to dig out the UID for
you. Here's a sample of a search:
############################################
#!/usr/bin/env ruby
# Pass the phrase to look for in message bodies as the first argument.
require 'rubygems'
require 'ferret'
include Ferret
require 'net/imap'

50.times { print "-" }; print "\n"

index = Index::Index.new(:path=>"/path/to/index/goes/here")

index.search_each('body:"' + ARGV[0] + '"') do |doc, score|
  puts "Document #{doc} found with a score of #{score}"
  # Print the stored fields; interpolation also copes with a missing "from".
  puts "#{index[doc]['from']} <--> #{index[doc]['subject']} <--> #{index[doc]['uid']}"
end

50.times { print "-" }; print "\n"
############################################