Extract data from email - Tmail, Hpricot

geo · September 4, 2008, 5:03pm

Hi all,

I have an html email that I would like to parse.

The problem I’m having is removing all html tags and getting past the
header information. Then I want to extract all the information per row
to put into a database.

The email is pasted here: http://pastie.textmate.org/265259

I have tried Tmail, but can’t seem to extract just the body. Then I
tried Hpricot and wasn’t sure what to use before the .inner_html. So
basically I’m very lost on where to start.

Any help is appreciated.

Thanks!

geo · September 4, 2008, 8:42pm

George Cooper wrote:

I have tried Tmail, but can’t seem to extract just the body. Then I
tried Hpricot and wasn’t sure what to use before the .inner_html. So
basically I’m very lost on where to start.

Any help is appreciated.

Thanks!

It would help if you posted some code that didn’t work, so people can
have a better idea of what you’re trying to do. Tmail should have been
able to parse that without a problem. In any case, extracting the body
is easy: the body follows the first empty line. You could use something
like split, but duplicating such a huge string could be slow. When you
read the mail, try reading a line at a time until you reach the empty
line, then read the rest into a buffer for Hpricot.
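
A minimal sketch of that line-at-a-time approach, using only the Ruby standard library. The method name split_headers_and_body and the sample message are made up for illustration, and folded (multi-line) headers are not handled here:

```ruby
require 'stringio'

# Read header lines one at a time until the first empty line,
# then read everything that remains as the body.
# (Hypothetical helper, not part of Tmail or Hpricot.)
def split_headers_and_body(io)
  header_lines = []
  while (line = io.gets)
    line = line.chomp                 # drops "\n" or "\r\n"
    break if line.empty?              # the empty line ends the headers
    header_lines << line
  end
  body = io.read.to_s                 # the rest of the message
  [header_lines, body]
end

# Made-up sample message standing in for the real .eml file.
raw = "From: sender@example.com\r\n" \
      "Subject: Test\r\n" \
      "\r\n" \
      "<html><body><p>Hello</p></body></html>\r\n"

headers, body = split_headers_and_body(StringIO.new(raw))
puts headers.inspect
puts body
```

With a real file you would pass an open File instead of the StringIO, for example File.open('emailhtml.eml') { |f| split_headers_and_body(f) }, and hand the returned body to the HTML parser.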

geo · September 8, 2008, 5:38pm

Michael M. wrote:

It would help if you posted some code that didn’t work, so people can
have a better idea of what you’re trying to do. Tmail should have been
able to parse that without a problem. In any case, extracting the body
is easy: the body follows the first empty line. You could use something
like split, but duplicating such a huge string could be slow. When you
read the mail, try reading a line at a time until you reach the empty
line, then read the rest into a buffer for Hpricot.

Below is the code I am using to try and get the body out of the html
email (copy of email http://pastie.org/265259) .

require 'rubygems'
require 'tmail'

email = TMail::Mail.load( 'emailhtml.eml' )

puts email['body'] # comes back nil
puts email['from']
puts email['Delivered-To']
puts email['to'] # comes back nil
puts email['subject']
puts email['date']
puts email['X-Originalarrivaltime']

results:
nil
[email protected]
[email protected]
nil
[Freddy] New Incidents captured on 2008-09-02
Tue, 2 Sep 2008 19:05:00 -0400
02 Sep 2008 23:10:35.0578 (UTC) FILETIME=[1B2659A0:01C90D51]

geo · September 9, 2008, 1:00am

On Sep 8, 11:31 am, Geo _C [email protected] wrote:

require 'rubygems'
require 'tmail'

email = TMail::Mail.load( 'emailhtml.eml' )

puts email['body'] # comes back nil

Don’t see why it would be nil. I would contact Mikel.

http://lindsaar.net/

puts email['from']
puts email['Delivered-To']
puts email['to'] # comes back nil

I don’t see a ‘to’ in the header, so is this a surprise?

Tue, 2 Sep 2008 19:05:00 -0400
02 Sep 2008 23:10:35.0578 (UTC) FILETIME=[1B2659A0:01C90D51]

T.

geo · September 11, 2008, 6:36pm

I got Tmail to extract the body of my email. The solution (very simple
and embarrassing) is below. Now I’m trying to figure out Hpricot, but
examples seem to be fairly thin. If anyone knows of a good tutorial for
beginners, please post. I have been using
http://code.whytheluckystiff.net/doc/hpricot/ , but could use something
more basic.

Thanks for the help!
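
For the per-row extraction itself, here is a rough sketch of the idea. It deliberately uses plain regular expressions from the Ruby standard library so that it runs without any gems; this is a swapped-in technique, not Hpricot's API, and it is only reasonable for simple, well-formed tables. The helper names and the sample HTML are made up, since the real email's layout is not shown in this thread, and HTML entities are left undecoded:

```ruby
# Remove tags from an HTML fragment and collapse runs of whitespace.
# (Hypothetical helper for illustration.)
def strip_tags(fragment)
  fragment.gsub(/<[^>]*>/, ' ').gsub(/\s+/, ' ').strip
end

# Return one array of cell texts for each <tr> found in the HTML.
def table_rows(html)
  html.scan(%r{<tr\b[^>]*>(.*?)</tr>}mi).flatten.map do |row|
    row.scan(%r{<t[dh]\b[^>]*>(.*?)</t[dh]>}mi).flatten.map do |cell|
      strip_tags(cell)
    end
  end
end

# Made-up sample body standing in for the real email.
html = <<~HTML
  <html><body><table>
    <tr><th>ID</th><th>Status</th></tr>
    <tr><td>101</td><td><b>Open</b></td></tr>
    <tr><td>102</td><td>Closed</td></tr>
  </table></body></html>
HTML

rows = table_rows(html)
rows.each { |cells| puts cells.join(' | ') }
```

Each inner array then maps onto one database record. A real HTML parser such as Hpricot is the sturdier choice once the markup gets messy (nested tables, unclosed tags), where regular expressions like these break down.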

Thomas S. wrote:

On Sep 8, 11:31 am, Geo _C [email protected] wrote:

require 'rubygems'
require 'tmail'

email = TMail::Mail.load( 'emailhtml.eml' )

puts email['body'] # comes back nil

Don’t see why it would be nil. I would contact Mikel.

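I needed to use email.body instead of email['body'] to return the body.
As far as I can tell, email['body'] looks for a header field named
"body", and there is no such header, which is why it came back nil.
Thanks, Peter!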

http://lindsaar.net/

puts email['from']
puts email['Delivered-To']
puts email['to'] # comes back nil

I don’t see a ‘to’ in the header, so is this a surprise?

My mistake there. You are correct, there is no ‘to’ for me to use.
