Enhancing the Gateway (Help Needed)

Here’s the short story on the current situation with our mailing list
to Usenet gateway:

  • Our Usenet host rejects multipart/alternative messages
    because they are technically illegal Usenet posts
  • This means that some emails do not reach comp.lang.ruby
    (several messages each day according to the logs)
  • We don’t like this

To solve this, we want to enhance the gateway to convert multipart/
alternative messages into something we can legally post to Usenet. I
have two thoughts on this strategy:

  1. If possible, we should gather all text/plain portions of an email
    and post those with a content-type of text/plain
  2. If that fails, we can just post the original body but force the
    content-type to text/plain for maximum compatibility

Now I need all of you email and Usenet experts to tell me if that’s a
sane strategy. If another approach would be better, please clue me in.

I’ve pretty much made it this far. The code at the bottom of this
message is the mail_to_news.rb script used by the gateway rewritten
using this strategy.

If you aren’t familiar with the gateway code, you can get details
from the articles at:

There’s one problem left I know I haven’t solved correctly. Help me
figure out a decent strategy for this last piece and we can deploy
the new code.

The outstanding issue is how to handle character sets for the
constructed message. You’ll see in the code below that I just pull
the charset param from the original message, but after looking at a
few messages, I realize that this doesn’t make sense. For example,
here are the relevant portions of a recent post that wasn’t gated
correctly:

Content-Type: multipart/alternative; boundary=Apple-Mail-18-445454026

--Apple-Mail-18-445454026
Content-Transfer-Encoding: 7bit
Content-Type: text/plain;
    charset=US-ASCII;
    delsp=yes;
    format=flowed

As you can see, the overall email doesn’t have a charset but each
text portion can. If we are going to merge these parts, what’s the
best strategy for handling the charset?

I thought of trying to convert them all to UTF-8 with Iconv, but I’m
not sure what to do if a type doesn’t declare a charset or when Iconv
chokes on what is declared? Please share your opinions.

If you are feeling really adventurous, rewrite the relevant portion
of the code below, which I will bracket with FIX ME comments.

Here’s the script:

#!/usr/bin/env ruby

# written by James Edward G. II [email protected]

$KCODE = "u"

GATEWAY_DIR = File.join(File.dirname(__FILE__), "..").freeze

$LOAD_PATH << File.join(GATEWAY_DIR, "config") << File.join(GATEWAY_DIR, "lib")

require "tmail"

require "servers_config"
require "nntp"

require "logger"
require "timeout"

# prepare log

log = Logger.new(ARGV.shift || $stdout)
log.datetime_format = "%Y-%m-%d %H:%M "

# build incoming and outgoing message object

incoming = TMail::Mail.parse($stdin.read)
outgoing = TMail::Mail.new

# skip any flagged messages

if incoming["X-Rubymirror"].to_s == "yes"
  log.info "Skipping message ##{incoming.message_id}, sent by news_to_mail"
  exit
elsif incoming["X-Spam-Status"].to_s =~ /\AYes/
  log.info "Ignoring Spam ##{incoming.message_id}: " +
           "#{incoming.subject}--#{incoming.from}"
  exit
end

# only allow certain headers through

%w[from subject in_reply_to transfer_encoding date].each do |header|
  outgoing.send("#{header}=", incoming.send(header))
end
# strip any trailing dots before the closing ">" of the message id
outgoing.message_id = incoming.message_id.sub(/\.+>$/, ">")
%w[X-ML-Name X-Mail-Count X-X-Sender].each do |header|
  outgoing[header] = incoming[header].to_s if incoming.key? header
end

# doctor headers for Ruby Talk

outgoing.references = if incoming.key? "References"
  incoming.references
else
  if incoming.key? "In-Reply-To"
    incoming.reply_to
  else
    if incoming.subject =~ /^Re:/
      outgoing.reply_to = "this_is_a_dummy_message-id@rubygateway"
    end
  end
end
outgoing["X-Ruby-Talk"] = incoming.message_id
outgoing["X-Received-From"] = <<END_GATEWAY_DETAILS.gsub(/\s+/, " ")
This message has been automatically forwarded from the ruby-talk mailing list by
a gateway at #{ServersConfig::NEWSGROUP}. If it is SPAM, it did not originate at
#{ServersConfig::NEWSGROUP}. Please report the original sender, and not us.
Thanks! For more details about this gateway, please visit:

END_GATEWAY_DETAILS
outgoing["X-Rubymirror"] = "Yes"

# translate the body of the message, if needed

if incoming.multipart? and incoming.sub_type == "alternative"
  # FIX ME

  # handle multipart/alternative messages

  # extract body

  body = ""
  extract_text = lambda do |message_or_part|
    if message_or_part.multipart?
      message_or_part.each_part { |part| extract_text[part] }
    elsif message_or_part.content_type == "text/plain"
      body += message_or_part.body
    end
  end
  extract_text[incoming]
  if body.empty?
    outgoing.body = "Note: the content-type of this message was altered by " +
                    "the gateway.\n\n#{incoming.body}"
  else
    outgoing.body = "Note: non-text portions of this message were stripped " +
                    "by the gateway.\n\n#{body}"
  end

  # set the content type of the new message

  outgoing.set_content_type( "text", "plain",
                             "charset" => incoming.type_param("charset") )

  # END FIX ME
else
  %w[content_type body].each do |header|
    outgoing.send("#{header}=", incoming.send(header))
  end
end

log.info "Sending message ##{incoming.message_id}: " +
         "#{incoming.subject}--#{incoming.from}..."
log.info "Message looks like:\n#{outgoing.encoded}"

# connect to NNTP host

begin
  nntp = nil
  Timeout.timeout(30) do
    nntp = Net::NNTP.new( ServersConfig::NEWS_SERVER,
                          Net::NNTP::NNTP_PORT,
                          ServersConfig::NEWS_USER,
                          ServersConfig::NEWS_PASS )
  end
rescue Timeout::Error
  log.error "The NNTP connection timed out"
  exit -1
rescue
  log.fatal "Unable to establish connection to NNTP host: #{$!.message}"
  exit -1
end

# attempt to send newsgroup post

unless $DEBUG
  begin
    result = nil
    Timeout.timeout(30) { result = nntp.post(outgoing.encoded) }
  rescue Timeout::Error
    log.error "The NNTP post timed out"
    exit -1
  rescue
    log.fatal "Unable to post to NNTP host: #{$!.message}"
    exit -1
  end
  log.info "... Sent. nntp.post() result: #{result}"
end

__END__

Thanks for the help.

James Edward G. II

From: “James Edward G. II” [email protected]

  1. If possible, we should gather all text/plain portions of an email
    and post those with a content-type of text/plain

Do we get many HTML-only messages, having a text/html part, without a
corresponding text/plain part?

Or is that too uncommon to worry about?

Regards,

Bill

Hi,

At Mon, 29 Oct 2007 06:20:48 +0900,
James Edward G. II wrote in [ruby-talk:276334]:

To solve this, we want to enhance the gateway to convert multipart/
alternative messages into something we can legally post to Usenet. I
have two thoughts on this strategy:

  1. If possible, we should gather all text/plain portions of an email
    and post those with a content-type of text/plain

Rather I want it to be done by FML itself on ruby-lang.org.

  2. If that fails, we can just post the original body but force the
    content-type to text/plain for maximum compatibility

I do it locally by w3m -dump -T text/html.

Content-Transfer-Encoding: 7bit
Content-Type: text/plain;
    charset=US-ASCII;
    delsp=yes;
    format=flowed

As you can see, the overall email doesn’t have a charset but each
text portion can. If we are going to merge these parts, what’s the
best strategy for handling the charset?

“alternative” means that each body actually has the same content,
so, in theory, you can and should select just one of them.
Merging them all is the wrong behavior. I suspect you mean
multipart/relative.

I thought of trying to convert them all to UTF-8 with Iconv, but I’m
not sure what to do if a type doesn’t declare a charset or when Iconv
chokes on what is declared? Please share your opinions.

Should be defaulted to US-ASCII.

On Oct 28, 2007, at 10:00 PM, Nobuyoshi N. wrote:

Rather I want it to be done by FML itself on ruby-lang.org.

Excellent. Are there any plans to make that happen?

I’m trying to get it in the gateway so we can stop having this
discussion. ;) But if there are plans to have the list itself do
it, that’s great.

  2. If that fails, we can just post the original body but force the
    content-type to text/plain for maximum compatibility

I do it locally by w3m -dump -T text/html.

Yes, I assume we could use lynx/links to similar effect. My strategy
wasn’t as clever, but I thought by swapping the content type we would
at least get the content, though it would have some noise.

--Apple-Mail-18-445454026
“alternative” means that each body actually has the same content,
so, in theory, you can and should select just one of them.
Merging them all is the wrong behavior.

Now you know why I asked for help. I know so little about email
rules. Thanks for explaining this.

This is good news because it greatly simplifies the process.

Do you know if multipart content can be nested? For example, could a
single part of a multipart message itself be multipart? The design
of TMail seems to support this, but again it’s easier if that’s not
the case.

I suspect you mean multipart/relative.

I wasn’t even aware of that format, to be honest. I knew of
multipart/mixed (which our Usenet host will allow) and multipart/
alternative. What is the purpose of multipart/relative?

I thought of trying to convert them all to UTF-8 with Iconv, but I’m
not sure what to do if a type doesn’t declare a charset or when Iconv
chokes on what is declared? Please share your opinions.

Should be defaulted to US-ASCII.

Do you mean that US-ASCII is the charset when one is not specified?

Thanks to all for the information.

James Edward G. II

Hi,

At Mon, 29 Oct 2007 12:18:40 +0900,
James Edward G. II wrote in [ruby-talk:276357]:

  1. If possible, we should gather all text/plain portions of an email
    and post those with a content-type of text/plain

Rather I want it to be done by FML itself on ruby-lang.org.

Excellent. Are there any plans to make that happen?

I’m asking eban.

Do you know if multipart content can be nested? For example, could a
single part of a multipart message itself be multipart? The design
of TMail seems to support this, but again it’s easier if that’s not
the case.

Yes, and the depth isn’t restricted.

I suspect you mean multipart/relative.

I wasn’t even aware of that format, to be honest. I knew of
multipart/mixed (which our Usenet host will allow) and multipart/
alternative. What is the purpose of multipart/relative?

As the above.

I thought of trying to convert them all to UTF-8 with Iconv, but I’m
not sure what to do if a type doesn’t declare a charset or when Iconv
chokes on what is declared? Please share your opinions.

Should be defaulted to US-ASCII.

Do you mean that US-ASCII is the charset when one is not specified?

RFC 2045 Internet Message Bodies November 1996

5.2. Content-Type Defaults

Default RFC 822 messages without a MIME Content-Type header are taken
by this protocol to be plain text in the US-ASCII character set,
which can be explicitly specified as:

 Content-type: text/plain; charset=us-ascii

This default is assumed if no Content-Type header field is specified.
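
For readers following along, here is a minimal sketch of the "default to US-ASCII, then convert to UTF-8" idea discussed above. It is not the gateway's code. It assumes a modern Ruby and uses String#encode rather than Iconv (the tool named in the thread, which later Ruby versions no longer ship). The helper name `to_utf8` and the choice of "?" as the replacement character are assumptions made for illustration.

```ruby
# Hypothetical helper: interpret `text` in the declared charset (US-ASCII
# when none is declared, following the RFC 2045 default) and return UTF-8.
def to_utf8(text, declared_charset = nil)
  name = declared_charset.to_s.strip
  name = "US-ASCII" if name.empty?

  source = begin
    Encoding.find(name)
  rescue ArgumentError
    Encoding::US_ASCII # unknown charset name: fall back to the default
  end
  source ||= Encoding::US_ASCII

  # Bytes that are invalid in the source charset become "?" instead of raising.
  text.dup.force_encoding(source).encode(
    Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "?"
  )
rescue Encoding::ConverterNotFoundError
  # Ruby knows the charset but cannot convert it: treat the bytes as US-ASCII.
  text.dup.force_encoding(Encoding::US_ASCII).encode(
    Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "?"
  )
end

puts to_utf8("caf\xE9".b, "ISO-8859-1") # declared charset is honored
puts to_utf8("caf\xE9".b)               # no charset: non-ASCII byte is replaced
```

Replacing undecodable bytes instead of raising is one possible answer to the "what if the converter chokes" question; rejecting the message would be another.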

On Oct 28, 2007, at 6:39 PM, Bill K. wrote:

From: “James Edward G. II” [email protected]

  1. If possible, we should gather all text/plain portions of an
    email and post those with a content-type of text/plain

Do we get many HTML-only messages, having a text/html part, without a
corresponding text/plain part?

I know I have seen it at least once in the past. I suspect it’s
rare, but that’s just me guessing. When dealing with the Internet at
large, I think we always need to be prepared for the worst case
scenario.

Or is that too uncommon to worry about?

You made a good point here that I should try looking at some actual
Ruby Talk messages to see what we’re up against. I’ll put together a
script to comb through a subset of the archives…

James Edward G. II

Hi,

At Mon, 29 Oct 2007 13:17:24 +0900,
Nobuyoshi N. wrote in [ruby-talk:276371]:

I suspect you mean multipart/relative.

I wasn’t even aware of that format, to be honest. I knew of
multipart/mixed (which our Usenet host will allow) and multipart/
alternative. What is the purpose of multipart/relative?

As the above.

Oops, it was multipart/related, and I removed the paragraph
mentioned about it. My mistake, sorry.

On Oct 28, 2007, at 10:22 PM, James Edward G. II wrote:

On Oct 28, 2007, at 6:39 PM, Bill K. wrote:

Or is that too uncommon to worry about?

You made a good point here that I should try looking at some actual
Ruby Talk messages to see what we’re up against. I’ll put together
a script to comb through a subset of the archives…

For the curious, running this script:

#!/usr/bin/env ruby -KU

require "tmail"

raw = ""
ARGF.each_line do |line|
  if line =~ /\AFrom / and not raw.empty?
    begin
      email = TMail::Mail.parse(raw)
      if [email.to, email.cc].join(",").include?("ruby-talk@ruby-lang.org") and
         email.multipart? and email.sub_type != "mixed"
        puts "#{email['X-Mail-Count']}: #{email.content_type} " +
             "(#{email.type_param('charset')})"
        email.each_part do |part|
          puts "  #{part.content_type} (#{part.type_param('charset')})"
        end
      end
    rescue
      # do nothing--skip the bad message
    end
    raw = line
  else
    raw += line
  end
end

__END__

on an mbox file of my trash, which includes a lot of Ruby Talk posts,
I get the attached results.

For the record, multipart/signed messages do seem to be gated correctly.

James Edward G. II

James Edward G. II wrote:

RFC 2387 - The MIME Multipart/Related Content-type (RFC2387)

This type does not seem easy to deal with and I am open to suggestions for
the best strategy to use.

AFAIK it’s mostly used for HTML messages with images embedded in the
email itself. I guess it would mostly be one part of a
multipart/alternative message, of which one alternative should be
text/plain anyway. Otherwise, you’re most likely left with HTML to
strip, and images which you may either drop or attach to the output as
files.

Sorry if I happen to be wrong on one point or the other.

mortee

On Oct 28, 2007, at 11:35 PM, Nobuyoshi N. wrote:

As the above.

Oops, it was multipart/related, and I removed the paragraph
mentioned about it. My mistake, sorry.

I’ve been looking into this a little this morning.

We do receive multipart/related messages, though they seem fairly
uncommon compared to multipart/alternative. They don’t appear to be
gated properly. In fact, the mailing list archives don’t even seem
to show them. For example 271796 was a multipart/related message and
I can’t find it in the archives or on comp.lang.ruby.

To understand what we are dealing with here, I read:

RFC 2387 - The MIME Multipart/Related Content-type (RFC2387)

This type does not seem easy to deal with and I am open to suggestions
for the best strategy to use.

James Edward G. II

On Oct 29, 2007, at 9:20 AM, mortee wrote:

To understand what we are dealing with here, I read:

RFC 2387 - The MIME Multipart/Related Content-type (RFC2387)

This type does not seem easy to deal with and I am open to
suggestions for the best strategy to use.

AFAIK it’s mostly used for HTML messages with images embedded in the
email itself.

Yeah, I think that’s what I’m seeing in my analysis of the messages.

I guess it would mostly be one part of a multipart/alternative
message, of which one alternative should be text/plain anyway.

Most of the cases I have found have a multipart/alternative section
inside the multipart/related section, like this example shows:

271796: multipart/related ()
  multipart/alternative ()
  image/png ()

Obviously I need to extend my statistics gathering script to handle
the nesting, but I’ve checked this message by hand and there was a
text/plain part in there.
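
For readers following along, here is a minimal sketch of how a statistics script could walk nested parts. It does not use TMail: `Part` is a hypothetical stand-in struct, `outline` is a made-up helper name, and the sample message below is invented for illustration (only the outer multipart/related, multipart/alternative, and image/png shape comes from the listing above).

```ruby
# Hypothetical stand-in for a parsed MIME part; this is not TMail's API.
Part = Struct.new(:content_type, :charset, :parts) do
  def multipart?
    !(parts.nil? || parts.empty?)
  end
end

# Return one line per part, indented two spaces per nesting level.
def outline(part, depth = 0)
  lines = ["#{'  ' * depth}#{part.content_type} (#{part.charset})"]
  if part.multipart?
    part.parts.each { |child| lines.concat(outline(child, depth + 1)) }
  end
  lines
end

# Invented sample: the inner text parts are assumptions, not real data.
message = Part.new("multipart/related", nil, [
  Part.new("multipart/alternative", nil, [
    Part.new("text/plain", "US-ASCII", nil),
    Part.new("text/html", "US-ASCII", nil)
  ]),
  Part.new("image/png", nil, nil)
])

puts outline(message)
```

Because the recursion has no depth limit, it copes with arbitrarily deep nesting, which matters given that the nesting depth of multipart content is not restricted.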

Otherwise, you’re most likely left with HTML to
strip, and images which you may either drop or attach to the output as
files.

Right. Which means I still need to settle on an HTML strategy as well.

Sorry if I happen to be wrong on one point or the other.

The other usage that seems common, more common than the HTML case in
fact, is as part of a signed message:

271822: multipart/signed ()
  multipart/related ()
  application/pgp-signature ()

I’ve not yet checked to see if these messages are gated properly with
our current setup.

James Edward G. II

Todd B. wrote:

The lowest common denominator for language is US-ASCII (is that a good
thing or bad thing? You decide).

Aside from any language bias: the language of this list/group is
certainly English, which does just fine in ASCII. So IMHO we wouldn’t
lose much by falling back to that in case of some iconv errors.
Certainly not enough to justify the extra effort of working around
them.

mortee

On 10/29/07, James Edward G. II [email protected] wrote:

I haven’t built enough clout in this group for my opinion to matter,
but here goes…

James did a great job with the gateway … no doubt about that.
Should we even have it? I absolutely think so.

The lowest common denominator for language is US-ASCII (is that a good
thing or bad thing? You decide).

Make sure, James and others, that you label the reformed
emails/postings with some kind of rejoinder that says something to the
effect of “mail/posting has been modified to make it available.”

Todd

Le 29 octobre à 16:06, James Edward G. II a écrit :

On Oct 29, 2007, at 9:20 AM, mortee wrote:

Otherwise, you’re most likely left with HTML to
strip, and images which you may either drop or attach to the output as
files.

Right. Which means I still need to settle on an HTML strategy as well.

I’m not sure you have that many HTML only messages. For my mailbox, I
have an HTML-only filter. It catches 0.5% of my incoming mail, and it’s
100% spam.

OTOH, I seem to recall we looked at a weird multipart/alternative
message recently which had only one plain text part.

I’ve not yet checked to see if these messages are gated properly with
our current setup.

Yes. I have [email protected] / ruby-talk 276326,
for instance. I can’t guarantee it’s propagated as well as a pure text
message, but it should be on most servers.

Fred

On Oct 29, 2007, at 10:02 AM, Todd B. wrote:

I haven’t built enough clout in this group for my opinion to matter,
but here goes…

I’m in over my head with all this email stuff and need all the help I
can get. The gateway belongs to all of us, not me. So don’t be
shy. Help me fix this right and we all benefit.

James did a great job with the gateway … no doubt about that.

Just to be totally clear, I didn’t make the original gateway. I’m
just the current caretaker.

Make sure, James and others, that you label the reformed
emails/postings with some kind of rejoinder that says something to the
effect of “mail/posting has been modified to make it available.”

I will absolutely do this. The code I posted earlier in this thread
already does.

James Edward G. II

Fred, you always show up when I need you. That’s why you’re still my
best friend. ;)

On Oct 29, 2007, at 1:55 PM, F. Senault wrote:

Right. Which means I still need to settle on an HTML strategy as well.

I’m not sure you have that many HTML only messages. For my mailbox, I
have an HTML-only filter. It catches 0.5% of my incoming mail, and
it’s 100% spam.

Yes, you may be right about that. Perhaps not much of a concern.
I’m not seeing any such messages in my sample data.

OTOH, I seem to recall we looked at a weird multipart/alternative
message recently which had only one plain text part.

Sadly, that’s extremely common. Have a look at just the beginning of
my sample data:

271456: multipart/alternative ()
  text/plain (UTF-8)
271541: multipart/signed ()
  text/plain (utf-8)
  application/pgp-signature ()
271567: multipart/signed ()
  text/plain (iso-8859-1)
  application/pgp-signature ()
271588: multipart/signed ()
  text/plain (utf-8)
  application/pgp-signature ()
271569: multipart/alternative ()
  text/plain (ISO-8859-1)
271578: multipart/alternative ()
  text/plain (ISO-8859-1)
271566: multipart/signed ()
  text/plain (iso-8859-1)
  application/pgp-signature ()
271568: multipart/alternative ()
  text/plain (ISO-8859-1)
271444: multipart/alternative ()
  text/plain (ISO-8859-1)
271452: multipart/alternative ()
  text/plain (ISO-8859-1)
271640: multipart/alternative ()
  text/plain (UTF-8)
271669: multipart/alternative ()
  text/plain (ISO-8859-1)

Good thing those are super easy to fix. ;)

I’ve not yet checked to see if these messages are gated properly with
our current setup.

Yes. I have [email protected] / ruby-talk 276326,
for instance. I can’t guarantee it’s propagated as well as a pure text
message, but it should be on most servers.

Awesome. That’s good to know. Thanks for checking that for me.

James Edward G. II

Le 28 octobre à 22:20, James Edward G. II a écrit :

Content-Transfer-Encoding: 7bit
Content-Type: text/plain;
    charset=US-ASCII;
    delsp=yes;
    format=flowed

As you can see, the overall email doesn’t have a charset but each
text portion can. If we are going to merge these parts, what’s the
best strategy for handling the charset?

Well, usually, you don’t have more than one charset in a message; you
should push the charset of the part back to the main header and be done
with it.

Now, if you have more than one text part and different charsets, it’s a
bit more complicated…

I thought of trying to convert them all to UTF-8 with Iconv, but I’m
not sure what to do if a type doesn’t declare a charset or when Iconv
chokes on what is declared? Please share your opinions.

Hm… Complain to the poster / the software writer? :)

Fred

On Oct 28, 2007, at 4:20 PM, James Edward G. II wrote:

Now I need all of you email and Usenet experts to tell me if that’s
a sane strategy.

OK, here is the revised plan folks. Complain now if you see flaws:

  • The gateway will only alter messages with a top-level content-type
    of multipart/alternative or multipart/related
  • For both types of messages, it will search for the first text/plain
    part and promote that to the body, discarding other types (this is
    probably not the ideal handling of multipart/related, but it seems
    to fit the messages we are seeing on Ruby Talk)
  • All modified messages will begin with a disclaimer on the first line

James Edward G. II

On Oct 29, 2007, at 2:15 PM, F. Senault wrote:

Le 28 octobre à 22:20, James Edward G. II a écrit :

I thought of trying to convert them all to UTF-8 with Iconv, but I’m
not sure what to do if a type doesn’t declare a charset or when Iconv
chokes on what is declared? Please share your opinions.

Hm… Complain to the poster / the software writer? :)

Good plan. ;)

James Edward G. II

On Oct 29, 2007, at 3:46 PM, James Edward G. II wrote:

On Oct 28, 2007, at 4:20 PM, James Edward G. II wrote:

Now I need all of you email and Usenet experts to tell me if
that’s a sane strategy.

OK, here is the revised plan folks. Complain now if you see flaws:

I forgot one detail…

  • The gateway will only alter messages with a top-level content-type
    of multipart/alternative or multipart/related
  • For both types of messages, it will search for the first text/plain
    part and promote that to the body, discarding other types (this is
    probably not the ideal handling of multipart/related, but it
    seems to fit the messages we are seeing on Ruby Talk)
  • If we fail to find a text/plain part, the gateway will keep the
    body as is, but force the content-type of the message to text/plain
    in the hopes of getting the content through with some noise (it seems
    this will be needed for very few messages, possibly none)
  • All modified messages will begin with a disclaimer on the first line
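
For readers following along, the plan above can be sketched roughly as follows. This is not the deployed gateway code and does not use TMail: `MimePart`, `first_text_plain`, and `gate` are hypothetical names, and the struct is a stand-in for a parsed message. The disclaimer wording is borrowed from the script posted earlier in the thread, the US-ASCII default follows the RFC 2045 text quoted earlier, and taking the charset from the promoted part follows Fred's suggestion.

```ruby
# Hypothetical stand-in for a parsed MIME message or part; not TMail's API.
MimePart = Struct.new(:content_type, :charset, :body, :parts) do
  def multipart?
    !(parts.nil? || parts.empty?)
  end
end

# Only these top-level types are rewritten; everything else passes through.
REWRITTEN_TYPES = %w[multipart/alternative multipart/related].freeze

# Depth-first search for the first text/plain leaf, or nil if there is none.
def first_text_plain(part)
  if part.multipart?
    part.parts.each do |child|
      found = first_text_plain(child)
      return found if found
    end
    nil
  elsif part.content_type == "text/plain"
    part
  end
end

# Returns [content_type, charset, body] for the outgoing post.
def gate(message)
  unless REWRITTEN_TYPES.include?(message.content_type)
    return [message.content_type, message.charset, message.body]
  end

  if (text = first_text_plain(message))
    note = "Note: non-text portions of this message were stripped by the gateway."
    ["text/plain", text.charset || "US-ASCII", "#{note}\n\n#{text.body}"]
  else
    # No text/plain part found: keep the body, force the type, accept noise.
    note = "Note: the content-type of this message was altered by the gateway."
    ["text/plain", message.charset || "US-ASCII", "#{note}\n\n#{message.body}"]
  end
end

# Invented sample message for illustration.
sample = MimePart.new("multipart/alternative", nil, "raw body", [
  MimePart.new("text/plain", "ISO-8859-1", "Hello, Ruby Talk!", nil),
  MimePart.new("text/html", "ISO-8859-1", "<p>Hello, Ruby Talk!</p>", nil)
])
type, charset, body = gate(sample)
puts "#{type}; charset=#{charset}"
puts body
```

Returning a plain array keeps the sketch independent of any mail library; a real gateway would set these values on its outgoing message object instead.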

James Edward G. II