Mailing List Files (#115)

The three rules of Ruby Quiz:

  1. Please do not post any solutions or spoiler discussion for this quiz until
     48 hours have passed from the time on this message.

  2. Support Ruby Quiz by submitting ideas as often as you can:

http://www.rubyquiz.com/

  3. Enjoy!

Suggestion: A [QUIZ] in the subject of emails about the problem helps everyone
on Ruby Talk follow the discussion. Please reply to the original quiz message,
if you can.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

The Ruby Talk mailing list archives will show files attached to incoming
messages. However, it’s not always easy to get at the data from these files
using the archives alone. The attachments are sometimes displayed in
not-too-readable formats:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/190780

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/226884

This is tough for those of us who like to play with Ruby Quiz solutions.

This week’s quiz is to write a program that takes a message id number as a
command-line argument and “downloads” any attachments from that message. Assume
message ids are for Ruby Talk posts by default, but you may want to provide an
option to override that so we can support lists like Ruby Core as well.

If no path is given, write the attachments to the working directory. When there
is a path, your code should place the files there instead.
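
A minimal sketch of the command-line interface described above, using Ruby's standard OptionParser. The method name parse_quiz_args, the --list switch, and the constants are hypothetical; the archive URL layout is assumed from the two links quoted earlier, and the ruby-core path in particular is an assumption rather than something the quiz states.

```ruby
require 'optparse'

# Hypothetical defaults; the URL prefix is copied from the archive links above.
DEFAULT_LIST = 'ruby-talk'
ARCHIVE_ROOT = 'http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/'

# Parse an argv-style array into a settings hash without modifying it.
def parse_quiz_args(argv)
  settings = { list: DEFAULT_LIST, dir: '.' }
  parser = OptionParser.new do |o|
    o.banner = 'Usage: fetch_attachments [options] message-id [directory]'
    o.on('-l', '--list NAME', 'Mailing list (default: ruby-talk)') do |name|
      settings[:list] = name
    end
  end

  # OptionParser#parse returns the arguments that were not switches.
  rest = parser.parse(argv)
  raise ArgumentError, parser.banner unless rest.first.to_s.match?(/\A\d+\z/)

  settings[:id]  = rest[0]
  settings[:dir] = rest[1] if rest[1]   # no path given: keep the working directory
  settings[:url] = "#{ARCHIVE_ROOT}#{settings[:list]}/#{settings[:id]}"
  settings
end
```

Called as parse_quiz_args(ARGV), this would yield the message id, the target directory, and the page to fetch; the actual download and decoding are left to the solutions that follow.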

Does anyone have some good sample messages?

Aside from 190780 and 226884, I’ve been testing with 66854 and 63060,
which cover some cases that the first two do not.

Does anyone have some nice tricky attachments to test with?

/Christoffer

Much longer than Brian’s submission but here goes:

#!/usr/bin/env ruby -w

require 'getoptlong'
require "net/http"
require "base64"

opts = GetoptLong.new([ '--help',  '-h', GetoptLong::NO_ARGUMENT],
                      [ '--path',  '-p', GetoptLong::REQUIRED_ARGUMENT],
                      [ '--url',   '-u', GetoptLong::REQUIRED_ARGUMENT],
                      [ '--debug', '-d', GetoptLong::NO_ARGUMENT])

# Print a short usage summary and quit.
def print_usage_and_exit
  puts "Usage: #{File.basename($PROGRAM_NAME)} [switches] message-id"
  puts "  -p directory   set save directory to directory"
  puts "  -u url         set url to use to url"
  puts "  -d             display all decoded data as it is read"
  exit 0
end

class String

  # Non-destructive variant of strip_html!.
  def strip_html
    dup.strip_html!
  end

  # Replace each =XX escape with its byte and drop a trailing soft line break.
  def decode_quoted_printable
    decoded_string = gsub(/=../) { |code| code[1..2].hex.chr }
    strip_last = decoded_string.rstrip
    strip_last[-1] == ?= ? strip_last.chop! : decoded_string
  end

  # Remove tags, then translate the HTML entities left in the text.
  def strip_html!
    gsub!(/<.*?>/, '')
    gsub!(/&.*?;/) do |match|
      case match
      when "&amp;"  then '&'
      when "&quot;" then '"'
      when "&gt;"   then '>'
      when "&lt;"   then '<'
      when /&#\d+;/ then match[/(\d)+/].to_i.chr
      when /&#x[0-9a-fA-F]+;/ then match[/[0-9a-fA-F]+/].hex.chr
      else match
      end
    end
    self
  end

end

# State waiting for the first boundary line.

class WaitingState
  def process(line)
    return self unless line =~ /^--/
    HeaderReadingState.new(line.strip)
  end
end

WAITING_STATE = WaitingState.new

# State reading content description

class HeaderReadingState

def initialize(line)
@line = line.strip
@data = {}
@entry = nil
end

  def process(line)
    line.strip!

    # Ignore this attachment if we only have content lines.
    return WAITING_STATE if line[@line]

    # Switch to reading attachment-data when we encounter an empty line.
    return AttachmentParsingState.new(@line, @data) if line.empty?

    # If we have an entry-header, handle this.
    if line =~ /.*:/
      @entry = line.slice!(/^.*:/)
      @entry.chop!
    end

    # Invalid attachment
    return WAITING_STATE if @entry.nil? && !line.empty?

    unless line.empty?
      entry = @entry.downcase
      data = line.strip
      if data[-1] == ?;
        # More data on next line, so just chop the ; and keep the same entry.
        data.chop!
      else
        # Last data for this entry, so make sure next line has an entry.
        @entry = nil
      end
      # Data for each entry is stored as an array.
      @data[entry] = (@data[entry] || []) + data.split(/;/).collect { |part| part.strip }
    end

    # Stay in this state
    self
  end

end

# State for reading attachment content.

class AttachmentParsingState

  IDENTITY_DECODING = lambda { |string| string }

  QUOTED_PRINTABLE_DECODING = lambda { |string| string.decode_quoted_printable }

  BASE64_DECODING = lambda { |string| Base64.decode64(string) }

  ENCODINGS = { 'base64' => BASE64_DECODING,
                'quoted-printable' => QUOTED_PRINTABLE_DECODING }

  def initialize(line, data)
    @line = line

    # Determine the encoding of the content.
    encoding = ((data["content-transfer-encoding"] || []).first || "none").downcase

    # Select a decoder; nil means the data is stored undecoded.
    @decoding = ENCODINGS[encoding]

    # Check content-disposition to see if this is an attachment.
    # If so, extract the filename.
    disposition = data["content-disposition"]
    if disposition
      @filename = parse_filename(disposition)
      puts "Found attachment #{@filename} with encoding '#{encoding}'." if @filename
      puts "No decoder found for '#{encoding}' - decoding turned off." unless @decoding
    else
      @filename = nil
    end

    @data = ""
  end

  # Parse out a possible filename from content-disposition.

  def parse_filename(disposition)
    return nil unless disposition.member?("attachment")
    # Work on a copy so the stored header data is left untouched.
    filename = (disposition.find { |value| value =~ /filename/ } || "").dup
    filename.slice!("filename=")
    filename.strip!
    # Strip surrounding quotes without eval'ing text fetched from the network.
    filename = filename[/"(.*)"/, 1] if filename =~ /".*"/
    filename.empty? ? nil : filename
  end

  def store_attachment
    if @filename
      filename = File.join($file_path, @filename)
      if File.exist? filename
        puts "Extraction done: #{filename} already exists - skipping."
      else
        File.open(filename, "w+") { |file| file.print @data }
        puts "Extraction done: Attachment saved as '#{filename}'"
      end
    end
  end

  # Process a line

def process(line)

 if line[@line]
   # store the data we got this far.
   store_attachment

   # We hit a delimiter, so go back to header reading state.
   return HeaderReadingState.new(line)
 end

 # Decode and store data.
 decoded = @decoding ? @decoding.call(line) : line
 print decoded if $debug
 @data << decoded

 # stay in this state
 self

end

end

def save_attachment(host, path, index)
  state = WAITING_STATE
  # each_line also works on Ruby 1.9 and later, where String#each is gone.
  Net::HTTP.get(host, path + index.to_s).strip_html!.each_line do |str|
    state = state.process(str)
  end
end

$file_path = "."
$debug = false
host = "blade.nagaokaut.ac.jp"
path = "/cgi-bin/scat.rb/ruby/ruby-talk/"

opts.each do |opt, arg|
  case opt
  when '--help'
    print_usage_and_exit
  when '--path'
    if File.directory?(arg)
      $file_path = arg
    else
      puts "Illegal path '#{arg}' - Aborting."
      exit 0
    end
  when '--debug'
    $debug = true
  when '--url'
    # Split e.g. "http://host/some/path" into host and path.
    url  = arg.sub(%r{\A.*://}, '')
    host = url[%r{\A[^/]*}]
    path = url[%r{/.*\z}] || "/"
    path += "/" unless path[-1] == ?/
  end
end

print_usage_and_exit if ARGV.length != 1

message_id = ARGV.first.to_i

save_attachment(host, path, message_id)
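
As a cross-check for the hand-rolled decoders in the script above, here is a small sketch that decodes a part body with Ruby's built-in String#unpack, whose "m" and "M" directives handle base64 and quoted-printable respectively. The method name decode_part is hypothetical and not part of the submission.

```ruby
# Decode a MIME part body according to its Content-Transfer-Encoding value.
# Unknown or missing encodings leave the body untouched.
def decode_part(body, encoding)
  case encoding.to_s.downcase
  when 'base64'           then body.unpack('m').first
  when 'quoted-printable' then body.unpack('M').first
  else body
  end
end
```

Unlike the line-by-line state machine, this expects the whole body of one part at once, which lets unpack deal with soft line breaks itself.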

#!/usr/bin/env ruby
#
# q115.rb - solution to rubyquiz #115 (Mailing List Files)
#
# Lou S. [email protected]
#
# February 28, 2007
#
# = Dependencies
#
# It felt like I was cheating a lot in this quiz since I made use of several
# great libraries to do everything for me =) If you want to play with the
# script, you'll need to get a hold of:
#
# ActionMailer:: This was used for access to TMail. You might be able to use
#                TMail by itself, but I haven't tested it and rails might
#                have made some modifications.
# Elif::         This handy little library reads files backwards. This was
#                actually a solution from a previous quiz (Ruby Quiz #64,
#                Port a Library). Plus it's from James so you know it's
#                good stuff ;)
# Hpricot::      Used this little gem (no not the kind of package) to do the
#                scraping to get all the solutions for a quiz. Awesome, just
#                awesome!
#

# = The Script
#
# The messages in the archive are pretty close to being readable by TMail.
# Each page is just missing the correct mime header to let the mail parser
# know it's actually got attachments.
#
# After pulling out all the html artifacts, we still need to find the mime
# boundary. An easy way to do this is just look for the content-disposition
# headers for the attachments and then look above them to find the boundary.
#
# 1. Look for 'Content-Disposition: attachment'
# 2. Look for the first line above that which is not a mail header -- that's
#    what elif is helping with.
# 3. That line is the mime boundary. Add the header into the TMail object and
#    then you can read the attachments as normal
#

# = Running
#
# The script implements the command line interface mentioned in the quiz
# description. You just give it a ruby-talk message id and it will fetch the
# attachments into the current directory. If you follow the number by a path
# you can change the output directory.
#
#   $ q115 190780 outdir
#
# As an additional feature, you can also provide the number of the quiz
# prefixed with a 'q' character. In this case, all of the solutions will be
# downloaded and put in a subdirectory by solver. If the solution didn't have
# any attachments it puts the message body into a file called solution.txt.
#

require 'action_mailer'
require 'cgi'
require 'delegate'
require 'elif'
require 'fileutils'
require 'hpricot'
require 'open-uri'
require 'tempfile'

module Quiz115
  class QuizMail < DelegateClass(TMail::Mail)
    class << self
      attr_reader :archive_base_url

      def archive_base_url
        @archive_base_url ||
          "http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/"
      end

      # Returns [solver name, message url] pairs scraped from the quiz page.
      def solutions(quiz_number)
        doc = Hpricot(open("http://www.rubyquiz.com/quiz#{quiz_number}.html"))
        (doc/'#links'/'li/a').collect do |link|
          [CGI.unescapeHTML(link.inner_text), link['href']]
        end
      end
    end

def initialize(mail)
temp_path = to_temp_file(mail)
boundary = MIME::BoundaryFinder.new(temp_path).find_boundary

 @tmail = TMail::Mail.load(temp_path)
 @tmail.set_content_type 'multipart', 'mixed',
   'boundary' => boundary if boundary

 super(@tmail)

end

private

    def to_temp_file(mail)
      temp = Tempfile.new('qmail')

      temp.write(if (Integer(mail) rescue nil)
        # A bare number is a message id in the list archive.
        url = self.class.archive_base_url + mail
        open(url) { |f| cleanse_html f.read }
      else
        # Otherwise treat it as a URL or a local file path.
        web = URI.parse(mail).scheme == 'http'
        open(mail) { |m| web ? cleanse_html(m.read) : m.read }
      end)

      temp.close
      temp.path
    end

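    def cleanse_html(str)
      # Drop the page header before the message (assumed here to end at the
      # opening <pre> tag), then strip the remaining tags.
      CGI.unescapeHTML(str.gsub(/\A.*?<pre>/mi, '').gsub(/<[^>]*>/m, ''))
    end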
end

module MIME
class BoundaryFinder

 ##
 # Create a parser to find the mime boundary
 #
 def initialize(file)
   @elif = ::Elif.new(file)
   @in_attachment_headers = false
 end

      ##
      # Find the mime boundary marker.  Only returns the marker if it can find an
      # attachment, otherwise for quiz purposes there's no reason to find it: id
      # est we don't care about multipart/alternative messages, et cetera.
      #
def find_boundary
while line = @elif.gets
if @in_attachment_headers
if boundary = look_for_mime_boundary(line)
return boundary
end
else
look_for_attachment(line)
end
end
nil
end

 private

 def look_for_attachment line
   if line =~ /^content-disposition\s*:\s*attachment/i
     puts "Found an attachment" if $DEBUG
     @in_attachment_headers = true
   end
 end

 def look_for_mime_boundary line
   unless line =~ /^\S+\s*:\s*/ || # Not a mail header
          line =~ /^\s+/           # Continuation line?
     puts "I think I found it...#{line}" if $DEBUG
     line.strip.gsub(/^--/, '')
   else
     nil
   end
 end

end
end
end

include Quiz115
include FileUtils

def process_mail(mailh, outdir)
  begin
    t = QuizMail.new(mailh)
    if t.has_attachments?
      t.attachments.each do |attachment|
        outpath = File.join(outdir, attachment.original_filename)
        puts "\tWriting: #{outpath}"
        File.open(outpath, 'w') do |out|
          out.puts attachment.read
        end
      end
    else
      outfile = File.join(outdir, 'solution.txt')
      File.open(outfile, 'w') {|f| f.write t.body}
    end
  rescue => e
    puts "Couldn't parse mail correctly. Sorry! (E: #{e})"
  end
end

def to_dirname(solver)
  solver.downcase.delete('!#$&*?(){}').gsub(/\s+/, '_')
end

query = ARGV[0]
outdir = ARGV[1] || '.'

unless query
  $stderr.puts "You must specify either a ruby-talk message id, or a " \
               "quiz number (prefixed by 'q')"
  exit 1
end

if query =~ /\Aq/i
  quiz_number = query.sub(/\Aq/i, '')
  puts "Fetching all solutions for quiz ##{quiz_number}"

  QuizMail.solutions(quiz_number).each do |solver, url|
    puts "Fetching solution from #{solver}."

    dirname = to_dirname(solver)
    solver_dir = File.join(outdir, dirname)

    mkdir_p solver_dir
    process_mail(url, solver_dir)
  end
else
  process_mail(query, outdir)
end

exit 0
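
The three-step boundary search described in the script's header comment can be sketched without Elif by walking the lines of an in-memory string in reverse. This is a simplified illustration, not the submission's code: find_boundary is a hypothetical name, and it assumes the whole message fits comfortably in memory.

```ruby
# Return the MIME boundary (without its leading "--") found above the last
# "Content-Disposition: attachment" header, or nil if there is no attachment.
def find_boundary(text)
  in_attachment_headers = false
  text.lines.reverse_each do |line|
    if in_attachment_headers
      next if line =~ /^\S+\s*:\s*/   # step 2: skip mail headers
      next if line =~ /^\s/           # skip continuation and blank lines
      return line.strip.sub(/^--/, '') # step 3: first non-header line
    elsif line =~ /^content-disposition\s*:\s*attachment/i
      in_attachment_headers = true     # step 1: found an attachment
    end
  end
  nil
end
```

Reading backwards means the first attachment header met is the last one in the message, which is enough here because every part of one multipart message shares the same boundary.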

#!/usr/bin/env ruby

require 'net/http'
require 'strscan'
require 'cgi'

class GetAttachments
  def initialize(id)
    @id = id
    @url = "blade.nagaokaut.ac.jp"
    @params = "/cgi-bin/scat.rb/ruby/ruby-talk/" + @id
    @attachments = Array.new
  end

def store_attachments
# get the attachment, then store it.
self.fetch_attachments
self.save_attachments
end

  def fetch_attachments
    # get the page and extract email from pre tags
    @page = Net::HTTP.get(@url, @params)
    @page =~ /<pre>(.+)<\/pre>/im
    @email = $1
    # get rid of everything before the first part separator
    # NB boundary separators assumed to start with -- No RFC guarantee this
    # is always right.
    @email.sub!(/\A([^-]|-[^-])+/m, '')
    # create a scanner and grab header / body pairs
    @mime_scanner = StringScanner.new(@email)
    # this regex looks for a boundary line beginning -- then a line beginning
    # Content then other header stuff then a blank line then body stuff
    # then either another of the same or a boundary then an empty line.
    # Lookahead ?= prevents using part of next token.
    while @mime_scanner.scan(/(^--.+?\nContent.*?^\s*$)(.*?)(?=^--.+?\n(Content|^\s*$))/im) do
      attachment = Hash.new
      # translate html escapes and get rid of html mark-up that seems to
      # creep into body, plus starting and trailing spaces
      attachment[:header] = CGI.unescapeHTML( @mime_scanner[1] )
      attachment[:body] = CGI.unescapeHTML( @mime_scanner[2].gsub(/\A\s+/, '').chomp.gsub(/<[^>]*>/, '') )
      @attachments = @attachments << attachment
    end
  end

  def save_attachments
    @attachments.each do |a|
      # skip parts that aren't attachments
      next if !(a[:header] =~ /Content-Disposition:\s*attachment/i)
      # grab file name and encoding.
      # quit with error if no filename.
      if ( a[:header] =~ /filename\s*=\s*"?([a-z\-_\ 0-9.%$@!]+)"?\s*(\n|;)/i ||
           a[:header] =~ /name\s*=\s*"?([a-z\-_\ 0-9.%$]+)"?\s*(\n|;)/i )
        # do above as || to favor filename over name, which may be unnecessary
        # NB hasty assumptions about file name characters
        filename = $1
      else
        puts "Could not parse filename for attachment from #{a[:header]}"
        exit 1
      end
      if ( a[:header] =~ /Content-Transfer-Encoding:\s*"?([a-z\-_0-9]+)"?\s*?(\n|;)/i )
        encoding = $1
      end
      # if the filename specifies a directory and it exists, use it.
      # Otherwise just put in pwd.
      # NB clobbers any files with same name as attachment.
      if ( File.exist?(File.dirname(filename)) )
        file = File.new(filename, "w+")
      else
        file = File.new(filename = File.basename(filename), "w+")
      end
      # decode if necessary
      case encoding
      when /base64/i
        file << a[:body].unpack("m").first
      when /quoted-printable/i
        file << a[:body].unpack("M").first
      else
        file << a[:body]
      end
      # notify what's been done and clean up; no exit here, so that every
      # attachment (and every message id on the command line) gets processed
      file.close
      puts "Stored attachment from message #{@id} at #{@url} in #{File.expand_path(filename)}"
    end
  end
end

ARGV.each do |arg|
@ga = GetAttachments.new(arg)
@ga.store_attachments
end
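
Since attachment names come from whatever the message author typed, it may be worth reducing them to a bare file name before writing to disk. The sketch below shows one way to do that with File.basename; attachment_filename is a hypothetical helper, and its regular expressions are simplified assumptions rather than a full header parser.

```ruby
# Extract a safe file name from the headers of one MIME part.
# Returns nil when the part is not an attachment or has no usable name.
def attachment_filename(header)
  return nil unless header =~ /^content-disposition\s*:\s*attachment/i

  # Prefer a quoted value, then fall back to an unquoted token.
  name = header[/filename\s*=\s*"([^"]*)"/i, 1] ||
         header[/filename\s*=\s*([^;\s]+)/i, 1]
  return nil if name.nil? || name.empty?

  # Drop any directory components so the file lands in the chosen directory.
  File.basename(name)
end
```

The result can then be combined with the output directory through File.join, as the solutions above already do.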



John B.