Mailing List Files (#115)

bbazzarrakk · March 1, 2007, 4:40pm

I’ve been playing a little with TMail lately, which is what really inspired
this quiz. I thought that a simple solution to this problem would be to pull
the pages down with open-uri, dump them into TMail, and just pull the
attachments from that. It turns out to be a bit harder to do that than I
expected, but one solution did follow that path.

What I love about this plan is that you are just stitching the real tools
together. I like leaning on libraries to get tons of functionality with just
a few lines of code. Apparently, so does Louis J Scoras! Check out this list
of dependencies that kick-starts his solution (I’ve removed the excellent
comments in the code to save space):

#!/usr/bin/env ruby

require 'action_mailer'
require 'cgi'
require 'delegate'
require 'elif'
require 'fileutils'
require 'hpricot'
require 'open-uri'
require 'tempfile'

…

Wow.

Let’s start with the standard libraries. Louis pulls in cgi to handle HTML
escapes, delegate to wrap existing classes, fileutils for easy directory
creation, open-uri to fetch web pages, and tempfile for creating temporary
files, of course. That’s an impressive set of tools, all of which ship with
Ruby.

The other three dependencies are external. You can get them all as gems.
action_mailer is a component of the Rails framework used to handle email.
Louis doesn’t actually use the action_mailer part, just the bundled TMail
dependency. This is a trick for getting TMail as a gem.

elif is a little library I wrote as a solution to an earlier quiz (#64). It
reads files line by line, but in reverse order. In other words, you get the
last line first, then the next to last line, all the way up to the first
line.
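To make that reading order concrete, here is a minimal sketch that does not
use Elif itself. The helper name each_line_reversed is made up for this
illustration, and it simply slurps the whole file and walks the lines
backwards, which is only reasonable for small files:

```ruby
require 'tempfile'

# Hypothetical helper, not part of Elif: yields lines last to first.
# It reads the whole file into memory, which is fine for a small email.
def each_line_reversed(path)
  File.readlines(path).reverse_each { |line| yield line }
end

Tempfile.create('demo') do |file|
  file.write("first\nsecond\nthird\n")
  file.flush
  # Lines arrive in the order: third, second, first.
  each_line_reversed(file.path) { |line| puts line }
end
```

The point is only the order of the lines; the solution gets the same order
from Elif through a gets() call.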

hpricot is a fun little HTML parser from Why the Lucky Stiff. It has a
unique interface that makes it popular for web scraping.

Now that Louis has imported all the tools he could find, he’s ready to do
some fetching. Here’s the start of that code:

module Quiz115
  class QuizMail < DelegateClass(TMail::Mail)
    class << self
      attr_reader :archive_base_url

      def archive_base_url
        @archive_base_url ||
        "http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/"
      end

      def solutions(quiz_number)
        doc = Hpricot(
          open("http://www.rubyquiz.com/quiz#{quiz_number}.html")
        )
        (doc/'#links'/'li/a').collect do |link|
          [CGI.unescapeHTML(link.inner_text), link['href']]
        end
      end
    end

# ...

The object we are examining now is a TMail enhancement, via delegation. This
section adds some class methods for easy usability. I believe the
attr_reader line is actually intended to be attr_writer though, giving you a
way to override the base URL. The reader is defined manually and just
defaults to the Ruby Talk mailing list.
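Here is a small sketch of those two ideas together, using only the standard
delegate library. The Shout class and its suffix setting are invented for
the example; they just stand in for QuizMail and its base URL, with
attr_writer used the way I suspect was intended:

```ruby
require 'delegate'

# Hypothetical class for illustration: wraps a String via delegation
# and carries a class-level setting with a default.
class Shout < DelegateClass(String)
  class << self
    attr_writer :suffix   # lets callers override the default

    def suffix            # hand-written reader supplies the default
      @suffix || "!"
    end
  end

  def shout
    upcase + self.class.suffix   # upcase is forwarded to the String
  end
end

puts Shout.new("hi").shout    # uses the default suffix
Shout.suffix = "?"
puts Shout.new("hi").shout    # uses the overridden suffix
```

Any method Shout does not define itself, like upcase or length, falls
through to the wrapped String, which is how QuizMail can still answer
everything TMail::Mail does.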

The solutions() method is a neat added feature of the code which allows you
to pass in a Ruby Quiz number in order to fetch all the solution emails for
that quiz. Here you can see some Hpricot parsing. Its XPath-in-Ruby style
syntax is used to pull the solution links off of the quiz page at
rubyquiz.com.
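If you don’t have Hpricot handy, the shape of the result is easy to see with
a rough stand-in. The sketch below swaps the parser for a couple of regular
expressions, and the sample markup is purely an assumption about what sits
under the element with the id of links; the helper name solution_links is
also made up:

```ruby
require 'cgi'

# Assumed markup, for illustration only. The real page may differ.
SAMPLE_PAGE = <<~HTML
  <ul id="links">
    <li><a href="http://example.com/msg/1">Louis J Scoras</a></li>
    <li><a href="http://example.com/msg/2">Tom &amp; Jerry</a></li>
  </ul>
HTML

# Hypothetical regex-based stand-in for the Hpricot query.
# Returns [name, url] pairs, just like solutions() does.
def solution_links(html)
  list = html[/<ul id="links">(.*?)<\/ul>/m, 1] || ''
  list.scan(/<a href="([^"]*)">(.*?)<\/a>/m).map do |href, text|
    [CGI.unescapeHTML(text), href]
  end
end

solution_links(SAMPLE_PAGE).each do |name, url|
  puts "#{name}: #{url}"
end
```

A real HTML parser copes with markup this regex would choke on, which is a
good reason to prefer the Hpricot version in the solution.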

Let’s get to the real meat of this class now:

# ...

    def initialize(mail)
      temp_path = to_temp_file(mail)
      boundary  = MIME::BoundaryFinder.new(temp_path).find_boundary

      @tmail = TMail::Mail.load(temp_path)
      @tmail.set_content_type 'multipart', 'mixed',
        'boundary' => boundary if boundary

      super(@tmail)
    end

    private

    def to_temp_file(mail)
      temp = Tempfile.new('qmail')

      temp.write(if (Integer(mail) rescue nil)
        url = self.class.archive_base_url + mail
        open(url) { |f| x = cleanse_html f.read }
      else
        web = URI.parse(mail).scheme == 'http'
        open(mail) { |m| web ? cleanse_html(m.read) : m.read }
      end)

      temp.close
      temp.path
    end

    def cleanse_html(str)
      CGI.unescapeHTML(
        str.gsub(/\A.*?<div id="header">/mi,'').gsub(/<[^>]*>/m, '')
      )
    end

  end

…

In initialize() the passed mail reference is fetched into a temporary file
and a special boundary search is performed, which we will examine in detail
in just a moment. The temp file is then handed off to TMail. After that a
content_type header is synthesized, as long as we found a boundary.

The actual fetch is made in to_temp_file(). The code that fills the Tempfile
is a little tricky there, but all it really does is recognize when we are
loading via the web so it can cleanse_html(). That method just strips the
tags around the message and unescapes entities.
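Here is a sketch of those two steps with the network taken out. The
reference_kind() helper is hypothetical; it only names the three cases that
to_temp_file() appears to tell apart. cleanse_html() is copied from the
solution, and the sample page is invented:

```ruby
require 'cgi'
require 'uri'

# Hypothetical helper: labels the kind of reference passed in.
def reference_kind(mail)
  return :message_number if (Integer(mail) rescue nil)
  URI.parse(mail).scheme == 'http' ? :web_page : :local_file
end

# Same two-step cleanup as cleanse_html() in the solution: drop
# everything up to the header div, strip tags, unescape entities.
def cleanse_html(str)
  CGI.unescapeHTML(
    str.gsub(/\A.*?<div id="header">/mi, '').gsub(/<[^>]*>/m, '')
  )
end

# Invented sample page, not real archive output.
page = %(<html><body><div id="header">Subject: hi</div>\n) +
       %(<pre>a &lt; b &amp;&amp; c</pre></body></html>)

puts reference_kind("241574")
puts reference_kind("http://www.rubyquiz.com/quiz115.html")
puts reference_kind("saved_message.eml")
puts cleanse_html(page)
```

Note that only message numbers and http URLs get cleansed; anything else is
treated as a local file and read as-is.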

Now we need to dig into that boundary problem I sidestepped earlier. The
messages on the web archives are missing their Content-type header and we
need to restore it in order to get TMail to accept the message. With
messages that contain attachments, that header should be multipart/mixed.
However, the header also points to a special boundary string that divides
the parts of the message. We have to find that string so we can set it in
the header.

The next class handles that operation:

…

  module MIME
    class BoundaryFinder
      def initialize(file)
        @elif = ::Elif.new(file)
        @in_attachment_headers = false
      end

      def find_boundary
        while line = @elif.gets
          if @in_attachment_headers
            if boundary = look_for_mime_boundary(line)
              return boundary
            end
          else
            look_for_attachment(line)
          end
        end
        nil
      end

      private

      def look_for_attachment line
        if line =~ /^content-disposition\s*:\s*attachment/i
          puts "Found an attachment" if $DEBUG
          @in_attachment_headers = true
        end
      end

      def look_for_mime_boundary line
        unless line =~ /^\S+\s*:\s*/ || # Not a mail header
               line =~ /^\s+/           # Continuation line?
          puts "I think I found it...#{line}" if $DEBUG
          line.strip.gsub(/^--/, '')
        else
          nil
        end
      end
    end

  end
end

…

This class is a trivial parser that hunts for the missing boundary. It uses
Elif to read the file backwards, watching for an attachment to come up. When
it detects that it is inside an attachment, it switches modes. In the new
mode it skips over headers and continuation lines until it reaches the first
line that doesn’t seem to be part of the headers. That’s the boundary.
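To watch that backwards scan work, here is a self-contained sketch of the
same idea. It walks an in-memory string in reverse instead of using Elif,
the find_boundary() function is a simplified stand-in for the class above,
and the sample message (with its boundary of frontier) is invented:

```ruby
# Invented sample: a message body whose Content-Type header, and so
# its boundary parameter, has been lost.
SAMPLE = <<~MAIL
  --frontier
  Content-Type: text/plain

  See attached.
  --frontier
  Content-Type: text/plain;
   name="solution.rb"
  Content-Disposition: attachment; filename="solution.rb"

  puts "hello"
  --frontier--
MAIL

# Simplified stand-in for BoundaryFinder: scan lines last to first.
def find_boundary(text)
  in_attachment_headers = false
  text.lines.reverse_each do |line|
    if in_attachment_headers
      next if line =~ /^\S+\s*:\s*/   # a header line, keep going
      next if line =~ /^\s+/          # continuation or blank line
      return line.strip.sub(/^--/, '')
    elsif line =~ /^content-disposition\s*:\s*attachment/i
      in_attachment_headers = true
    end
  end
  nil
end

boundary = find_boundary(SAMPLE)
# The header the solution needs to restore would look like this:
puts %(Content-Type: multipart/mixed; boundary="#{boundary}")
```

Reading backwards is what makes this simple: the first
Content-Disposition line you meet belongs to the last attachment, and the
delimiter sits just above that attachment’s headers.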

The rest of the code just puts these tools to work:

…

include Quiz115
include FileUtils

def process_mail(mailh, outdir)
  begin
    t = QuizMail.new(mailh)
    if t.has_attachments?
      t.attachments.each do |attachment|
        outpath = File.join(outdir, attachment.original_filename)
        puts "\tWriting: #{outpath}"
        File.open(outpath, 'w') do |out|
          out.puts attachment.read
        end
      end
    else
      outfile = File.join(outdir, 'solution.txt')
      File.open(outfile, 'w') {|f| f.write t.body}
    end
  rescue => e
    puts "Couldn't parse mail correctly. Sorry! (E: #{e})"
  end
end

def to_dirname(solver)
  solver.downcase.delete('!#$&*?(){}').gsub(/\s+/, '_')
end

…

process_mail() builds a QuizMail object out of the passed reference, then
copies the attachments from TMail to files in the indicated directory. If
the message has no attachments, you just get the message body in
solution.txt instead.

to_dirname() is a directory name sanitizer for when the code is downloading
the solutions from a quiz, as mentioned earlier.
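A quick way to see what the sanitizer does is to run it on a couple of
names. The method is copied from the solution; the second sample name is
made up to exercise the punctuation list:

```ruby
# Copied from the solution: lowercase, drop shell-unfriendly
# punctuation, and turn runs of whitespace into underscores.
def to_dirname(solver)
  solver.downcase.delete('!#$&*?(){}').gsub(/\s+/, '_')
end

puts to_dirname("Louis J Scoras")
puts to_dirname("Why?! (the lucky stiff)")
```

Only the listed characters are removed, so a name containing a slash or a
dot would still pass through untouched.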

Here’s the application code:

…

query = ARGV[0]
outdir = ARGV[1] || '.'

unless query
  $stderr.puts "You must specify either a ruby-talk message id, or a " \
               "quiz number (prefixed by 'q')"
  exit 1
end

if query =~ /\Aq/i
  quiz_number = query.sub(/\Aq/i, '')
  puts "Fetching all solutions for quiz ##{quiz_number}"

  QuizMail.solutions(quiz_number).each do |solver, url|
    puts "Fetching solution from #{solver}."

    dirname    = to_dirname(solver)
    solver_dir = File.join(outdir, dirname)

    mkdir_p solver_dir
    process_mail(url, solver_dir)

  end
else
  process_mail(query, outdir)
end

exit 0

This code just pulls in the arguments, and runs them through one of two
processes. If the number is prefixed with a q, the code scrapes rubyquiz.com
for that quiz number and pulls all the solutions. It creates a directory for
each solution, then processes each of those messages. Otherwise, it handles
just the individual message.
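The branching on the first argument can be pulled out and tried on its own.
This is only a sketch: parse_query() is a hypothetical helper that is not in
the solution, though it uses the same regular expression to spot the q
prefix:

```ruby
# Hypothetical helper: decides which of the two processes a
# command-line query selects, without doing any fetching.
def parse_query(query)
  if query =~ /\Aq/i
    [:quiz, query.sub(/\Aq/i, '')]   # e.g. "q115" -> quiz number "115"
  else
    [:message, query]                # anything else is a message reference
  end
end

p parse_query("q115")
p parse_query("Q42")
p parse_query("241574")
```

The prefix check is case insensitive, so q115 and Q115 select the same quiz.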

My thanks to those who helped me solve this problem for all quiz fans. We
now have an excellent resource to share with people who ask about retrieving
the garbled solutions.

Tomorrow, it’s back to fun and games for the quiz, but this time we’re on a
search for pure strategy…