I’ve been playing a little with TMail lately, which is what really inspired
this quiz. I thought that a simple solution to this problem would be to pull
the pages down with open-uri and then dump them into TMail and just pull the
attachments from that. It turns out to be a bit harder to do that than I
expected, but one solution did follow that path.
What I love about this plan is the fact that you are just stitching the real
tools together. I like leaning on libraries to get tons of functionality with
just a few lines of code. Apparently, so does Louis J Scoras! Check out this
list of dependencies that kick-starts his solution (I’ve removed the excellent
comments in the code to save space):
#!/usr/bin/env ruby
require 'action_mailer'
require 'cgi'
require 'delegate'
require 'elif'
require 'fileutils'
require 'hpricot'
require 'open-uri'
require 'tempfile'
…
Wow.
Let’s start with the standard libraries. Louis pulls in cgi to handle HTML
escapes, delegate to wrap existing classes, fileutils for easy directory
creation, open-uri to fetch web pages, and tempfile for creating temporary
files, of course. That’s an impressive set of tools, all of which ship with
Ruby.
The other three dependencies are external. You can get them all as gems.
action_mailer is a component of the Rails framework used to handle email.
Louis doesn’t actually use the action_mailer part, just the bundled TMail
dependency. This is a trick for getting TMail as a gem.
elif is a little library I wrote as a solution to an earlier quiz (#64). It
reads files line by line, but in reverse order. In other words, you get the
last line first, then the next to last line, all the way up to the first line.
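To make the idea concrete, here is a minimal sketch of reverse line reading using only the standard library. It is not how Elif works internally: this version simply loads the whole file into memory, which is fine for a single email but wasteful for big files. The helper name lines_reversed is made up for this example.

```ruby
require 'tempfile'

# Hypothetical helper (not part of Elif): returns a file's lines, last first.
# Assumption: the file is small enough to read into memory in one go.
def lines_reversed(path)
  File.readlines(path).reverse
end

# Build a small throwaway file to read back.
sample = Tempfile.new('reverse_demo')
sample.write("first\nsecond\nthird\n")
sample.close

reversed = lines_reversed(sample.path).map(&:chomp)
puts reversed.inspect
sample.unlink
```

The appeal of reading backwards will show up later: attachments sit at the end of a message, so a reverse scan reaches them first.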
hpricot is a fun little HTML parser from Why the Lucky Stiff. It has a unique
interface that makes it popular for web scraping.
Now that Louis has imported all the tools he could find, he’s ready to do some
fetching. Here’s the start of that code:
module Quiz115
  class QuizMail < DelegateClass(TMail::Mail)
    class << self
      attr_reader :archive_base_url
      def archive_base_url
        @archive_base_url ||
          "http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/"
      end
      def solutions(quiz_number)
        doc = Hpricot(
          open("http://www.rubyquiz.com/quiz#{quiz_number}.html")
        )
        (doc/'#links'/'li/a').collect do |link|
          [CGI.unescapeHTML(link.inner_text), link['href']]
        end
      end
    end
    # ...
The object we are examining is a TMail enhancement, built via delegation. This
section adds some class methods for convenience. I believe the attr_reader
line was actually intended to be attr_writer though, giving you a way to
override the base URL. The reader is defined manually and just defaults to
the Ruby Talk mailing list archive.
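Here is a small sketch of that pattern as I read it: a delegating class with a class-level writer and a hand-written reader that supplies the default. ArchiveMessage is a made-up stand-in that wraps a String rather than a TMail::Mail, so it runs without any gems. It illustrates the idea, not Louis’s actual class.

```ruby
require 'delegate'

# Hypothetical stand-in for QuizMail: wraps a String instead of a TMail::Mail
# so the example runs with the standard library alone.
class ArchiveMessage < DelegateClass(String)
  class << self
    # attr_writer (rather than attr_reader) generates archive_base_url=,
    # which lets callers point the class at a different archive.
    attr_writer :archive_base_url

    # Hand-written reader that falls back to the default archive URL.
    def archive_base_url
      @archive_base_url ||
        "http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/"
    end
  end

  def initialize(text)
    super(text) # String methods are now forwarded to text
  end
end

default_url = ArchiveMessage.archive_base_url
ArchiveMessage.archive_base_url = "http://example.com/archive/"
custom_url = ArchiveMessage.archive_base_url

message = ArchiveMessage.new("subject: quiz")
puts default_url, custom_url, message.upcase
```

With attr_reader in place of attr_writer, the assignment line would raise NoMethodError, which is why I suspect the writer was the intent.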
The solutions() method is a neat added feature of the code which allows you
to pass in a Ruby Quiz number in order to fetch all the solution emails for
that quiz. Here you can see some Hpricot parsing. Its XPath-in-Ruby style
syntax is used to pull the solution links off of the quiz page at
rubyquiz.com.
Let’s get to the real meat of this class now:
    # ...
    def initialize(mail)
      temp_path = to_temp_file(mail)
      boundary = MIME::BoundaryFinder.new(temp_path).find_boundary
      @tmail = TMail::Mail.load(temp_path)
      @tmail.set_content_type 'multipart', 'mixed',
        'boundary' => boundary if boundary
      super(@tmail)
    end
    private
    def to_temp_file(mail)
      temp = Tempfile.new('qmail')
      temp.write(if (Integer(mail) rescue nil)
        url = self.class.archive_base_url + mail
        open(url) { |f| x = cleanse_html f.read }
      else
        web = URI.parse(mail).scheme == 'http'
        open(mail) { |m| web ? cleanse_html(m.read) : m.read }
      end)
      temp.close
      temp.path
    end
    def cleanse_html(str)
      CGI.unescapeHTML(
        str.gsub(/\A.*?<div id="header">/mi,'').gsub(/<[^>]*>/m, '')
      )
    end
  end
…
In initialize() the passed mail reference is fetched into a temporary file and
a special boundary search is performed, which we will examine in detail in
just a moment. The temp file is then handed off to TMail. After that a
content_type header is synthesized, as long as we found a boundary.
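The temp file step follows a common write, close, reopen-by-path pattern. Below is a cautious variation on it rather than the solution’s own code: since Ruby removes a temp file once its Tempfile object is garbage collected, this sketch returns the object itself instead of only the path. The name text_to_temp_file is made up for the example.

```ruby
require 'tempfile'

# Hypothetical variation on to_temp_file(): returns the Tempfile object
# instead of only its path. Ruby removes a temp file once its Tempfile
# object is garbage collected, so holding the object keeps the file alive.
def text_to_temp_file(text)
  temp = Tempfile.new('qmail')
  temp.write(text)
  temp.close # flush to disk; the file stays until unlink or GC
  temp
end

temp = text_to_temp_file("Subject: example\n\nbody\n")
round_trip = File.read(temp.path) # any path-based reader can open it now
puts round_trip
temp.unlink # explicit cleanup once we are done
```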
The actual fetch is made in to_temp_file(). The code that fills the Tempfile
is a little tricky there, but all it really does is recognize when we are
loading via the web so it can cleanse_html(). That method just strips the
tags around the message and unescapes entities.
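To see those two pieces in isolation, here is a sketch that runs without touching the network. source_kind is a hypothetical helper that mirrors the branches in to_temp_file(), and the HTML page is invented for the example; I am only assuming the real archive pages put the message after a div with the id "header", as the regular expression suggests.

```ruby
require 'cgi'
require 'uri'

# Hypothetical helper mirroring the branches in to_temp_file().
def source_kind(ref)
  return :message_id if (Integer(ref) rescue nil) # a bare number: archive id
  URI.parse(ref).scheme == 'http' ? :web : :local
end

# Same two steps as the solution's cleanse_html(): drop everything up to the
# header div, strip the remaining tags, then unescape HTML entities.
def cleanse_html(str)
  CGI.unescapeHTML(
    str.gsub(/\A.*?<div id="header">/mi, '').gsub(/<[^>]*>/m, '')
  )
end

# Invented page layout, for illustration only.
page = <<~HTML
  <html><head><title>archive</title></head><body>
  <div id="header">
  <pre>From: Someone &lt;someone@example.com&gt;
  Subject: [QUIZ] test &amp; more
  </pre></div></body></html>
HTML

kinds = ["235723", "http://example.com/msg.html", "mail/saved.txt"].map do |ref|
  source_kind(ref)
end
cleaned = cleanse_html(page)

puts kinds.inspect
puts cleaned
```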
Now we need to dig into that boundary problem I sidestepped earlier. The
messages on the web archives are missing their Content-type header and we need
to restore it in order to get TMail to accept the message. With messages that
contain attachments, that header should be multipart/mixed. However, the
header also points to a special boundary string that divides the parts of the
message. We have to find that string so we can set it in the header.
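If you haven’t looked inside a multipart message before, this sketch shows how the header and the body relate. The boundary string and the message are made up, and the splitting is deliberately naive (a real MIME parser handles far more cases), but it shows why the parts can’t be separated until the boundary is known.

```ruby
# Made-up boundary and message body, purely for illustration.
boundary = "frontier"

body = <<~BODY
  This is a multi-part message in MIME format.
  --frontier
  Content-Type: text/plain

  Here is my solution.
  --frontier
  Content-Type: text/plain
  Content-Disposition: attachment; filename="solution.rb"

  puts "hi"
  --frontier--
BODY

# The kind of header that is missing from the archived copy.
header = %(Content-Type: multipart/mixed; boundary="#{boundary}")

# Each part starts after a "--boundary" line; "--boundary--" ends the message.
delimiter = /^--#{Regexp.escape(boundary)}(?:--)?\n/
preamble, *parts = body.split(delimiter)

puts header
puts "#{parts.size} parts"
```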
The next class handles that operation:
…
  module MIME
    class BoundaryFinder
      def initialize(file)
        @elif = ::Elif.new(file)
        @in_attachment_headers = false
      end
      def find_boundary
        while line = @elif.gets
          if @in_attachment_headers
            if boundary = look_for_mime_boundary(line)
              return boundary
            end
          else
            look_for_attachment(line)
          end
        end
        nil
      end
      private
      def look_for_attachment line
        if line =~ /^content-disposition\s*:\s*attachment/i
          puts "Found an attachment" if $DEBUG
          @in_attachment_headers = true
        end
      end
      def look_for_mime_boundary line
        unless line =~ /^\S+\s*:\s*/ || # Not a mail header
               line =~ /^\s+/           # Continuation line?
          puts "I think I found it...#{line}" if $DEBUG
          line.strip.gsub(/^--/, '')
        else
          nil
        end
      end
    end
  end
end
…
This class is a trivial parser that hunts for the missing boundary. It uses
Elif to read the file backwards, watching for an attachment to come up. When
it detects that it is inside an attachment, it switches modes. In the new
mode it skips over headers and continuation lines until it reaches the first
line that doesn’t seem to be part of the headers. That’s the boundary.
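The same two-mode scan can be tried out without Elif or a file on disk. This is my own condensed sketch of the idea, working on an in-memory string with a made-up message. The function name and the sample data are assumptions, not part of the solution.

```ruby
# Hypothetical, Elif-free version of the scan: walks an in-memory message
# from the last line to the first.
def find_boundary(text)
  in_attachment_headers = false
  text.lines.reverse_each do |line|
    if in_attachment_headers
      header       = line =~ /^\S+\s*:\s*/ # looks like "Name: value"
      continuation = line =~ /^\s+/        # indented (or blank) line
      return line.strip.sub(/^--/, '') unless header || continuation
    elsif line =~ /^content-disposition\s*:\s*attachment/i
      in_attachment_headers = true # now climbing the attachment's headers
    end
  end
  nil
end

# Made-up message with one attachment.
message = <<~MAIL
  From: someone@example.com
  Subject: [QUIZ] solution

  --frontier
  Content-Type: text/plain

  Here is my solution.
  --frontier
  Content-Type: text/plain;
    name="solution.rb"
  Content-Disposition: attachment; filename="solution.rb"

  puts "hi"
  --frontier--
MAIL

found   = find_boundary(message)
missing = find_boundary("Subject: plain\n\nNo attachments here.\n")
puts found.inspect, missing.inspect
```

Reading backwards pays off here: the scan meets the last attachment’s Content-Disposition line first, and the delimiter sits just above that attachment’s headers.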
The rest of the code just puts these tools to work:
…
include Quiz115
include FileUtils
def process_mail(mailh, outdir)
  begin
    t = QuizMail.new(mailh)
    if t.has_attachments?
      t.attachments.each do |attachment|
        outpath = File.join(outdir, attachment.original_filename)
        puts "\tWriting: #{outpath}"
        File.open(outpath, 'w') do |out|
          out.puts attachment.read
        end
      end
    else
      outfile = File.join(outdir, 'solution.txt')
      File.open(outfile, 'w') {|f| f.write t.body}
    end
  rescue => e
    puts "Couldn't parse mail correctly. Sorry! (E: #{e})"
  end
end
def to_dirname(solver)
  solver.downcase.delete('!#$&*?(){}').gsub(/\s+/, '_')
end
…
process_mail() builds a QuizMail object out of the passed reference number,
then copies the attachments from TMail to files in the indicated directory.
If the message has no attachments, you just get the full message instead.
to_dirname() is a directory name sanitizer for when the code is downloading
the solutions from a quiz, as mentioned earlier.
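As a quick check of what the sanitizer produces, here it is run on a couple of names. The method body is the one from the solution; the second name is invented to exercise the punctuation stripping.

```ruby
# Same body as the solution's to_dirname().
def to_dirname(solver)
  solver.downcase.delete('!#$&*?(){}').gsub(/\s+/, '_')
end

# The second name is made up, for illustration only.
names = ["Louis J Scoras", "Some (Other) Solver!"]
dirs  = names.map { |name| to_dirname(name) }
puts dirs
```

Note that it only removes the listed punctuation, so a name containing a slash would still need extra care before being used as a directory.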
Here’s the application code:
…
query = ARGV[0]
outdir = ARGV[1] || '.'
unless query
  $stderr.puts "You must specify either a ruby-talk message id, or a " \
               "quiz number (prefixed by 'q')"
  exit 1
end
if query =~ /\Aq/i
  quiz_number = query.sub(/\Aq/i, '')
  puts "Fetching all solutions for quiz ##{quiz_number}"
  QuizMail.solutions(quiz_number).each do |solver, url|
    puts "Fetching solution from #{solver}."
    dirname = to_dirname(solver)
    solver_dir = File.join(outdir, dirname)
    mkdir_p solver_dir
    process_mail(url, solver_dir)
  end
else
  process_mail(query, outdir)
end
exit 0
This code just pulls in the arguments, and runs them through one of two
processes. If the number is prefixed with a q, the code scrapes rubyquiz.com
for that quiz number and pulls all the solutions. It creates a directory for
each solution, then processes each of those messages. Otherwise, it handles
just the individual message.
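The argument check is easy to pull out and try by itself. parse_query below is a hypothetical helper, not something in the solution, that applies the same regular expression as the script’s if/else.

```ruby
# Hypothetical helper: classifies the first command-line argument the same
# way the script's if/else does.
def parse_query(query)
  if query =~ /\Aq/i
    [:quiz, query.sub(/\Aq/i, '')] # "q115" means quiz number "115"
  else
    [:message, query]              # anything else is a single message
  end
end

results = ["q115", "Q64", "235723"].map { |arg| parse_query(arg) }
puts results.inspect
```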
My thanks to those who helped me solve this problem for all quiz fans. We now
have an excellent resource to share with people who ask about retrieving the
garbled solutions.
Tomorrow, it’s back to fun and games for the quiz, but this time we’re on a
search for pure strategy…