Apologies for the late summary… Very interesting techniques in the
submissions!
To begin evaluating the solutions for this quiz, I ran each solution
over the original submission text of each submission. In a way, this
would probably be the most difficult test, seeing as how the solutions
discussed themselves, with sample inputs and outputs. What parts
should be considered code and which text? If I can’t figure it out, I
can’t really expect an algorithm to do that. Still, it was interesting
to see the results; in particular, what I did want to see, if
nothing else, was that the actual code for the submission was
identified as code.
And, except for one line, both worked well at identifying the main
body of code as code… (The solution provided by Jesse Brown
identified it’s own last line – out.close
– as text rather than
code.) The results for the discussion portions of the submissions were
mixed, mostly from false positives. After a quick examination, though,
I believe Frederick C. was more consistent in differentiating
code from commentary.
Perhaps the reason was the approach. As Frederick described:
I got lazy and thought I’d let ruby do the hard work. Given some
text, I feed it through [eval] and see if an exception is raised
or not (this does have limitations).
This seems to be a generally good approach. Most discussion will
contain words or punctuation that I would not expect to pass safely
through Ruby’s parser, and so the number of incorrect categorizations
of text should be minimal. Frederick also notes the sorts of cases
that are identified incorrectly as code; a more complete solution,
perhaps integrating some ideas from solutions such as Jesse’s, would
accommodate those classes. (Anyone want to write a meta-classifier?)
Let’s take a look at Frederick’s solution. We start with initialization:
class CodeExtractor
attr_reader :lines, :output
def initialize(text)
@output = []
@lines = text.split(/[\r\n]/)
end
def extract
while lines.any?
process lines
end
end
# ...
def self.extract(text)
c= CodeExtractor.new text
c.extract
c.format_output
nil
end
end
# Here's how I used Frederick's CodeExtractor class:
CodeExtractor.extract(ARGF)
Pretty straightforward. A new instance of CodeExtractor is created via
the extract
class method, the input text is split on newline and
carriage-return characters and saved in @lines
. Then the instance
method extract
is called, which repeatedly calls process
on all
remaining lines (via the accessor created by attr_reader
) until none
remain.
So what does it mean to process?
def process lines
if (prefix_length = valid_code_prefix_length lines) > 0
prefix_length.times { append_output lines.shift, true }
else
append_output lines.shift, false
end
end
def append_output(line, code)
@output << Struct::Line.new(line, code)
end
A test is performed: valid_code_prefix_length
(we’ll come back to
that). Assuming the test result is positive, we remove that many lines
from @lines
and append them to @output
(via lines.shift
and the
append_output
method). Each line of text moved in this process is
wrapped in the Line
structure, which includes a flag indicating that
these lines are code.
When valid_code_prefix_length
returns zero, only the first line is
moved as such, the flag set to indicate that it is not code.
Eventually, all lines will have been moved from @lines
to @output
,
identified either as code or not.
Let’s continue, looking next at valid_code_prefix_length
:
# returns the maximum number of lines (contiguous from
# the start) that are valid ruby
def valid_code_prefix_length lines
max_valid_lines = 0
code = ""
lines.each_with_index do |line, index|
code << line
code << "\n"
if valid_syntax? code
max_valid_lines = index + 1
end
end
return max_valid_lines
end
valid_code_prefix_length
takes as input an array of lines; in this
case, the input data we are processing. Adding a line at a time to
local variable code
, it calls valid_syntax?
to make a guess as to
whether code
is truly code. The loop (each_with_index
) continues
in this fashion, increasing max_valid_lines
each time around, until
valid_syntax?
returns false. At that time, max_valid_lines
is
returned, indicating how many adjacent lines of code were found.
Finally, we get to valid_syntax?
, the core of Frederick’s code
identification.
def valid_syntax?(code)
io = StringIO.new
original_err, $stderr= $stderr, io
eval("BEGIN {return true}\n#{code}")
raise 'here'
rescue Exception
false
ensure
$stderr = original_err
return false if io.length > 0
end
The first bit here is to create a StringIO
object: an object that
acts like an IO
stream, but holds a string. This is assigned to
$stderr
(while the old value is simultaneously remembered in
original_err
). Reassigning $stderr
allows us to examine any errors
that occur when we call eval
on the next line. You can also see, at
the end of valid_syntax?
, that the original value of $stderr
is
restored, whether any error occurs or not.
So what is going on in that eval
? Obviously, the code to test is
included, but so is BEGIN { return true }
. I admit… I had to
search the manual for this one. What BEGIN
does is to evaluate its
block before anything else in the file (or, in this case, the string)
is evaluated. Since the block is simply return true
, the evaluation
exits before the argument is run.
What this means is that code
is parsed enough to be syntax checked,
but not actually executed. This, of course, matches Frederick’s
description in his submission. If the parse is successful, no
exceptions will be raised, and the { return true }
block will be
executed, causing valid_syntax?
to return true. If the parse fails,
an exception is thrown, which we see provides a return value of false.
(From what I understand, it seems there is some redundancy here…
both in the call to raise
and the check of io.length
… since the
evaluation should either immediately return true from a successful
parse, or it is not successful. Still, perhaps there are cases that
Frederick ran across that necessitates this pattern?)