Code Heuristics (#172)

Apologies for the late summary… Very interesting techniques in the
submissions!

To begin evaluating the solutions for this quiz, I ran each solution
over the original submission text of each submission. In a way, this
would probably be the most difficult test, seeing as how the solutions
discussed themselves, with sample inputs and outputs. What parts
should be considered code and which text? If I can’t figure it out, I
can’t really expect an algorithm to do that. Still, it was interesting
to see the results; in particular, what I did want to see, if
nothing else, was that the actual code for the submission was
identified as code.

And, except for one line, both worked well at identifying the main
body of code as code… (The solution provided by Jesse Brown
identified it’s own last line – out.close – as text rather than
code.) The results for the discussion portions of the submissions were
mixed, mostly from false positives. After a quick examination, though,
I believe Frederick C. was more consistent in differentiating
code from commentary.

Perhaps the reason was the approach. As Frederick described:

I got lazy and thought I’d let ruby do the hard work. Given some
text, I feed it through [eval] and see if an exception is raised
or not (this does have limitations).

This seems to be a generally good approach. Most discussion will
contain words or punctuation that I would not expect to pass safely
through Ruby’s parser, and so the number of incorrect categorizations
of text should be minimal. Frederick also notes the sorts of cases
that are identified incorrectly as code; a more complete solution,
perhaps integrating some ideas from solutions such as Jesse’s, would
accommodate those classes. (Anyone want to write a meta-classifier?)

Let’s take a look at Frederick’s solution. We start with initialization:

class CodeExtractor
   attr_reader :lines, :output

   def initialize(text)
     @output = []
     @lines = text.split(/[\r\n]/)
   end

   def extract
     while lines.any?
       process lines
     end
   end

   # ...

   def self.extract(text)
     c= CodeExtractor.new text
     c.extract
     c.format_output
     nil
   end
end

# Here's how I used Frederick's CodeExtractor class:
CodeExtractor.extract(ARGF)

Pretty straightforward. A new instance of CodeExtractor is created via
the extract class method, the input text is split on newline and
carriage-return characters and saved in @lines. Then the instance
method extract is called, which repeatedly calls process on all
remaining lines (via the accessor created by attr_reader) until none
remain.

So what does it mean to process?

   def process lines
     if (prefix_length = valid_code_prefix_length lines) > 0
       prefix_length.times { append_output lines.shift, true }
     else
       append_output lines.shift, false
     end
   end

   def append_output(line, code)
     @output << Struct::Line.new(line, code)
   end

A test is performed: valid_code_prefix_length (we’ll come back to
that). Assuming the test result is positive, we remove that many lines
from @lines and append them to @output (via lines.shift and the
append_output method). Each line of text moved in this process is
wrapped in the Line structure, which includes a flag indicating that
these lines are code.

When valid_code_prefix_length returns zero, only the first line is
moved as such, the flag set to indicate that it is not code.
Eventually, all lines will have been moved from @lines to @output,
identified either as code or not.

Let’s continue, looking next at valid_code_prefix_length:

   # returns the maximum number of lines (contiguous from
   # the start) that are valid ruby
   def valid_code_prefix_length lines
     max_valid_lines = 0
     code = ""
     lines.each_with_index do |line, index|
       code << line
       code << "\n"
       if valid_syntax? code
         max_valid_lines = index + 1
       end
     end
     return max_valid_lines
   end

valid_code_prefix_length takes as input an array of lines; in this
case, the input data we are processing. Adding a line at a time to
local variable code, it calls valid_syntax? to make a guess as to
whether code is truly code. The loop (each_with_index) continues
in this fashion, increasing max_valid_lines each time around, until
valid_syntax? returns false. At that time, max_valid_lines is
returned, indicating how many adjacent lines of code were found.

Finally, we get to valid_syntax?, the core of Frederick’s code
identification.

   def valid_syntax?(code)
     io = StringIO.new
     original_err, $stderr= $stderr, io
     eval("BEGIN {return true}\n#{code}")
     raise 'here'
   rescue Exception
     false
   ensure
     $stderr = original_err
     return false if io.length > 0
   end

The first bit here is to create a StringIO object: an object that
acts like an IO stream, but holds a string. This is assigned to
$stderr (while the old value is simultaneously remembered in
original_err). Reassigning $stderr allows us to examine any errors
that occur when we call eval on the next line. You can also see, at
the end of valid_syntax?, that the original value of $stderr is
restored, whether any error occurs or not.

So what is going on in that eval? Obviously, the code to test is
included, but so is BEGIN { return true }. I admit… I had to
search the manual for this one. What BEGIN does is to evaluate its
block before anything else in the file (or, in this case, the string)
is evaluated. Since the block is simply return true, the evaluation
exits before the argument is run.

What this means is that code is parsed enough to be syntax checked,
but not actually executed
. This, of course, matches Frederick’s
description in his submission. If the parse is successful, no
exceptions will be raised, and the { return true } block will be
executed, causing valid_syntax? to return true. If the parse fails,
an exception is thrown, which we see provides a return value of false.

(From what I understand, it seems there is some redundancy here…
both in the call to raise and the check of io.length… since the
evaluation should either immediately return true from a successful
parse, or it is not successful. Still, perhaps there are cases that
Frederick ran across that necessitates this pattern?)

On 10 Aug 2008, at 23:28, Matthew M. wrote:

(From what I understand, it seems there is some redundancy here…
both in the call to raise and the check of io.length… since the
evaluation should either immediately return true from a successful
parse, or it is not successful. Still, perhaps there are cases that
Frederick ran across that necessitates this pattern?)

The call to raise is just something left in there by accident -
nothing interesting there :slight_smile:
Checking io.length is to check that no warnings were present.
Syntactically something like

hi how are you

is valid, but you will get warnings about how you should parenthesize
stuff.

Fred