Code Heuristics (#172)

Apologies for the late summary… Very interesting techniques in the
submissions!

To begin evaluating the solutions for this quiz, I ran each solution
over the original submission text of each submission. In a way, this
would probably be the most difficult test, seeing as how the solutions
discussed themselves, with sample inputs and outputs. What parts
should be considered code and which text? If I can’t figure it out, I
can’t really expect an algorithm to do that. Still, it was interesting
to see the results; in particular, what I did want to see, if
nothing else, was that the actual code for the submission was
identified as code.

And, except for one line, both worked well at identifying the main
body of code as code… (The solution provided by Jesse Brown
identified it’s own last line – `out.close` – as text rather than
code.) The results for the discussion portions of the submissions were
mixed, mostly from false positives. After a quick examination, though,
I believe Frederick C. was more consistent in differentiating
code from commentary.

Perhaps the reason was the approach. As Frederick described:

I got lazy and thought I’d let ruby do the hard work. Given some
text, I feed it through [eval] and see if an exception is raised
or not (this does have limitations).

This seems to be a generally good approach. Most discussion will
contain words or punctuation that I would not expect to pass safely
through Ruby’s parser, and so the number of incorrect categorizations
of text should be minimal. Frederick also notes the sorts of cases
that are identified incorrectly as code; a more complete solution,
perhaps integrating some ideas from solutions such as Jesse’s, would
accommodate those classes. (Anyone want to write a meta-classifier?)

``````class CodeExtractor

def initialize(text)
@output = []
@lines = text.split(/[\r\n]/)
end

def extract
while lines.any?
process lines
end
end

# ...

def self.extract(text)
c= CodeExtractor.new text
c.extract
c.format_output
nil
end
end

# Here's how I used Frederick's CodeExtractor class:
CodeExtractor.extract(ARGF)
``````

Pretty straightforward. A new instance of CodeExtractor is created via
the `extract` class method, the input text is split on newline and
carriage-return characters and saved in `@lines`. Then the instance
method `extract` is called, which repeatedly calls `process` on all
remaining lines (via the accessor created by `attr_reader`) until none
remain.

So what does it mean to process?

``````   def process lines
if (prefix_length = valid_code_prefix_length lines) > 0
prefix_length.times { append_output lines.shift, true }
else
append_output lines.shift, false
end
end

def append_output(line, code)
@output << Struct::Line.new(line, code)
end
``````

A test is performed: `valid_code_prefix_length` (we’ll come back to
that). Assuming the test result is positive, we remove that many lines
from `@lines` and append them to `@output` (via `lines.shift` and the
`append_output` method). Each line of text moved in this process is
wrapped in the `Line` structure, which includes a flag indicating that
these lines are code.

When `valid_code_prefix_length` returns zero, only the first line is
moved as such, the flag set to indicate that it is not code.
Eventually, all lines will have been moved from `@lines` to `@output`,
identified either as code or not.

Let’s continue, looking next at `valid_code_prefix_length`:

``````   # returns the maximum number of lines (contiguous from
# the start) that are valid ruby
def valid_code_prefix_length lines
max_valid_lines = 0
code = ""
lines.each_with_index do |line, index|
code << line
code << "\n"
if valid_syntax? code
max_valid_lines = index + 1
end
end
return max_valid_lines
end
``````

`valid_code_prefix_length` takes as input an array of lines; in this
case, the input data we are processing. Adding a line at a time to
local variable `code`, it calls `valid_syntax?` to make a guess as to
whether `code` is truly code. The loop (`each_with_index`) continues
in this fashion, increasing `max_valid_lines` each time around, until
`valid_syntax?` returns false. At that time, `max_valid_lines` is
returned, indicating how many adjacent lines of code were found.

Finally, we get to `valid_syntax?`, the core of Frederick’s code
identification.

``````   def valid_syntax?(code)
io = StringIO.new
original_err, \$stderr= \$stderr, io
eval("BEGIN {return true}\n#{code}")
raise 'here'
rescue Exception
false
ensure
\$stderr = original_err
return false if io.length > 0
end
``````

The first bit here is to create a `StringIO` object: an object that
acts like an `IO` stream, but holds a string. This is assigned to
`\$stderr` (while the old value is simultaneously remembered in
`original_err`). Reassigning `\$stderr` allows us to examine any errors
that occur when we call `eval` on the next line. You can also see, at
the end of `valid_syntax?`, that the original value of `\$stderr` is
restored, whether any error occurs or not.

So what is going on in that `eval`? Obviously, the code to test is
included, but so is `BEGIN { return true }`. I admit… I had to
search the manual for this one. What `BEGIN` does is to evaluate its
block before anything else in the file (or, in this case, the string)
is evaluated. Since the block is simply `return true`, the evaluation
exits before the argument is run.

What this means is that `code` is parsed enough to be syntax checked,
but not actually executed
. This, of course, matches Frederick’s
description in his submission. If the parse is successful, no
exceptions will be raised, and the `{ return true }` block will be
executed, causing `valid_syntax?` to return true. If the parse fails,
an exception is thrown, which we see provides a return value of false.

(From what I understand, it seems there is some redundancy here…
both in the call to `raise` and the check of `io.length`… since the
evaluation should either immediately return true from a successful
parse, or it is not successful. Still, perhaps there are cases that
Frederick ran across that necessitates this pattern?)

On 10 Aug 2008, at 23:28, Matthew M. wrote:

(From what I understand, it seems there is some redundancy here…
both in the call to `raise` and the check of `io.length`… since the
evaluation should either immediately return true from a successful
parse, or it is not successful. Still, perhaps there are cases that
Frederick ran across that necessitates this pattern?)

The call to raise is just something left in there by accident -
nothing interesting there
Checking io.length is to check that no warnings were present.
Syntactically something like

hi how are you

is valid, but you will get warnings about how you should parenthesize
stuff.

Fred