Code Heuristics (#172)

mansfiem · August 1, 2008, 4:48pm

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

The three rules of Ruby Q. 2:

Please do not post any solutions or spoiler discussion for this
quiz until 48 hours have passed from the time on this message.
Support Ruby Q. 2 by submitting ideas as often as you can! (A
permanent, new website is in the works for Ruby Q. 2. Until then,
please visit the temporary website at

http://splatbang.com/rubyquiz/.
Enjoy!

Suggestion: A [QUIZ] in the subject of emails about the problem
helps everyone on Ruby T. follow the discussion. Please reply to
the original quiz message, if you can.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Code Heuristics (#172)

This week, your task is to make my job simpler.

Each week, coders send in their submissions to Ruby Q. problems,
usually as a mix of quiz discussion and actual code. Your task, then,
is to take a submission as input, and generate output that is the
extracted code. Every line of the input that isn’t code should be
prefixed with the comment marker, #.

Take the following submission to quiz #158:

I may have missed it, but I didn't see this solution:

require 'enumerator'
"Hello world!\n".enum_for(:each_byte).inject($stdout) { |res, b|
  res << b.chr
}

Which in 1.9 looks much better:

"Hello world!\n".each_char.inject($stdout) { |res, c|
  res << c
}

Stefano

Your code should take that as input and generate this output:

# I may have missed it, but I didn't see this solution:

require 'enumerator'
"Hello world!\n".enum_for(:each_byte).inject($stdout) { |res, b|
  res << b.chr
}

# Which in 1.9 looks much better:

"Hello world!\n".each_char.inject($stdout) { |res, c|
  res << c
}

# Stefano

Assuming you knew for certain which lines should be commented and
which should not, your submission would be little more than a
glorified cat. This quiz is more about identifying Ruby-looking
text: what heuristics or patterns will you use to determine what lines
or groups of lines are Ruby code or not.

Fortunately, there is a large sample set from which you can
experiment: all existing Ruby Q. submissions. Unfortunately – for
this quiz, not in general – Ruby is very dynamic and flexible, which
can make code identification a difficult problem. So I am not
expecting perfection here, but I want to see what tricks, or general
process, you can come up with for this problem.

Extra credit: Provide an optional command-line argument to “split” the
output into multiple files. Each separate file would be a different
chunk of the code, where code is separated by discussion. Commented
discussion should be kept with the code that follows it.

mansfiem · August 3, 2008, 5:33am

So, based on the sample test case, do we have to worry about Ruby 1.9 as
well? and the mixed case? That would make this test quite a bit tougher!

mansfiem · August 3, 2008, 5:00pm

On 1 Aug 2008, at 15:45, Matthew M. wrote:

is to take a submission as input, and generate output that is the
extracted code. Every line of the input that isn’t code should be
prefixed with the comment marker, #.

I got lazy and thought I’d let ruby do the hard work. Given some text,
I feed it through

eval(“BEGIN {return true}\n#{code}”, nil)

and see if an exception is raised or not (this does have limitations).
It’s not enough to do this line by line, for example

if foo
puts “bar”
end

the first and last lines are not, on their own, valid ruby, but as a
whole it is of course valid.

For a given chunk of text we first try and find the maximal prefix
(prefix isn’t quite the right word, since we only split at lines) that
is valid ruby.
To do this we take an empty string and just add a line of the input at
a time, running eval each time to see if what we have is valid.

if there is no such prefix, then the first line must be comment text
and so we tag that line as a comment. We remove the line and repeat
processing on the remaining lines.
if there is such a prefix then that prefix is tagged as code, we
remove it and process the remaining lines.

The output formatter sort of does the splitting into separate files -
it prints a mark to the screen where it would split (I was too lazy to
start messing around with files).

What this code doesn’t deal well with is lines with not much text, for
example:

I think this does it:
if foo
bar
end
Fred

The line Fred is marked as code, because that is perfectly legal ruby,
it’s just the value of the constant Fred. Of course that would
probably blowup if you actually evaluated that line but my valid code
detector can’t handle that (I can’t really think how you could handle
this with true certainty without actually executing the code).

Sequences like

hope this helps

also look like legal code (but will produce warnings). To get around
this we require that our evaling produces no warnings (and thus we
trust that ruby quiz submitters squash warnings from their code :-)).
Another limitation was that if you were trying to say ‘this works in
ruby 1.8 but only this works in 1.9’ then this solution would fail if
the 1.9 code used some ruby 1.9 specific bit of syntax (obviously if
the example just uses differences in the standard library that is
irrelevant) and you were running this script on ruby 1.8

Fred

Usage: (example stolen from Mikael Hoilund’s submission to 171)

CodeExtractor.extract(<<TEXT
Oh hi, I just thought I’d golf a solution. I’m sure other people can
do a much better job than I making a full hexdumping suite, so I just
had some fun. Can’t seem to get it lower than 78 characters,
unfortunately.

i=0;$<.read.scan(/.{0,16}/m){puts"%08x "%i+$&.unpack(‘H4’*8).join(’
');i+=16}

Expanded and parenthesified, clarified:

i = 0
ARGF.read.scan(/.{0,16}/m) {
puts(("%08x " % i) + $&.unpack(‘H4’*8).join(’ '))
i += 16
}

ARGF (aliased as $<) is the file handle of all file names given in the
arguments concatenated, STDIN if none — exactly what we need. The
regex to scan matches between 0 and 16 characters (including newline)
greedily. Change it to 1,16 if you don’t want the empty line at the end.

Instead of letting the block to scan take an argument, I used a trick
I picked up from the last Ruby Q. I participated in (Obfuscated
Email), and use $& inside the block, which is the last regex match.
Saves two characters \o/
TEXT
)

produces as output:

#Oh hi, I just thought I’d golf a solution. I’m sure other people can
do a much better job than I making a full hexdumping suite, so I just
had some fun. Can’t seem to get it lower than 78 characters,
unfortunately.

i=0;$<.read.scan(/.{0,16}/m){puts"%08x "%i+$&.unpack(‘H4’*8).join(’
');i+=16}

#Expanded and parenthesified, clarified:

i = 0
ARGF.read.scan(/.{0,16}/m) {
puts(("%08x " % i) + $&.unpack(‘H4’*8).join(’ '))
i += 16
}

#ARGF (aliased as $<) is the file handle of all file names given in
the arguments concatenated, STDIN if exactly what we need. The regex
to scan matches between 0 and 16 characters (including newline)
greedily. Change it to 1,16 if you don’t want the empty line at the
end.none

#Instead of letting the block to scan take an argument, I used a trick
I picked up from the last Ruby Q. I participated in (Obfuscated
Email), and use $& inside the block, which is the last regex match.
Saves two characters o/

The code:

require ‘stringio’
Struct.new ‘Line’, :data, :code
class CodeExtractor
attr_reader :lines, :output

def initialize(text)
@output = []
@lines = text.split(/[\r\n]/)
end

def extract
while lines.any?
process lines
end
end

def valid_syntax?(code)
io = StringIO.new
original_err, $stderr= $stderr, io
eval(“BEGIN {return true}\n#{code}”)
raise ‘here’
rescue Exception
false
ensure
$stderr = original_err
return false if io.length > 0
end

#returns the maximum number of lines (contiguous from the start)
that are valid ruby
def valid_code_prefix_length lines
max_valid_lines = 0
code = “”
lines.each_with_index do |line, index|
code << line
code << “\n”
if valid_syntax? code
max_valid_lines = index + 1
end
end
return max_valid_lines
end

def append_output(line, code)
@output << Struct::Line.new(line, code)
end

def process lines
if (prefix_length = valid_code_prefix_length lines) > 0
prefix_length.times { append_output lines.shift, true }
else
append_output lines.shift, false
end
end

def format_output
last_line = nil
@output.each do |line|
if line.data =~ /^\s*$/
puts “”
next
end
if last_line && last_line.code && !line.code #transition from
code to comment
puts “-------”
end
puts “#{line.code ? ‘’:’#’}#{line.data}”
last_line = line
end
end

def self.extract(text)
c= CodeExtractor.new text
c.extract
c.format_output
nil
end
end

mansfiem · August 7, 2008, 6:36pm

Summary for the Code Heuristics quiz delayed; should be up tomorrow.

mansfiem · August 3, 2008, 5:53pm

Here is my attempt.
I made the huge assumption that most code adheres to some indentation
style. Based on that all I really had to worry about was what lines
would usually not be indented (those that start a new scope).
I also changed the prefixing of the output since lines with # might
actually be part of the code as comments.
After dealing with this in several submissions I decided to allow for
text from a copied email to be removed (usually prefixed with "> ") so
that only the submitters comments were included.
With the provided example…

172 $ ruby rubyish.rb < test.eml
[TEXT] I may have missed it, but I didn’t see this solution:

[CODE] require ‘enumerator’

[CODE] “Hello world!\n”.enum_for(:each_byte).inject($stdout) { |res, b|

[CODE] res << b.chr

[CODE] }

[TEXT] Which in 1.9 looks much better:

[CODE] “Hello world!\n”.each_char.inject($stdout) { |res, c|

[CODE] res << c

[CODE] }

[TEXT] Stefano

rubyish.rb:

#!/usr/bin/env ruby

KEYWORDS = %{
begin do end while for in if else break redo next loop retry ensure
rescue case when exit unless and or not class module new def raise
puts print p
}

THRESHOLD = 0.60

def load_keywords(file)
return KEYWORDS unless file
keywords = []
begin
File.open(file).each do |line|
keywords << line.chomp.split(/\s+/)
end
rescue
keywords = KEYWORDS
end
keywords.flatten
end

Determine if the line looks as though it begins/ends a block of code

def block_statement?(text)

Assignment is assumed to be code

This can lead to several false positives…keep?

return true if text =~ /[^=]+=[^=]/

case text
when /^(def|class|module)/
true
when /^(begin|do|while|ARG.[[.]|for|case|if|loop|puts|print|p[( ])/
true
when /({|do) (|.?|)?\s$/
true
when /^(end|})/
true
else
false
end
end

Assume that symbols are language items. And words that match

a set of keywords are language items. Everything else is text

Whitespace is not considered

Returns [# language tokens, # total tokens]

def separate(text,keywords)
words = text.scan(/\w+/)
s = words.join
symbols = (text.split(//) - s.split(//)).delete_if { |x| x =~ /\s+/ }

special = symbols.size
total = words.size + special

words.each { |w| special += 1 if keywords.include?(w) }

[special,total]
end

def usage
$stderr.puts “USAGE: #{File.basename($0)} [options] [-f ]”
$stderr.puts “-”*40
$stderr.puts " -c "
$stderr.puts " When text is copied from a previous email and
should be"
$stderr.puts " removed, prefix is the start of the copied lines
(ex. -c '> ')"
$stderr.puts " -s [name]"
$stderr.puts " Split comment/code pairs into separate files"
$stderr.puts " If provided, files are named name.N (Default:
submission)"
$stderr.puts " -t "
$stderr.puts " Threshold for assuming a line is code vs.
comment."
$stderr.puts " 0.0 - 1.0 : percent of symbols that need to be
code-like (Default: 0.60)"
$stderr.puts " -k "
$stderr.puts " File containing the whitespace-separated list of
keywords"
$stderr.puts " to use in code-like matching"
$stderr.puts " -p [on|off]"
$stderr.puts " If on, prefix code output with [CODE] and comment"
$stderr.puts " output with [TEXT]. (Default: on)"
exit 42
end

def parse_options
opts = {:prefix => true}
ARGV.join.split(’-’).delete_if { |x| x == “” }.each do |arg|
case arg
when /^c(.+)/
opts[:copied] = $1.strip
when /^s(.+)/
opts[:outfile] = $1.strip
when /^t(.)/
t = $1.strip.to_f
raise “Invalid Threshold #{$1}” if t < 0.0 || t > 1.0
opts[:threshold] = t
when /^k(.+)/
opts[:keyfile] = $1.strip
when /^f(.+)/
opts[:input] = $1.strip
when /^p(.+)/
opts[:prefix] = $1 == “on”
else
usage
end
end
opts
end

Begin execution

opts = parse_options
input = opts[:input] ? File.open(opts[:input]) : $stdin
keywords = load_keywords(opts[:keywords])
threshold = opts[:threshold] || THRESHOLD

Initial classification

classified = []
input.each do |line|

ignore all lines copied from previous emails

next if opts[:copied] && line =~ /^#{opts[:copied]}/
case line
when /^\s*$/
classified << [:blank, line]
when /^\s*#/, /^\s+/, /^require/
classified << [:code, line]
else
classified << [:text, line]
end
end

Make educated guesses based on content of remaining lines

estimated = []
classified.each do |type,line|
case type
when :code, :blank
estimated << [type, line]
when :text
if block_statement?(line)
estimated << [:code, line]
else
# Compare words to ‘guessed’ language characters
special,total = separate(line, keywords)
if special.to_f/total.to_f >= threshold
estimated << [:code,line]
else
estimated << [:text,line]
end
end
end
end

Assume that one line of code surrounded by two non-code lines

is just an example and not part of the actual submission

size = estimated.size
(0…size).each do |i|
next if i < 2
next if i > size - 3
a,,b,,type,line,y,,z = estimated.slice((i-2)…(i+2))
next unless type == :code
next if [a,b,c,d].include? :code
estimated[i] = [:text, line]
end

Output modified submission

n, last, out = 0, nil, $stdout
if opts[:outfile]
file = “%s.%d” % [opts[:outfile], n]
out = File.open(file, ‘w’)
end

estimated.each do |type,line|
case type
when :blank
out.puts
next
when :text
if last == :code && out != $stdout
out.close
file = “%s.%d” % [opts[:outfile], n += 1]
out = File.open(file, ‘w’)
end
prefix = opts[:prefix] ? " [TEXT] " : “”
out.puts prefix + line
when :code
prefix = opts[:prefix] ? " [CODE] " : “”
out.puts prefix + line
end
last = type
end
out.close

mansfiem · August 9, 2008, 1:54am

On Aug 7, 11:34 am, Matthew M. [email protected] wrote:

Summary for the Code Heuristics quiz delayed; should be up tomorrow.

Apologies, things have come up that have prevented me from finishing
the summary. I’ll get it done this weekend. Needless to say, there
won’t be a new quiz this week.