Statistician I (#167)

mansfiem · July 3, 2008, 7:23pm

The heart of this problem, as suggested in the quiz description, is
pattern
matching. Essentially, we want to turn user-created rules into Ruby
regular
expressions, or at least some other method for comparing data against
those
rules.

I’ll come back to that in a minute. If we assume, for the moment, that
we
have the pattern matching in place, the rest of the code is pretty
trivial.
Read and parse the rules; then, read and parse the data according to
those
rules. Most of the submissions were pretty similar in this respect.
Here’s
the main loop from the solution of Benjamin Billian, as an example.

rules = [] # Set of Rules

# get Rules
File.open ARGV[0], 'r' do |file|
  file.each_line {|line| rules << create_rule(line)}
end

All lines of the first file (containing the rules) are read and passed
through the function create_rule. Those rules are then stored in an
array
to be used in the next section.

unmatched = [] # Array of unmatched lines

# get Data
File.open ARGV[1], 'r' do |file|
  file.each_line do |line|
    matched = false
    rules.each_with_index do |rule, i|
      if (match = rule.match(line)) != nil
        l = match.length
        a = match[1...l]
        a.delete nil
        puts "Rule #{i}: #{a.join ', '}"
        matched = true
        break
      end
    end
    unless matched
      unmatched << line
      puts '# No Match'
    end
  end
end

Now, all lines of the second file (containing the data) are read and
compared against each of the rules. When a match is found, the field
data is
output along with the rule’s index. If no match is found, the input line
is
stored in the unmatched array, to be used at the end:

puts '-----'
puts 'Unmatched input:'
unmatched.each { |line| puts line }

That completes the quiz’s specification. There are a couple minor
revisions
I’d make to Benjamin’s inner loop, my revision shown here:

    rules.each_with_index do |rule, i|
      if match = rule.match(line)
        a = match.captures
        a.delete nil
        puts "Rule #{i}: #{a.join ', '}"
        matched = true
        break
      end
    end

First, the match != nil test is redundant, since nil is false in
that
context; removing the comparison removes clutter. Second, the
MatchData
method captures gets the values wanted in a single method call.
Another
alternative would have been to use [1..-1] to avoid having to get the
match length.

Finally, one alternative to using each_with_index (which requires that
you
keep some sort of tracking variable to know if a match was found) is
using
Enumerable#find, as shown Matthew M.'s submission.

Now, back to pattern matching. While I commend the efforts of those who
wrote rule parsers, I would also recommend to the same that they learn
to
use regular expressions, or at least review their knowledge. The rule
format
as specified by the quiz is intentionally simple, and takes little
effort to
create a regular expression that is compact and almost certainly more
efficient than parsing by hand.

As a first example of how to create an appropriate regular expression
from
an input rule, let’s look again at Benjamin’s solution, his
create_rule
method:

create_rule(str)
  str.gsub! '.', '\.'
  str.gsub! '[', '(?:'  # change [text] into (?:text)?
  str.gsub! ']', ')?'
  str.gsub! /<.+?>/, '(.+?)'  # change <text> into (.+?)
  Regexp.new "^#{str}$"
end

Each substitution here turns some portion of the rule into a Regexp.
First, periods are escaped, since they have special meaning in a
Regexp,
while we want a literal period.

Next, as indicated in the comment, square brackets surrounding text are
changed to (?:text)?. This is a regular expression group that is
optional (from the trailing ?) and non-matching (from the ?:
present
after the opening parenthesis). Using a non-matching group allows us to
operate on the group as a whole (e.g. making it optional) and avoids
remembering the content of that group.

Benjamin’s next substitution changes <text> to (.+?), a
non-optional,
matching group. These are the field values we want to remember later.
The
.+? mechanism matches a string of any characters with length of at
least
one. The more familiar .+ does the same thing, but greedily, and would
fail for any rule with more than one field (since the first field’s <
would match the last field’s >). The question-mark of .+? turns this
pattern into a non-greedy match.

Finally, the new string, after these substitutions, is prefixed with ^
and
suffixed with $ to indicate the beginning and end of the line, and a
new
Regexp object is created.

Now, Benjamin’s solution for creating regular expressions is a start,
but
incomplete. It will work with the sample rules and data provided, but in
order to make this a more general solution, we need to consider other
special characters. What if a rule contained a question-mark, asterisk,
or
any other character special to regular expressions? We’d have to escape
each
of those. Likewise, we don’t want to escape square brackets (i.e. a
special character both for our rules and for Ruby regular expressions),
but
instead transform them into a completely different pattern.

The answer to this problem can be found in a couple of submissions,
particularly those of Jesus Gabriel and Sandro P.. Both made
use
of the method Regexp.escape which does exactly the sort of escaping
that
Benjamin accomplishes with his first gsub! above. The difference
between
Jesus’s and Sandro’s solutions, then, is in the handling of square
brackets.

Jesus deals with square brackets by replacing them with repeated
underscore
characters (i.e. some uncommon text), calling Regexp.escape, then
changing
the underscores back to brackets. While easy, I’m not fond of this
particular approach in general situations, since as uncommon as the
temporary text might be, it could show up in some user’s data set.

Sandro takes a different approach, allowing the square brackets to be
escaped and dealing with the results.

Regexp.escape(r.chomp).gsub("\\[", "[").gsub("\\]", "]")

(Sandro’s solution actually had an extra \, which does work but is
unnecessary, since strings are being passed to gsub here, not regular
expressions). The benefit of this technique is that it cannot be fooled
by
uncommon or rare input from the user’s rules or data set.

Sandro’s solution also contains other calls to substitute appropriate
regular expression groups for the square and angle brackets. This, I
believe, is the best solution for the transformation of our statistician
rules into Ruby regular expressions.

Tomorrow, we will continue developing our Statistician, delving into
modules
and meta.

mansfiem · July 3, 2008, 9:40pm

I don’t like being nit-picking but (there is always a “but” :-):

The benefit of this technique is that it cannot be fooled by
uncommon or rare input from the user’s rules or data set.

With the exception of something like:

Foo [bar] foo.

The gsubs will turn such a rule into: Foo [bar] foo.

mansfiem · July 3, 2008, 9:55pm

Foo [bar] foo.

The gsubs will turn such a rule into: Foo [bar] foo.

Okay … well, uhm … rewind … undo e-mail … I assumed escape
rules that are specified in the quiz. So it’s okay if you don’t want
to match text containing brackets.

Good night.

mansfiem · July 3, 2008, 10:31pm

On Jul 3, 2:51 pm, ThoML [email protected] wrote:

Foo [bar] foo.

The gsubs will turn such a rule into: Foo [bar] foo.

Okay … well, uhm … rewind … undo e-mail … I assumed escape
rules that are specified in the quiz. So it’s okay if you don’t want
to match text containing brackets.

Basically, the quiz specification isn’t too specific about these sorts
of things. The basic assumption I made (which I should have stated
out) was that the data input would not contain square/angle brackets.
If the specification was complete (i.e. what to do about matching
literal brackets), the code might have to be changed accordingly.

In either case, it would still be the situation where the escaped
string cannot be confused, whereas the modify-escape-modify string
could still suffer from confusion, if the temporary placeholder is
possible as input data.