tfpt review "/shelveset:parseroptfinal;REDMOND\tomat"
Improves Ruby tokenizer and parser performance by 30% in total (previously
3 MB/s, now 4.2 MB/s, measured on real code of the libraries we have
checked in, including Ruby AST building but not the subsequent AST
transformation).
Adds a /bm option to the tokenizer test driver that measures parsing speed
(IronRuby.Tests.exe /tokenizer /bm).
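The quoted throughput numbers boil down to bytes processed per elapsed
second. A minimal Java sketch of that computation (the helper name is
illustrative; the real driver is the C# IronRuby.Tests.exe):

```java
class TokenizerBenchmark {
    // Computes throughput in MB/s from bytes processed and elapsed
    // nanoseconds. Hypothetical helper, not the actual driver code.
    public static double megabytesPerSecond(long bytes, long elapsedNanos) {
        double seconds = elapsedNanos / 1_000_000_000.0;
        return (bytes / (1024.0 * 1024.0)) / seconds;
    }
}
```

For example, 3 MB tokenized in one second yields 3.0 MB/s.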
Adds more tokenization unit tests and fixes several minor bugs in the
AST, Parser.cs, and parser.y:
Replaces List<Expression> with a new class, Statements, wherever the
list represents statements. This encapsulation allows us to change the
underlying representation without affecting code that uses the
statements. Also refactors the statement-list rules so that we can use
an empty Statements instance.
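The encapsulation idea can be sketched in Java (the real class is C#;
the member names here are assumptions, not the actual API):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of the Statements wrapper: callers see only this surface, so
// the backing representation (currently an ArrayList) can change freely
// without touching any consuming code.
class Statements {
    // Object stands in for the Expression node type.
    private final List<Object> statements = new ArrayList<>();

    public void add(Object statement) { statements.add(statement); }
    public int count() { return statements.size(); }
    public List<Object> asReadOnlyList() {
        return Collections.unmodifiableList(statements);
    }
}
```

An empty Statements instance is just `new Statements()`, which is what
the refactored grammar rules can hand out for empty statement lists.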
Replaces List<Expression> with a plain Expression array wherever it
represents arguments. Arguments are the most frequent kind of expression
list, so it makes sense to optimize them. Previously, every argument
list was constructed by starting from an empty List and adding arguments
as the parser reduced them. This is quite inefficient, since most method
calls have very few arguments (0 to 3). The parser now has a stack of
Expression onto which it pushes arguments as they are reduced. Each rule
that defines an argument list counts the number of arguments in the
token value that is propagated throughout the list reduction. The rule
that consumes the argument list then simply pops the right number of
arguments off the stack and copies them into an exactly sized array.
This could be optimized further if necessary: we could specialize the
Argument node for 0 to 3 arguments, or just remember a reference to the
argument stack and an index into the array.
We could use this approach for other lists as well (maybe for
statements, string concatenations, etc.) in the future.
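A minimal Java sketch of the scheme described above (the actual parser
is C#; ExpressionStack and its members are illustrative names):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: arguments are pushed onto one shared stack as rules reduce
// them; the count travels in the token value during list reduction, and
// the rule that consumes the argument list pops exactly that many
// entries into a right-sized array. No per-list List allocation.
class ExpressionStack {
    // Object stands in for the Expression node type.
    private final List<Object> stack = new ArrayList<>();

    public void push(Object expression) { stack.add(expression); }

    // Pops `count` arguments, returning them in source order.
    public Object[] popArguments(int count) {
        Object[] result = new Object[count];
        for (int i = count - 1; i >= 0; i--) {
            result[i] = stack.remove(stack.size() - 1);
        }
        return result;
    }
}
```

Because popping fills the array back to front, the arguments come out in
the order they were reduced, and nested argument lists work naturally:
inner lists are popped before the outer rule consumes its own entries.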
Provides dummy implementations for features that are not implemented
yet: defined?(super), BEGIN, and END. This avoids throwing
not-implemented exceptions, so that we can measure the performance of
AST transformations over the entire library code base.
Removes the Parser <: ShiftReduceParser inheritance. It was meant to
enable sharing the generic GPPG parser driver and deriving specific
parsers from it, but since we don't share the class it makes no sense.
Still, GPPG.cs is a mostly Ruby-independent piece of code that could
easily be copied and reused elsewhere.
Merges the 3 stacks (state, value, and location) into a single data
structure, reducing the cost of push and pop.
Improves the reduction of empty rules and removes unnecessary array
lookups on the hot path (these changes required some adjustments in
GPPG).
Reduces the size of the structure by using an explicit layout
definition, which reduces the amount of data moved around during
reductions. It might be possible in the future to get rid of the obj2
field (it's used only by the argument rules).
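The merged stack can be sketched in Java as a single growable array of
entries, each bundling state, value, and location, so a reduction moves
one top index instead of three. Note the caveats: the explicit-layout
trick is C#-specific (StructLayout/FieldOffset on a value type) and has
no Java equivalent, and in C# the value-type entries live inline in the
array, whereas this Java approximation allocates an object per push and
only shows the shape of the design:

```java
// Sketch: one array of entries replaces three parallel stacks, so a
// push or pop touches a single structure and a single top index.
class ParserStack {
    static final class Entry {
        int state;
        Object value;   // semantic value (obj2-style extra fields omitted)
        int location;   // e.g. an encoded source span
    }

    private Entry[] entries = new Entry[16];
    private int top = 0;

    public void push(int state, Object value, int location) {
        if (top == entries.length) {
            Entry[] bigger = new Entry[entries.length * 2];
            System.arraycopy(entries, 0, bigger, 0, entries.length);
            entries = bigger;
        }
        Entry e = new Entry();
        e.state = state;
        e.value = value;
        e.location = location;
        entries[top++] = e;
    }

    public Entry pop() { return entries[--top]; }
    public int depth() { return top; }
}
```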
Removes the tokenizer's direct dependency on the parser - we only need a
resolver for local variables. Introduces ILexicalVariableResolver
(implemented by Parser) to provide that information. This allows tests
to run the tokenizer with a mocked parser.
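A Java sketch of that interface seam (the real code is C#; the method
name is an assumption, not the actual signature):

```java
import java.util.Set;

// Sketch: the tokenizer only asks one question of the parser, so it
// depends on this narrow interface instead of the whole Parser class.
interface ILexicalVariableResolver {
    boolean isLocalVariable(String identifier);
}

// Tests can supply a trivial mock instead of constructing a real parser.
class MockVariableResolver implements ILexicalVariableResolver {
    private final Set<String> locals;

    MockVariableResolver(Set<String> locals) { this.locals = locals; }

    public boolean isLocalVariable(String identifier) {
        return locals.contains(identifier);
    }
}
```

This matters in Ruby tokenization because the meaning of some tokens
depends on whether an identifier is a known local variable, which is
lexical state only the parser tracks.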
Replaces the nextc(), peekc(), and pushback() methods with Read, Peek,
Back, and SeekRelative, which do not implicitly normalize ends of lines.
This allows us to drop the yytext StringBuilder that was used to track
normalized line endings. It was the culprit of very bad performance:
essentially the entire tokenized source code eventually got loaded into
this StringBuilder one character at a time :-/. The tokenizer now needs
to handle end-of-line normalization explicitly, which required a lot of
small tweaks.
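A Java sketch of the new primitives, plus one example of the caller
doing end-of-line normalization explicitly (method names follow the
text; the bodies are assumptions about the C# implementation):

```java
// Sketch: raw positioned reads with no hidden eoln normalization; the
// tokenizer folds "\r\n" itself, only where it actually matters.
class SourceReader {
    private final String source;
    private int position = 0;

    SourceReader(String source) { this.source = source; }

    public int read() {
        return position < source.length() ? source.charAt(position++) : -1;
    }

    public int peek() {
        return position < source.length() ? source.charAt(position) : -1;
    }

    public void back() { position--; }

    public void seekRelative(int delta) { position += delta; }

    // Explicit normalization by the caller: fold "\r\n" into '\n'.
    public int readNormalizedEoln() {
        int c = read();
        if (c == '\r' && peek() == '\n') {
            return read();
        }
        return c;
    }
}
```

Because normalization is now opt-in per call site, no intermediate
buffer of normalized text needs to be accumulated at all.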
Reimplements line buffer loading - the previous implementation allocated
two char arrays per line, which is of course a terrible waste. The
buffer is now shared across all lines of code, except for heredocs,
which need to be handled in a special way.
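The reuse idea, sketched in Java (the real C# buffer management is more
involved; this only shows one buffer being refilled per line instead of
fresh allocations per line):

```java
// Sketch: a single char buffer reused for every line; it only grows, so
// steady-state tokenization allocates no per-line arrays.
class LineBuffer {
    private char[] buffer = new char[256];
    private int length = 0;

    // Copies the line starting at `start` into the shared buffer
    // (including the trailing '\n', if any) and returns the index just
    // past it, or -1 when `start` is already at the end of the source.
    public int fill(CharSequence source, int start) {
        if (start >= source.length()) return -1;
        length = 0;
        int i = start;
        while (i < source.length()) {
            char c = source.charAt(i++);
            if (length == buffer.length) {
                char[] bigger = new char[buffer.length * 2];
                System.arraycopy(buffer, 0, bigger, 0, length);
                buffer = bigger;
            }
            buffer[length++] = c;
            if (c == '\n') break;
        }
        return i;
    }

    public String currentLine() { return new String(buffer, 0, length); }
}
```

Heredocs are the exception because their body must survive past the line
the tokenizer is currently on, so they cannot live in a shared buffer
that the next fill overwrites.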
The tokenizer is now finally in good shape, although there are more
small improvements that could be made.