Hmm. Simplifying my test script further, I am not sure that Regexp is
the problem at all!
With the each_line block, my script takes more than TWICE as long in
1.9 as in 1.8.
Without the each_line block, but keeping the Regexp, it is 10%
FASTER in 1.9.
So unless there is some internal optimisation that occurs when the
block is removed, it looks like each_line is the problem, not Regexp???
500000.times do
end
$ ruby logreport3.rb
                        user     system      total        real
WITH each_line:     1.710000   0.000000   1.710000 (  1.717034)
WITHOUT each_line:  1.080000   0.000000   1.080000 (  1.077098)
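
The body of the test script did not survive in the post above, so what follows is only a sketch of the kind of comparison being described, not the original logreport3.rb. The sample log line, the pattern and the method names are all made up for illustration; only the 500000 iterations and the two report labels come from the thread.

```ruby
require 'benchmark'

# Hypothetical stand-ins: the real log file and Regexp are not shown in the thread.
SAMPLE_LINE = '127.0.0.1 - - [10/Oct/2007:13:55:36] "GET /index.html HTTP/1.0" 200 2326'
PATTERN     = /"GET (\S+)/
ITERATIONS  = 500_000

# Match inside an each_line block, one line per iteration.
def with_each_line(text, pattern)
  hits = 0
  text.each_line { |line| hits += 1 if line =~ pattern }
  hits
end

# Same Regexp and same number of match attempts, but no each_line involved.
def without_each_line(line, pattern, count)
  hits = 0
  count.times { hits += 1 if line =~ pattern }
  hits
end

def run_benchmark(iterations = ITERATIONS)
  text = (SAMPLE_LINE + "\n") * iterations
  Benchmark.bm(20) do |bm|
    bm.report('WITH each_line:')    { with_each_line(text, PATTERN) }
    bm.report('WITHOUT each_line:') { without_each_line(SAMPLE_LINE, PATTERN, iterations) }
  end
end

run_benchmark if __FILE__ == $PROGRAM_NAME
```

Running the same file under both interpreters and comparing the two rows should show whether the slowdown follows each_line or the Regexp. Timings will of course depend on the machine and the exact builds.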
D> So unless there is some internal optimisation that occurs when the
D> block is removed, it looks like each_line is the problem, not
D> Regexp???
Well, here is part of #each_line in 1.8.6:
    for (s = p, p += rslen; p < pend; p++) {
        if (rslen == 0 && *p == '\n') {
            if (*++p != '\n') continue;
            while (*p == '\n') p++;
        }
Easy: increment p and test a single byte.
The same part in 1.9:
    while (p < pend) {
        int c = rb_enc_codepoint(p, pend, enc);
        int n = rb_enc_codelen(c, enc);
        if (rslen == 0 && c == newline) {
            while (p < pend && rb_enc_codepoint(p, pend, enc) ==
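
To make the difference between the two excerpts concrete: the 1.8.6 loop advances one byte at a time and compares that byte, while the 1.9 loop decodes a codepoint on every step before comparing. A rough analogy in Ruby 1.9 terms follows. It is not the C implementation, and the method names are invented for this sketch.

```ruby
# Byte-wise scan, in the spirit of the 1.8.6 loop:
# look at each raw byte and compare it with the newline byte.
def count_newlines_bytewise(str)
  count = 0
  str.each_byte { |b| count += 1 if b == 10 } # 10 is the byte value of "\n"
  count
end

# Character-wise scan, in the spirit of the 1.9 loop:
# each character is decoded according to the string's encoding first.
def count_newlines_charwise(str)
  count = 0
  str.each_char { |c| count += 1 if c == "\n" }
  count
end

text = "caf\u00e9\nna\u00efve\n" # UTF-8 text with two multibyte characters

bytewise = count_newlines_bytewise(text)
charwise = count_newlines_charwise(text)
```

For UTF-8 both scans find the same newlines, because the byte 0x0A never occurs inside a multibyte sequence. The decoding step is what makes the scan correct for encodings where that does not hold, such as UTF-16, and it is also the extra work done on every step.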
D> Hmm. Simplifying my test script further, I am not sure that Regexp is
D> the problem at all!
D> With the each_line block, my script takes more than TWICE as long in
D> 1.9 as in 1.8.
D> Without the each_line block, but keeping the Regexp, it is 10%
D> FASTER in 1.9.
Oops, it seems you're right: just split the original logfile into lines
and use each instead of each_line, and it gets a whole lot faster
(rb_str_each_line is encoding-aware). Anyway, that doesn't change the
fact that Oniguruma might be optimized here as well.
lopex
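
A small sketch of the split-then-each workaround suggested above, assuming the whole log fits in memory as one String. The log contents here are a made-up stand-in, and the variable names are only for illustration.

```ruby
# Hypothetical stand-in for the contents of the logfile.
log = "GET /a\nPOST /b\nGET /c\n"

# each_line yields every line with its trailing "\n" still attached.
via_each_line = []
log.each_line { |line| via_each_line << line }

# split("\n") cuts the String up front and drops the newlines;
# each then simply walks the resulting Array.
via_split = []
log.split("\n").each { |line| via_split << line }

# Once the newline is chomped, both approaches see the same lines.
same = via_each_line.map { |l| l.chomp } == via_split
```

One thing to watch when swapping one for the other: each_line keeps the line terminator and split drops it, so any code that relied on the trailing newline (or on a chomp) needs adjusting.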