@Mike
Thank you for providing the Gist link to a file.
(file_read_parse_2.rb · GitHub)
However, the changes don’t improve the performance once I take into
account what was removed, all of which I had in there on purpose. Take
note of the second item below (“Counting versus Using the Tokens”).
- Object structure
The modified code removed all of the class/object structure, which I
purposefully had in there to simulate this being an object within a
larger project.
That being said, whether the lines of code we’re discussing sit in a
class or in a plain script means nothing to the performance question. I
am simply writing the code in an OO style with classes, as opposed to
scripts, on purpose.
I was also purposefully making the Java and Ruby versions as similar to
each other as possible, so that the performance comparison could be done
with as little difference as possible in how the code is approached.
- Counting versus Using the Tokens
In the modified code, it is now just counting the tokens:
num += l.split.length
Obviously that is faster than what I had in the original code. Again,
however, the extra work in the original is there on purpose.
Counting the number of tokens in and of itself is not all that I was
doing in the original code or in the Java version. To simulate more
closely what actually occurs in a functional system I am:
- assigning the array of tokens to a variable
- iterating the tokens to do something with each of them
In this case I’m just assigning each token to another variable and then
performing the count.
In real-world use I’d perform some function on each token, put it
somewhere, whatever.
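
To make the difference concrete, here is a minimal sketch of the two variants side by side. The method names (count_only, count_and_use) and the sample lines are made up for illustration; this is not the exact code from either file, just the shape of what each one does.

```ruby
# Variant A: only count the tokens (what the modified code does).
def count_only(lines)
  num = 0
  lines.each { |l| num += l.split.length }
  num
end

# Variant B: keep the token array in a variable, then visit every token
# (what the original code is meant to simulate).
def count_and_use(lines)
  num = 0
  lines.each do |l|
    tokens = l.split        # assign the array of tokens to a variable
    tokens.each do |t|
      token = t             # stand-in for real work on each token
      num += 1
    end
  end
  num
end

# Hypothetical sample input, including an empty line and extra whitespace.
sample = ["the quick brown fox", "jumps over", "", "  the lazy dog  "]
puts count_only(sample)
puts count_and_use(sample)
```

Both methods return the same count for the same input; the second one simply does more work per token to get there.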
This change accounts for the difference in time from “7965.289 ms” to
“4821.399 ms” when I run the original code and the modified code.
So yes, the modified code is “faster”, but it’s not doing the same thing
as the original and therefore not a valid comparison.
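
For anyone who wants to reproduce that kind of gap, something along these lines should work. It is only a sketch: elapsed_ms and compare_variants are hypothetical helper names, the input is synthetic rather than my actual test file, and the numbers it prints will differ from the ones quoted above.

```ruby
# Hypothetical helper: wall-clock time of a block, in milliseconds.
def elapsed_ms
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  (Process.clock_gettime(Process::CLOCK_MONOTONIC) - start) * 1000.0
end

# Runs both variants over the same lines and returns counts and timings.
def compare_variants(lines)
  counted = 0
  count_ms = elapsed_ms do
    lines.each { |l| counted += l.split.length }
  end

  used = 0
  use_ms = elapsed_ms do
    lines.each do |l|
      tokens = l.split
      tokens.each do |t|
        token = t           # stand-in for real per-token work
        used += 1
      end
    end
  end

  { counted: counted, count_ms: count_ms, used: used, use_ms: use_ms }
end

# Synthetic input, since the real test file is not part of this post.
lines = Array.new(20_000) { "alpha beta gamma delta epsilon" }
result = compare_variants(lines)
printf("count only:     %.3f ms (%d tokens)\n", result[:count_ms], result[:counted])
printf("assign + visit: %.3f ms (%d tokens)\n", result[:use_ms], result[:used])
```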
What I gather from looking at your changes, therefore, is that there
really isn’t anything different for me to do in the code: I am in fact
using the proper API calls and techniques, and there is no special
faster approach that I’m missing.
For example, in Java there are 2 ways of doing this:
a) String.split - which uses REGEX and is much slower as it’s intended
for pattern matching, not simple tokenization
b) StringTokenizer - intended for tokenization on a delimiter instead of
REGEX and much faster
Therefore, I’m using option (b) in Java. I was curious whether I was
mistakenly using a slower technique in Ruby when in fact there was a
faster alternative.
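
For reference, these are the Ruby candidates I would compare when asking that question. It is a sketch only: the tokenizations method name is made up, and I am not claiming which form is fastest, just showing that they exist and do not all return the same thing.

```ruby
# Returns the result of three common ways to tokenize a line on whitespace.
def tokenizations(line)
  {
    split_default: line.split,          # no argument: runs of whitespace, leading/trailing ignored
    split_regex:   line.split(/\s+/),   # explicit regex: a leading empty string is kept
    scan_tokens:   line.scan(/\S+/)     # collect runs of non-whitespace characters
  }
end

tokenizations("  alpha beta\tgamma  ").each do |name, tokens|
  puts "#{name}: #{tokens.inspect}"
end
```

Note that the regex form keeps an empty first element when the line starts with whitespace, so it is not a drop-in replacement for the plain split when counting tokens.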