I have a large file that I need to tokenize. The method I am using now
is fast, but eats up a ton of memory by reading in the entire file first
as a String. I would also like to reuse existing tokens for duplicates.
(I have no control over the file format, but this Regex works well for
what I need.)
Here is what I am doing today.
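Something along these lines (a sketch of the slurp-and-scan approach described above; the method name is illustrative):

```ruby
# Sketch (assumed from the description above): read the whole file into
# one String, scan it, and reuse an existing token object whenever a
# duplicate turns up.
def tokenize_in_memory(filename)
  tokens = Array.new
  File.read(filename).scan(/'[^']*'|"[^"]*"|[(:)]|[^(:)\s]+/) do |token|
    # Array#index returns nil when the token hasn't been seen before
    tokens << ((i = tokens.index(token)) ? tokens[i] : token)
  end
  tokens
end
```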
And here is what I would like to do.
File.open(filename) do |fh|
  fh.scan(/'[^']*'|"[^"]*"|[(:)]|[^(:)\s]+/) do |token|
    tokens << ((i = tokens.index(token)) ? tokens[i] : token)
  end
end
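As an aside, Array#index is a linear search, so that duplicate lookup rescans the whole token list for every new token. A Hash makes the reuse check roughly constant time; a sketch (the method name is illustrative):

```ruby
# Sketch: dedup with a Hash instead of Array#index. The first sighting
# of a token stores it; later sightings reuse the stored object.
def tokenize_with_hash(text, pattern)
  seen   = Hash.new
  tokens = Array.new
  text.scan(pattern) do |token|
    tokens << (seen[token] ||= token)
  end
  tokens
end
```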
So what I would like to have is a scan method for File objects that
yields the tokens when called with a block, instead of returning an
array. (It would be nice if String#scan could do this as well.) This
isn't a big issue, it just causes my machine to overflow to the swap
file periodically. I could easily fix that with a couple DIMMs, but I
can't help thinking that there should be a better way.
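For the record, a line-based version of such a File#scan is easy to sketch, assuming no token ever spans a line break (a quoted token could, in theory, so this is only a sketch):

```ruby
# Sketch: a File#scan that yields each token as it is found, reading
# one line at a time so the whole file never sits in memory at once.
# Assumes no token (e.g. a quoted string) spans a line break.
class File
  def scan(pattern)
    each_line do |line|
      line.scan(pattern) { |token| yield token }
    end
  end
end
```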