I need to count the number of occurrences of each word in a large text file (>5GB).
I was thinking of creating a hash map and updating the count for each word as I traverse the file.
I can’t read the whole file at once because it would exhaust my memory, so I thought of reading it in chunks. The only approaches I could find read a text file line by line, as in this link:
However, I couldn’t find a way to read chunks of the file in case it doesn’t have line breaks (one large row). How can I, for example, read chunks of 1000 words in each read?
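For reference, here is a minimal sketch of the line-by-line counting I have in mind (the path and helper name are just placeholders); it obviously breaks down if the file has no line breaks:

```ruby
# Count word occurrences line by line.
# Only safe when the file actually contains newlines.
def count_words(path)
  counts = Hash.new(0)                 # default count of 0 for unseen words
  File.foreach(path) do |line|         # streams one line at a time
    line.split.each { |word| counts[word] += 1 }
  end
  counts
end
```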
I don’t have access to a file that big to test with, but I would say: don’t use the IO#readlines or IO#read methods, because they read the whole file and keep its contents in RAM. Use IO.foreach for maximum efficiency…
Here’s a problem I faced:
So basically you just use the block as the evaluator and add items to an array based on your needs.
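If I read the “evaluator” remark right, it means the block you pass to IO.foreach; a rough sketch of collecting a batch of words that way (the filename, helper name, and batch size are made up):

```ruby
# Collect the first n words of a file without loading it all into RAM.
def first_n_words(path, n)
  words = []
  IO.foreach(path) do |line|     # yields one line at a time
    words.concat(line.split)
    break if words.size >= n     # stop once the batch is full
  end
  words.first(n)
end
```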
The hard part will be determining what constitutes a word and what constitutes a word delimiter. Regular expressions should help here if you can determine what kind of ‘text’ file you are working with.
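For example, if a “word” is taken to mean a run of letters, a regular expression like this could do the splitting (the exact pattern depends on your definition of a word):

```ruby
# Extract words as runs of letters; punctuation, digits, and whitespace
# all act as delimiters.
def tokenize(text)
  text.scan(/[[:alpha:]]+/)
end
```

Note that this splits “it’s” into two tokens; a pattern such as `/[[:alpha:]']+/` would keep the apostrophe inside the word.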
First, thanks for the response!
Let’s say that a word is anything between two spaces, disregarding punctuation (!, (, ), ?, and so forth).
I don’t think I will handle the case of apostrophes within words (for example, “it’s”), because that complicates the parsing too much.
I see what you are getting at now. You are trying to account for a huge text file that may be one large line (no newlines).
Well, Ruby allows the programmer to choose what the input record separator character should be… I’d choose a space and read word after word. This sounds inefficient, but it really isn’t as bad as it sounds, because most operating systems read in a whole page (4–8 KB) of data whether you ask for a single character, a word, or a line.
The input record separator (newline by default). gets, readline, etc., take their input record separator as an optional argument.
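Putting that together, here is a sketch of reading word by word with a space as the record separator, stripping punctuation per the definition above (the helper name is mine):

```ruby
# Stream a file word by word using " " as the input record separator,
# so only one token is held in memory at a time.
def count_words_by_separator(path)
  counts = Hash.new(0)
  File.open(path) do |f|
    f.each(" ") do |token|                       # IO#each accepts a separator argument
      word = token.strip.gsub(/[[:punct:]]/, "") # drop punctuation such as !()?
      counts[word] += 1 unless word.empty?
    end
  end
  counts
end
```

This works even when the whole file is a single giant line, since it never asks for more than one space-delimited token at a time.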