Count word occurrences in a very large text file

Hi,
I need to count the number of occurrences of each word in a large text file (>5GB).
I was thinking about creating a HashMap and while traversing the file I will update the count for each word as I go along.
I can’t read the whole file at once because it will bog down my memory, so I thought of reading it in chunks. I could only find ways of reading a text file line-by-line in this link:

However, I couldn’t find a way to read chunks of the file in case it doesn’t have line breaks (one large row). How can I, for example, read chunks of 1000 words in each read?

Thanks,

I don’t have access to that big file to test. But I would say don’t use the IO.readlines or IO.read methods, because they read the whole file and keep its contents in RAM. Use IO.foreach, which reads one line at a time, for maximum efficiency…
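A minimal sketch of that approach, assuming the file does contain line breaks. The helper name `count_words_by_line` is made up for illustration, and the temporary file only stands in for the real data:

```ruby
require 'tempfile'

# Stream the file one line at a time; only the current line and the
# counts hash are held in memory.
def count_words_by_line(path)
  counts = Hash.new(0)
  File.foreach(path) do |line|
    # String#split with no argument splits on runs of whitespace.
    line.split.each { |word| counts[word] += 1 }
  end
  counts
end

# Small demonstration on a temporary file.
Tempfile.create('sample') do |f|
  f.write("the cat\nthe dog\n")
  f.flush
  p count_words_by_line(f.path)
end
```

Memory use then grows with the number of distinct words rather than with the size of the file.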

Here’s a problem I faced:

So basically you just use the evaluator and add items to an array based on your need.

The hard part will be determining what constitutes a word and what constitutes a word delimiter. Regular expressions should help here if you can determine what kind of ‘text’ file you are working with.

First, thanks for the response!
Let’s say that a word is anything between two spaces, disregarding punctuation (!()?, and so forth).
I don’t think I will handle words that contain apostrophes (for example “it’s”) as a special case, because that complicates the parsing too much.
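As a rough sketch of that definition: the `normalize` and `words_in` helpers are hypothetical names, and lowercasing is an extra assumption on my part so that "The" and "the" are counted together. Apostrophes are simply dropped like any other punctuation, so "it's" becomes "its".

```ruby
# Turn a raw space-delimited token into a word: lowercase it and drop
# every character that is not an ASCII letter or digit.
def normalize(token)
  token.downcase.gsub(/[^a-z0-9]/, '')
end

# Split text on whitespace and keep only the tokens that still contain
# something after punctuation has been removed.
def words_in(text)
  text.split(' ').map { |t| normalize(t) }.reject(&:empty?)
end

p words_in('Hello, world! (Is it working?)')
# => ["hello", "world", "is", "it", "working"]
```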

The big question is -> What do you mean by a text file… Do you mean ASCII text? Do you mean UTF-8?

It should support parsing English only, let’s say articles from the NYT concatenated together and put in one text file (but with no line breaks).

Do you have a link to some data that could represent the larger set or do you have some mechanism to generate that text file?

I generated a 1.2 GiB text file by repeating the contents of /usr/share/dict/words 1200 times. But I think I created the wrong kind of file.
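For a test file that matches the "one large row" case, something along these lines might do. This is only a sketch: `write_single_line_file` is a made-up helper, and the three-word list is a placeholder for a real word list or article text.

```ruby
require 'tempfile'

# Write `words` joined by single spaces, `repeats` times over,
# without ever emitting a newline.
def write_single_line_file(path, words, repeats)
  chunk = words.join(' ')
  File.open(path, 'w') do |out|
    repeats.times do |i|
      out.write(' ') unless i.zero? # separate repetitions with a space
      out.write(chunk)
    end
  end
end

# Demonstration with a tiny word list; raise `repeats` for a bigger file.
Tempfile.create('big') do |f|
  write_single_line_file(f.path, %w[apple banana cherry], 1000)
  puts File.size(f.path)
end
```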

If you can give us a small sample of your file, say the first 20 lines, it would be very helpful… Otherwise you could just go with IO.foreach, which will not use a huge amount of memory…

I don’t have a sample just yet but to generate a sample that will do, any article in English would be ok, removing the line breaks using this website:

and duplicate the text, for that matter, enough times until a large file size is achieved.
Something like this (just way larger):

I see what you are getting at now. You are trying to account for a huge text file that may be one large line (no newlines).

Well, Ruby lets the programmer choose the input record separator… I’d choose a space and read word after word. This sounds inefficient, but it really isn’t as bad as it sounds, because most operating systems read in a whole page of data (4k-8k) whether you ask for a single character, a word, or a line.
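A minimal sketch of reading word after word, assuming words are separated by spaces. Here the separator is passed to `File.foreach` as an argument rather than through the global, and `count_words` is a hypothetical helper name:

```ruby
require 'tempfile'

# Read the file one space-terminated record at a time and tally the words.
def count_words(path)
  counts = Hash.new(0)
  File.foreach(path, ' ') do |token|
    # Each record still carries its trailing space, and a stray newline
    # may hide inside it, so split on whitespace before counting.
    token.split.each { |word| counts[word] += 1 }
  end
  counts
end

# Demonstration on a temporary single-line file.
Tempfile.create('sample') do |f|
  f.write('the cat and the dog and the bird')
  f.flush
  p count_words(f.path)
end
```

Passing the separator explicitly keeps the change local to this one call, so other code that relies on newline-terminated records is unaffected.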

$/
The input record separator (newline by default). gets, readline, etc., take their
input record separator as optional argument.

Here’s an example using the input record separator($/).

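#! /usr/bin/env ruby

if __FILE__ == $0

  str = "This is the first"

  # Default separator is a newline, so the whole string is one record.
  p str.lines.map{|l| l.chomp}

  # Remember the current separator so it can be restored afterwards.
  orig = $/

  # With a space as the separator, each word becomes its own record.
  $/ = ' '

  p str.lines.map{|l| l.chomp}

  # Restore the default so later code is unaffected.
  $/ = orig

  p str.lines.map{|l| l.chomp}

end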
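To get back to the original question about reading 1000 words per read, the same separator idea can be wrapped so that it hands over words in batches. This is a sketch only: `each_word_chunk` is a made-up name, and it assumes words are separated by spaces.

```ruby
require 'tempfile'

# Yield arrays of up to `size` words from a space-separated file.
def each_word_chunk(path, size = 1000)
  chunk = []
  File.foreach(path, ' ') do |token|
    word = token.strip
    next if word.empty? # consecutive spaces produce empty records
    chunk << word
    if chunk.size == size
      yield chunk
      chunk = []
    end
  end
  yield chunk unless chunk.empty? # final, possibly shorter batch
end

# Demonstration with a batch size of 2 instead of 1000.
Tempfile.create('sample') do |f|
  f.write('one two three four five')
  f.flush
  each_word_chunk(f.path, 2) { |words| p words }
end
```

Only one batch of words is held in memory at a time, no matter how long the single line is.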