Count word occurrences in a very large text file

John3415 · September 19, 2019, 3:51pm

Hi,
I need to count the number of occurrences of each word in a large text file (>5GB).
I was thinking about creating a HashMap and while traversing the file I will update the count for each word as I go along.
I can’t read the whole file at once because it will bog down my memory, so I thought of reading it in chunks. I could only find ways of reading a text file line-by-line in this link:

However, I couldn’t find a way to read chunks of the file in case it doesn’t have line breaks (one large row). How can I, for example, read chunks of 1000 words in each read?

Thanks,

SouravGoswami · September 19, 2019, 4:07pm

I don’t have access to a file that big to test with. But I would say don’t use the IO#readlines or IO#read methods, because they read the whole file and keep its contents in RAM. Use IO.foreach for maximum efficiency…

Here’s a problem I faced:

So basically you just use the enumerator and add items to an array based on your need.
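
A minimal sketch of the IO.foreach suggestion, assuming the file does have line breaks and that a "word" is any run of non-whitespace characters. The method name count_words_by_line and the sample text are made up for illustration:

```ruby
require 'tempfile'

# Count words by streaming the file one line at a time with File.foreach.
# Assumption: a "word" is any run of non-whitespace characters.
def count_words_by_line(path)
  counts = Hash.new(0)            # missing keys start at 0
  File.foreach(path) do |line|    # only the current line is held in memory
    line.split.each { |word| counts[word] += 1 }
  end
  counts
end

# Small demo on a temporary file (hypothetical sample text).
Tempfile.create('sample') do |f|
  f.write("the cat sat\nthe dog sat\n")
  f.flush
  p count_words_by_line(f.path)
end
```

Note that this only bounds memory by the longest line, so it does not help if the whole file is a single line.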

G4143 · September 19, 2019, 7:03pm

The hard part will be determining what constitutes a word and what constitutes a word delimiter. Regular expressions should help here if you can determine what kind of ‘text’ file you are working with.

John3415 · September 20, 2019, 8:16am

First, thanks for the response!
Let’s say that a word is anything between two spaces, disregarding punctuation (!()?, and so forth).
I don’t think I will handle apostrophes inside a word (for example “it’s”) because that complicates the parsing too much.
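
One way this definition could be sketched in Ruby: split on whitespace, then strip punctuation from each token. The method name words_in is hypothetical, and two things here are assumptions rather than part of the definition above: apostrophes are dropped like any other punctuation (so "it's" becomes "its"), and case is left untouched.

```ruby
# Split text on whitespace and remove punctuation from each token.
# Assumptions: apostrophes are removed along with other punctuation,
# case is preserved, and tokens that were only punctuation are dropped.
def words_in(text)
  text.split
      .map { |token| token.gsub(/[[:punct:]]/, '') }
      .reject(&:empty?)
end

p words_in('Hello, world! (Is it "done"?) -- yes.')
```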

G4143 · September 20, 2019, 8:31am

The big question is -> What do you mean by a text file… Do you mean ASCII text? Do you mean UTF-8?

John3415 · September 20, 2019, 9:02am

It should support parsing English only, let’s say articles from the NYT concatenated together and put in one text file (but with no line breaks).

G4143 · September 20, 2019, 9:46am

Do you have a link to some data that could represent the larger set or do you have some mechanism to generate that text file?

SouravGoswami · September 20, 2019, 2:05pm

I generated a 1.2 GiB text file by repeating the contents of /usr/share/dict/words 1200 times. But I think I created the wrong kind of file.

If you can give us a small sample of your file, say the first 20 lines, it would be very helpful… Otherwise you could just go with IO.foreach, which will not use a huge amount of memory…

John3415 · September 20, 2019, 3:57pm

I don’t have a sample just yet, but any article in English would do for generating one: remove the line breaks using this website:

and duplicate the text, for that matter, enough times until a large file size is achieved.
Something like this (just way larger):
https://ufile.io/sn9zrlu6
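
A rough sketch of how such a test file could be generated locally, assuming any English text is acceptable input. The method name write_repeated_text, the sample text, and the target size are all made up for illustration:

```ruby
require 'tempfile'

# Write `text` to `path` over and over, with line breaks collapsed into
# spaces, until at least `target_bytes` have been written.
# Returns the number of bytes written.
def write_repeated_text(path, text, target_bytes)
  one_line = text.gsub(/\s+/, ' ').strip + ' '   # no newlines, trailing space
  written = 0
  File.open(path, 'w') do |f|
    written += f.write(one_line) while written < target_bytes
  end
  written
end

# Demo with a tiny target; raise the target to get a multi-GB file.
Tempfile.create('big') do |f|
  bytes = write_repeated_text(f.path, "First line of an article.\nSecond line.", 1_000)
  puts "wrote #{bytes} bytes, file size is #{File.size(f.path)}"
end
```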

G4143 · September 20, 2019, 9:37pm

I see what you are getting at now. You are trying to account for a huge text file that may be one large line (no newlines).

Well, Ruby allows the programmer to determine what the input record separator character might be… I’d choose a space and read word after word. This sounds inefficient, but it really isn’t as inefficient as it sounds, because most operating systems read in a page (4k-8k) of data whether you read a single character or word or line.

$/
The input record separator (newline by default). gets, readline, etc., take their
input record separator as optional argument.
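
Since the separator can be passed as an argument, the global does not have to be changed at all (recent Ruby versions may also emit a deprecation warning when $/ is assigned). A minimal sketch, assuming words are separated by whitespace; the method name count_words_space_separated is hypothetical:

```ruby
require 'tempfile'

# Count words in a file that may be one huge line, using a space as the
# record separator so only one space-terminated record is in memory.
def count_words_space_separated(path)
  counts = Hash.new(0)
  File.foreach(path, ' ') do |record|
    # record looks like "word " (the separator is kept); split drops the
    # surrounding whitespace and also copes with stray newlines or tabs.
    record.split.each { |word| counts[word] += 1 }
  end
  counts
end

# Demo on a temporary single-line file (hypothetical sample text).
Tempfile.create('oneline') do |f|
  f.write('This is the first and this is the last')
  f.flush
  p count_words_space_separated(f.path)
end
```

Memory use is then bounded by the longest run of characters without a space, not by the file size.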

G4143 · September 20, 2019, 9:46pm

Here’s an example using the input record separator ($/).

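#! /usr/bin/env ruby

if __FILE__ == $0

  str = "This is the first"

  # Default separator (newline): the whole string is a single record.
  p str.lines.map{|l| l.chomp}

  orig = $/

  # With a space as the separator, String#lines and String#chomp both
  # use it by default, so each word becomes its own record.
  $/ = ' '

  p str.lines.map{|l| l.chomp}

  # Restore the default so later code is unaffected.
  $/ = orig

  p str.lines.map{|l| l.chomp}

end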
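
For the original question about reading in chunks, an alternative to the separator approach is to read fixed-size blocks and carry any partial word at the end of a block over to the next read. This is only a sketch under assumptions: the method name each_word_in_chunks and the default chunk size are made up, words are taken to be whitespace-separated, and the text is assumed to be UTF-8 (or plain ASCII).

```ruby
require 'stringio'

# Yield each whitespace-separated word from `io`, reading `chunk_size`
# bytes at a time. A word cut in half by a chunk boundary is kept in
# `carry` and completed by the next read.
def each_word_in_chunks(io, chunk_size = 64 * 1024)
  return enum_for(:each_word_in_chunks, io, chunk_size) unless block_given?

  carry = ''.b                          # binary buffer; read(n) returns binary
  while (chunk = io.read(chunk_size))   # nil at end of file
    carry << chunk
    cut = carry.rindex(/\s/)            # last whitespace seen so far
    next if cut.nil?                    # no complete word yet, keep reading

    # Everything up to the last whitespace consists of complete words.
    carry[0..cut].split.each { |w| yield w.force_encoding(Encoding::UTF_8) }
    carry = carry[(cut + 1)..-1]        # possibly partial last word
  end
  carry.split.each { |w| yield w.force_encoding(Encoding::UTF_8) }
end

# Demo with a tiny chunk size so words straddle chunk boundaries.
counts = Hash.new(0)
io = StringIO.new('This is the first and this is the last')
each_word_in_chunks(io, 8) { |word| counts[word] += 1 }
p counts
```

With a real file the same method could be called as File.open(path, 'rb') { |f| each_word_in_chunks(f) { |w| ... } }. Splitting on ASCII whitespace at the byte level is safe for UTF-8 because multi-byte characters never contain ASCII bytes.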