Text/Tag Cloud generation

I want to generate a “text cloud”, except there are a couple of things
out of the ordinary about what I am doing.

For one, it is not for viewing on the web, but rather for offline use.
Output to an image would be great, but HTML is still just fine
(especially as I imagine that would be much easier to accomplish).

Second, I am looking to do this with extremely large data sets (in
comparison to most text cloud generators I have seen): a minimum of
around 2 gigabytes of text, as high as 10 gigabytes or more. So
obviously I will need some sort of sliding threshold that a “word” must
cross before it shows up, as I won’t want every string to appear in the
cloud.

As most of the cloud generators I have seen seem directed at much
smaller datasets, I am worried about how they would scale and about
memory consumption. I have seen a number of Ruby-based cloud
generators; however, most of them are integrated into Rails
applications. I am hoping to do this standalone, ideally from the
command line.

I don’t have to do this in Ruby, as far as that goes; I am just working
off the thought that I can’t find a tool to do what I need, so I am
going to have to write one. So I was hoping at the very least to find
some code I can alter to do what I want. If anyone knows of a tool in
any language that does what I am looking for, that would be even
better!
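To be concrete, the counting half of what I’m after might look
something like this (a rough sketch only; the word regexp and the
cutoff are placeholders I made up). The point is to stream the file
line by line, so memory grows with the vocabulary rather than with the
input size, and to drop anything below a threshold:

```ruby
# Stream input line by line so memory use is bounded by the number of
# distinct words, not by the size of the input file.
def word_frequencies(io, min_count: 2)
  counts = Hash.new(0)
  io.each_line do |line|
    line.downcase.scan(/[a-z']+/) { |w| counts[w] += 1 }
  end
  # Keep only words that cross the threshold.
  counts.select { |_word, c| c >= min_count }
end
```

For a 10 GB file I would call it as
`word_frequencies(File.open("corpus.txt"), min_count: 100)` — the file
handle is read incrementally, never slurped into memory.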

Thanks for any help.
Jim

Jim Og wrote:

I want to generate a “text cloud”, except there are a couple things out

If anyone knows of a tool in any language
that does what I am looking for, that would be even better!

Thanks for any help.
Jim

Tools are listed just a minor scroll down the page

Enjoy!

Michael L. wrote:

Tools are listed just a minor scroll down the page

Enjoy!

Thanks for the reply. I had run across that link in the past, and the
problem I am having is that all the tools on there are designed for web
use and for smaller datasets than what I am working with. The
designed-for-web-use part is no big deal, as I can take the portion of
the code I am interested in, or just redirect input and output. But the
designed-for-smaller-datasets part is a bigger issue (no pun intended).

I need a tool which can process multiple Gig of data, without crashing,
and output a text cloud.
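For the output side, something like this is all I really need once the
frequency hash exists (the log scaling and the pixel sizes are just my
guesses at what looks reasonable, not taken from any existing tool):

```ruby
# Render a {word => count} hash as a simple HTML tag cloud.
# Font size is scaled logarithmically between min_px and max_px.
def cloud_html(counts, min_px: 10, max_px: 40)
  return "" if counts.empty?
  logs = counts.transform_values { |c| Math.log(c) }
  lo, hi = logs.values.minmax
  range = hi - lo
  spans = counts.keys.sort.map do |word|
    px = range.zero? ? max_px : min_px + ((logs[word] - lo) / range * (max_px - min_px)).round
    %(<span style="font-size:#{px}px">#{word}</span>)
  end
  "<div>#{spans.join(' ')}</div>"
end
```

Feeding it the filtered hash from the counting step and redirecting
stdout to a file would give me the offline HTML I described.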

Thanks
Jim

Jano S. wrote:

There can be
only so many words there.
Actually, no. The number of different words in a corpus goes up roughly
as O(sqrt(n)). That’s Heaps’ law.

On 10/12/07, Jim Og [email protected] wrote:

Thanks for the reply. I had run across that link in the past, and the
problem I am having is that all the tools on there are designed for web
use and for smaller datasets than what I am working with. The
designed-for-web-use part is no big deal, as I can take the portion of
the code I am interested in, or just redirect input and output. But the
designed-for-smaller-datasets part is a bigger issue (no pun intended).

I need a tool which can process multiple Gig of data, without crashing,
and output a text cloud.

Isn’t a text cloud just a fancy word/phrase frequency table, or am I
missing something here? If it indeed is, how fast do you need it to
be?

I wouldn’t worry about stability, but speed instead… There can be
only so many words there.

J.

On 10/12/07, Alex Y. [email protected] wrote:

Jano S. wrote:

There can be
only so many words there.
Actually, no. The number of different words in a corpus goes up roughly
as O(sqrt(n)). That’s Heaps’ law.

Ok ;) What I meant was that for a given corpus size (~10 GB), the hash
(word -> frequency) should be small enough to fit in memory. I hadn’t
known about Heaps’ law, thanks for the info. In some reference on the
web, they say for English Vr(n) ~ K * sqrt(n) where 10 < K < 100. That
means for n = 3 * 10^9, Vr(n) ≈ 5 * 10^6, and that’s not large
(well… ;)
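Spelling that estimate out in code (the bytes-per-entry figure is just
a rough guess at the cost of a Ruby hash entry, not a measurement):

```ruby
# Heaps' law: vocabulary size V(n) ~ K * sqrt(n) for n tokens.
k = 100                       # upper end of the 10..100 range for English
n = 3 * 10**9                 # tokens in a multi-gigabyte corpus
vocab = (k * Math.sqrt(n)).round
bytes_per_entry = 50          # rough guess: string key + count + hash overhead
puts vocab                    # on the order of 5 million distinct words
puts vocab * bytes_per_entry  # a few hundred MB -- fits in memory
```

So even with the most pessimistic K, the frequency hash stays far
smaller than the corpus itself.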

Anyway, thanks for the pointer.

Jano