Way to divide a long article and store it in a database

I wonder if a Ruby on Rails developer has encountered this before: suppose
there is a long article (say 100,000 words), and I need to write a Ruby
file to display page 1, page 2, or page 38 of the article via

display.html.erb?page=38

but the number of words per page can change over time (for example,
right now it is 500 words per page, but next month we may want to change
it to 300 words per page easily). What is a good way to divide the long
article and store it in the database?

P.S. The design may be complicated if we want to display 500 words but
include whole paragraphs. That is, if we are already showing word 480
but the current paragraph has 100 more words remaining, show those 100
words anyway even though it exceeds the 500-word limit.
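The paragraph-boundary rule described above can be sketched in plain Ruby (a minimal illustration, independent of Rails; `paginate` is a hypothetical helper name): keep appending whole paragraphs to the current page, and close the page once its word count reaches the limit, even if the last paragraph pushes it over.

```ruby
# Split article text into pages of roughly `words_per_page` words,
# but never break a paragraph across two pages. Paragraphs are
# assumed to be separated by blank lines.
def paginate(article_text, words_per_page = 500)
  pages   = []
  current = []
  count   = 0

  article_text.split(/\n{2,}/).each do |paragraph|
    current << paragraph
    count += paragraph.split.size
    # Close the page once the limit is reached; the last paragraph
    # is kept whole even if it overshoots the limit.
    if count >= words_per_page
      pages << current.join("\n\n")
      current = []
      count = 0
    end
  end
  pages << current.join("\n\n") unless current.empty?
  pages
end
```

With a 500-word limit and a 580-word final paragraph on a page, the whole paragraph stays on that page, exactly as the P.S. asks.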

Make each page a text file, put them all in a directory (document/1.txt,
document/2.txt, etc.), and then you won’t even have to use the database.

Jian L. wrote:

I wonder if a Ruby on Rails developer has encountered this before: suppose
there is a long article (say 100,000 words), and I need to write a Ruby
file to display page 1, page 2, or page 38 of the article via

display.html.erb?page=38

but the number of words per page can change over time (for example,
right now it is 500 words per page, but next month we may want to change
it to 300 words per page easily

Why divide it in the database? Store it in one field in the database,
and when you fetch it from the database, just perform the logic to find
page=38 and then display that.
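That slice-at-display-time logic could look something like this (a minimal sketch in plain Ruby; `page_of` is a hypothetical helper, and it splits on plain word counts, ignoring the paragraph-boundary wrinkle from the P.S.):

```ruby
# Return page `page_number` (1-based) of `text`, slicing the full
# article at display time. Nothing about the paging is stored, so
# the page size can change whenever you like.
def page_of(text, page_number, words_per_page = 500)
  words = text.split
  start = (page_number - 1) * words_per_page
  return "" if start >= words.size
  words[start, words_per_page].join(" ")
end
```

Because the page size is just a parameter, changing from 500 to 300 words per page requires no data migration at all.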

If actual testing indicates that’s too slow with the actual quantity of
data you expect, then you’d have to perform a word-boundary calculation
when inserting the value into the db, and store the results as an
‘index’ into the text somehow.
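One way such an ‘index’ could be built (a sketch under the assumption that pages break on plain word counts; `page_offsets` is a hypothetical helper): precompute the character offset where each page starts and store those offsets alongside the article, so a later request can fetch just one slice, e.g. with a SQL SUBSTRING, instead of reading the whole column.

```ruby
# Compute the character offset at which each page of `text` begins,
# with pages of `words_per_page` words. Returns an array of offsets;
# offsets[n] is where page n+1 starts.
def page_offsets(text, words_per_page = 500)
  offsets = [0]
  count = 0
  text.scan(/\S+/) do
    count += 1
    if count == words_per_page
      # $~ is the MatchData for the word just matched; the next
      # page starts at the character right after it.
      offsets << $~.end(0)
      count = 0
    end
  end
  offsets
end
```

With those offsets stored next to the article, serving page 38 becomes a single substring fetch rather than pulling the full 600 KB field.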

Either way, I don’t see any reason to actually split up the text in
the db. Unless you want to let the user search for, say, word X on
page N of the text. But then you’re getting into complicated enough
text-searching territory that I’d investigate using something like
Lucene/Solr to index your text instead of an rdbms, and see what
support for page-boundary-based searching Lucene/Solr has.

Jonathan R. wrote:

Jian L. wrote:

I wonder if a Ruby on Rails developer has encountered this before: suppose
there is a long article (say 100,000 words), and I need to write a Ruby
file to display page 1, page 2, or page 38 of the article via

display.html.erb?page=38

but the number of words per page can change over time (for example,
right now it is 500 words per page, but next month we may want to change
it to 300 words per page easily

Why divide it in the database? Store it in one field in the database,
and when you fetch it from the database, just perform the logic to find
page=38 and then display that.

Is it true that if all the 100,000 words are in one record (one row),
then every time the whole field needs to be retrieved? If we assume
one word is about 6 characters long (with the space), then it is
600 kbytes per read. I hope to make it “read as needed”: about 500
words and 3 kbytes read per page each time.

If you must split it up in the database, changing your mind from 500
to 300 is going to suck; otherwise you might use a “pages” association
or something of the like, which would be very simple…

for instance:

class Article < ActiveRecord::Base
  has_many :pages

  validates_presence_of :text

  after_create :split_into_pages

  private

  # Split the article text into 500-word pages once the record is saved.
  def split_into_pages
    i = 0
    text.scan(/\b\S+\b/).each_slice(500) do |words|
      pages.create(:page => i += 1, :text => words.join(" "))
    end
  end

end

class Page < ActiveRecord::Base
belongs_to :article
end
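If the per-page word count does change later, every stored page has to be rebuilt, which is the pain mentioned above. The re-split itself is cheap, though; as a plain-Ruby sketch operating on the page texts (`repaginate` is a hypothetical helper, not part of the classes above):

```ruby
# Rebuild a list of page texts at a new page size: join the old
# pages back into one word stream, then slice it again.
def repaginate(old_pages, new_words_per_page)
  old_pages.join(" ")
           .split
           .each_slice(new_words_per_page)
           .map { |words| words.join(" ") }
end
```

In the Rails model above, the equivalent would be destroying the existing `pages` records and recreating them from `text` inside a transaction.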

Someone probably has a MUCH prettier way of doing this; it was just
kind of on-the-fly…

Cheers!