Opening a large file many times / optimisation

hello,

I have a method that basically searches through a largish (5 MB) text
file for a word. I need to call this method about 1400 times, and I
care about speed.

If I open the file at the start of the script and then pass the file
object as a parameter to my method each time it's called, the code runs
quite a bit faster than if I open the file inside the method each
time, but this seems ugly to me.

Is there a standard way to do this in Ruby? How much overhead is
involved in opening a large text file?

thanks.

On Mar 11, 2007, at 11:30 AM, Paul Nulty wrote:

involved in opening a large text file?
Well, if you have enough RAM to support pulling it into memory,
that’s certainly going to be faster. However, there are some
techniques you could use to speed up an index-and-query operation.
See this old Ruby Quiz for some ideas:

http://www.rubyquiz.com/quiz54.html

James Edward G. II

  1. You need to define the problem better. Are you searching for a
    different word each time, does the file change each time, etc. Why do
    you have to call it 1400 times?

OK, here are a few lines from the file I’m searching (it’s a WordNet file
that holds different senses of words):

concavity%1:07:00:: 05070032 2 0
concavity%1:25:00:: 13864965 1 0
concavo-concave%5:00:00:concave:00 00536008 1 0
concavo-convex%5:00:00:concave:00 00536416 1 0
conceal%2:39:00:: 02146790 2 1
conceal%2:39:01:: 02144835 1 8
concealed%3:00:00:: 02088404 2 1
concealed%5:00:00:invisible:00 02517817 1 2
concealing%1:04:00:: 01048912 1 0
concealing%3:00:00:: 02091020 1 0

I need to search for the first part (e.g. conceal%2:39:00::) and
return the second-last number (e.g. 2). (Getting the sense from the
sense key, if you know WordNet.)

I have 1400 words, and the WordNet file will never change. I’m unlikely to
need to scale up much past 1400.

Here’s my code (senseKey is e.g. “conceal%2:39:00::”):

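# Read the whole file once, up front
lines = File.readlines("/usr/local/WordNet-3.0/dict/index.sense")

# Gets the sense number (second-last field) from a sense key
def getSense(senseKey, lines)
  for line in lines
    # A match means the line starts with the sense key
    if line.index(senseKey) == 0
      words = line.split(" ")
      return words[-2]
    end
  end
  nil # no line matched (a bare for loop would return the whole array)
end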

thanks again!

Paul Nulty wrote:

Is there a standard way to do this in ruby? How much overhead is
involved in opening a large text file?

thanks.

  1. You need to define the problem better. Are you searching for a
    different word each time, does the file change each time, etc. Why do
    you have to call it 1400 times?

  2. Searching and indexing are extremely well documented areas of
    computer science. Once you’ve correctly defined your problem, I’m sure
    you’ll come up with something far more efficient than a brute force
    “open a five megabyte file, read the whole enchilada into RAM, and do a
    text search for the word, then close the file and wait for the next
    request”.

  3. Do you care about scalability, or is the file never going to get
    bigger than 5 MBytes? Is the method always going to be called “only”
    1400 times, or will someone see your success and say, “Great – here’s
    20 million words!”?


M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P)
http://borasky-research.blogspot.com/

If God had meant for carrots to be eaten cooked, He would have given
rabbits fire.

Paul Nulty wrote:

concavo-concave%5:00:00:concave:00 00536008 1 0
return the second last number (eg. 2). (getting the sense from the
def getSense(senseKey,lines)

Isn’t there a Ruby/Wordnet interface? Doctor Google recommended
dev(E)iate


M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P)
http://borasky-research.blogspot.com/


On Mon, Mar 12, 2007 at 02:25:08AM +0900, Paul Nulty wrote:

concavo-concave%5:00:00:concave:00 00536008 1 0
return the second last number (eg. 2). (getting the sense from the
sense key, if you know wordnet)

i have 1400 words, the wordnet file will never change. i’m unlikely to
need to scale up much past 1400.

If you’re searching a 5MB file 1400 times, it’s almost certainly worth
reading it in once and building a hash as you go. Remember that on
average, you are reading half the lines in the file on every search. So
you should speed up by a factor of nearly 700 just by doing this.

If the wordnet file is too big to fit into RAM, then there are ways of
indexing the file on disk to make it quicker to search (external
searching).
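One way the on-disk idea could look, as a rough sketch only: keep just each key and the byte offset of its line in memory, then seek to the line when it is needed. The class name OffsetIndex and its methods are made up for illustration, and the demo runs on a few lines from this thread written to a temporary file rather than on the real index.sense. For this particular file the keys make up most of each line, so the memory saving would be small; the approach pays off more when lines are long relative to their keys.

```ruby
require "tempfile"

# Hypothetical helper (not from the thread): remembers where each line
# starts, so only the keys and offsets live in memory, not the lines.
class OffsetIndex
  def initialize(filename)
    @file = File.open(filename, "rb")   # one handle, reused for every lookup
    @offsets = {}
    until @file.eof?
      pos = @file.pos                   # offset of the line about to be read
      key = @file.gets.split(" ", 2).first
      @offsets[key] = pos
    end
  end

  # Returns the fields that follow the key, or nil for an unknown key.
  def fields(sense_key)
    pos = @offsets[sense_key]
    return nil unless pos
    @file.seek(pos)
    @file.gets.chomp.split(" ").drop(1)
  end

  def close
    @file.close
  end
end

# Demo data: three lines copied from the thread.
sample = <<~DATA
  conceal%2:39:00:: 02146790 2 1
  conceal%2:39:01:: 02144835 1 8
  concealed%3:00:00:: 02088404 2 1
DATA

demo = Tempfile.new("index.sense")
demo.write(sample)
demo.flush

index = OffsetIndex.new(demo.path)
sense_number = index.fields("conceal%2:39:00::")[1]
puts sense_number
```

Each lookup costs one seek and one line read, with no rescan of the file and no reopening.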


Try something like:

class Wordnet
  def initialize(filename)
    @words = {}
    File.open(filename) do |f|
      f.each_line do |line|
        fields = line.chomp.split(/ /)
        key = fields.shift
        @words[key] = fields
      end
    end
  end

  def sysnet(senseKey)
    @words[senseKey][1]
  end
end

wn = Wordnet.new("/usr/local/WordNet-3.0/dict/index.sense")

# Now do this 1400 times for different keys

puts wn.sysnet("conceal%2:39:00::")

Isn’t there a Ruby/Wordnet interface? Doctor Google recommended
dev(E)iate

yep i’m using that; it’s great but as far as i can tell it doesn’t use
sense keys, it uses sense numbers. I only have the sense keys, so i
need to get the sense number from the sense key manually.

On 11.03.2007 18:55, Paul Nulty wrote:

Isn’t there a Ruby/Wordnet interface? Doctor Google recommended
dev(E)iate

yep i’m using that; it’s great but as far as i can tell it doesn’t use
sense keys, it uses sense numbers. I only have the sense keys, so i
need to get the sense number from the sense key manually.

Try reading the file and storing all combinations in a Hash with sense
key as key and number as value.
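A minimal sketch of that suggestion, assuming every line has the layout shown earlier in the thread (key, then three space-separated fields). The helper name build_sense_table is made up for illustration, and the sample lines are copied from the thread rather than read from the real file. With the real file, File.foreach(path) could be passed in place of the array.

```ruby
# Hypothetical helper: builds { sense_key => second-last field } from
# lines laid out like the ones quoted in the thread.
def build_sense_table(lines)
  lines.each_with_object({}) do |line, table|
    key, _second, number, _last = line.split(" ")
    table[key] = number if key      # skip blank lines
  end
end

sample_lines = [
  "conceal%2:39:00:: 02146790 2 1",
  "conceal%2:39:01:: 02144835 1 8",
  "concealed%5:00:00:invisible:00 02517817 1 2"
]

table = build_sense_table(sample_lines)
puts table["conceal%2:39:00::"]              # prints 2
puts table.fetch("no-such-key", "unknown")   # prints unknown
```

Using fetch with a default (or checking for nil) keeps a missing key from turning into an error further down.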

robert

Thanks!

before:

      user     system      total        real
142.800000   0.100000 142.900000 (156.818797)

after (with hash):

  9.900000   0.100000  10.000000 ( 11.259273)
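For anyone wanting to reproduce this kind of comparison without the WordNet file, here is a rough sketch using Ruby's standard Benchmark module. The data is synthetic (20,000 made-up lines in the same layout, 200 lookups), the scan method is a stand-in for the original getSense, and the timings it prints will not match the figures above.

```ruby
require "benchmark"

# Synthetic stand-in for the real file: made-up keys, same field layout.
lines = (1..20_000).map { |i| format("word%05d%%1:00:00:: %08d %d 0", i, i, i % 9 + 1) }
keys  = lines.each_slice(100).map { |group| group.first.split(" ").first }  # 200 keys

# Linear scan: walk the lines until one starts with the key.
def scan(lines, key)
  lines.each { |line| return line.split(" ")[-2] if line.start_with?(key) }
  nil
end

# Hash built once: key => second-last field.
table = lines.each_with_object({}) do |line, t|
  fields = line.split(" ")
  t[fields[0]] = fields[-2]
end

scan_results = nil
hash_results = nil
puts Benchmark.measure { scan_results = keys.map { |k| scan(lines, k) } }
puts Benchmark.measure { hash_results = keys.map { |k| table[k] } }
```

Both approaches should return the same answers, and only the time taken differs. In a real run the cost of building the hash belongs inside the measurement too.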

thanks again.