The fastest way to read files

dubstep · November 9, 2011, 5:56pm

Does anybody know the fastest way to read a file? Let’s say
there are 1,000,000 files, none larger than 10 KB.

Thanks in advance.

nax · November 9, 2011, 6:25pm

What are you going to do with the files? Do you need to read all the
data before you start processing it, or can you process the files
sequentially? Can you do it in a distributed way? It really depends.

-Jingjing

nax · November 9, 2011, 6:38pm

Sequentially… The point is that I need to process each pair of files,
so I don’t know which of the different ways that Ruby has to read files
is the fastest.

Thank you.

nax · November 9, 2011, 6:57pm

For small files, there shouldn’t be much of a difference between reading
line by line and reading all the lines in a single sweep. Ruby IO handles
the buffering for you.

If you are really concerned about performance, why not do some
benchmarking?
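
A minimal sketch of such a benchmark, using the standard Benchmark module. The sample files, their number and their content are made up for illustration; real timings depend on the disk, the OS cache and the Ruby version, so treat this as a starting point, not as a result.

```ruby
require "benchmark"
require "tmpdir"

# Three common ways to get at a file's content.
# Each helper returns the number of bytes it read.
def read_whole(path)
  File.read(path).bytesize
end

def read_lines(path)
  File.readlines(path).sum(&:bytesize)
end

def read_foreach(path)
  total = 0
  File.foreach(path) { |line| total += line.bytesize }
  total
end

Dir.mktmpdir do |dir|
  # Hypothetical test data: 200 files of roughly 10 KB each.
  paths = Array.new(200) do |i|
    path = File.join(dir, "sample_#{i}.txt")
    File.write(path, "some words, and punctuation!\n" * 350)
    path
  end

  Benchmark.bm(14) do |bm|
    bm.report("File.read")      { paths.each { |f| read_whole(f) } }
    bm.report("File.readlines") { paths.each { |f| read_lines(f) } }
    bm.report("File.foreach")   { paths.each { |f| read_foreach(f) } }
  end
end
```

Running it several times, and with files that resemble the real ones, gives a fairer picture than a single run.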

-Jingjing

nax · November 9, 2011, 7:23pm

Perfect! Thank you Jing.

nax · November 10, 2011, 2:17pm

2011/11/9 No Alejandro [email protected]:

Sequentially… The point is that I need to process each pair of files,
so I don’t know which of the different ways that Ruby has to read files
is the fastest.

What kind of processing do you need to do on those files?

Kind regards

robert

nax · November 11, 2011, 1:03pm

Hi Robert.

Basically, I read their content in order to remove blanks and punctuation
marks and to lowercase the text. I don’t need to write the result back;
I only read the files and then close them.

Regards.

nax · November 11, 2011, 1:16pm

2011/11/11 No Alejandro [email protected]:

Basically, I read their content in order to remove blanks and punctuation
marks and to lowercase the text. I don’t need to write the result back;
I only read the files and then close them.

I don’t understand: you wrote earlier you need to process pairs of
files but all these operations mentioned above can be done on a single
file. Plus, if you do not write the modified content anywhere what’s
the point of the exercise? That would only burn CPU and disk IO for
nothing.

Kind regards

robert

nax · November 11, 2011, 1:33pm

I mean, the final processing compares the (preprocessed) content of
each pair of texts. So I open a file, remove blanks and so on, and
record the information in a data structure. Then I open another file,
remove blanks and so on, and record this new information in another data
structure. Now I have the preprocessed information for a pair of texts, and
I apply further processing to it.

I repeat the previous steps for each text.
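
The steps described above could be sketched roughly as follows. The names preprocess and compare, the sample files, and the comparison itself (share of common words) are assumptions for illustration only; the actual comparison is not stated in this thread.

```ruby
require "set"
require "tmpdir"

# Lowercase the text and keep only runs of word characters,
# which drops blanks and punctuation in one pass.
def preprocess(path)
  File.read(path).downcase.scan(/\w+/)
end

# Stand-in for the real comparison (an assumption): the share of
# distinct words that two texts have in common (Jaccard index).
def compare(words_a, words_b)
  a = words_a.to_set
  b = words_b.to_set
  union = a | b
  return 1.0 if union.empty?
  (a & b).size.to_f / union.size
end

Dir.mktmpdir do |dir|
  # Hypothetical sample texts.
  samples = {
    "a.txt" => "Hello, World!",
    "b.txt" => "hello   there, world.",
    "c.txt" => "Nothing shared"
  }
  paths = samples.map do |name, body|
    path = File.join(dir, name)
    File.write(path, body)
    path
  end

  # Read and clean every file once, then compare the cached results pairwise.
  cache = paths.map { |path| [path, preprocess(path)] }.to_h
  paths.combination(2) do |x, y|
    score = compare(cache[x], cache[y]).round(2)
    puts "#{File.basename(x)} vs #{File.basename(y)}: #{score}"
  end
end
```

Preprocessing each file once and reusing the result avoids reading the same file again for every pair it takes part in. Whether all preprocessed texts fit in memory at once depends on the data.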

nax · November 11, 2011, 1:51pm

2011/11/11 No Alejandro [email protected]:

I mean, the final processing compares the (preprocessed) content of
each pair of texts. So I open a file, remove blanks and so on, and
record the information in a data structure. Then I open another file,
remove blanks and so on, and record this new information in another data
structure. Now I have the preprocessed information for a pair of texts, and
I apply further processing to it.

I repeat the previous steps for each text.

Aha. I assume you do your analysis based on words. In that case
something like this might be efficient:

# ensure every word is only once in memory
words = Hash.new {|h,k| k.freeze; h[k] = k}
…

words_in_file = []

File.foreach a_file_name do |line|
  line.scan(/\w+/) do |word|
    word.downcase!
    words_in_file << words[word]
  end
end
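
For anyone who wants to try the idea, here is a self-contained variant of the snippet above. The constant WORDS, the method words_in and the two sample files are made up for illustration; the hash and the loop are the same technique.

```ruby
require "tmpdir"

# One shared table: the first String seen for a word is frozen and
# stored; later lookups of an equal word return that same object.
WORDS = Hash.new { |h, k| k.freeze; h[k] = k }

# Returns the file's words, lowercased, as references into WORDS.
def words_in(path)
  result = []
  File.foreach(path) do |line|
    line.scan(/\w+/) do |word|
      word.downcase!
      result << WORDS[word]
    end
  end
  result
end

Dir.mktmpdir do |dir|
  # Hypothetical sample files.
  first  = File.join(dir, "first.txt")
  second = File.join(dir, "second.txt")
  File.write(first,  "The quick fox.\nThe lazy dog.\n")
  File.write(second, "A quick dog!\n")

  a = words_in(first)
  b = words_in(second)

  p a
  p b
  # "quick" occurs in both files; both arrays hold the same String object.
  p a[1].equal?(b[1])
  # Number of distinct words kept in memory across both files.
  p WORDS.size
end
```

Since every word array only holds references into the shared table, a word that appears in many files costs one String object plus one reference per occurrence.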

Kind regards

robert

nax · November 11, 2011, 10:08pm

On Nov 11, 2011, at 04:50 , Robert K. wrote:

# ensure every word is only once in memory

words = Hash.new {|h,k| k.freeze; h[k] = k}

AFAIK, Ruby hashes have (almost) always frozen their keys.

irb(main):001:0> h = {}
=> {}
irb(main):002:0> h["blah"] = 42
=> 42
irb(main):003:0> h.keys.map(&:frozen?)
=> [true]

nax · November 12, 2011, 12:21am

On Fri, Nov 11, 2011 at 10:07 PM, Ryan D. [email protected]
wrote:

On Nov 11, 2011, at 04:50 , Robert K. wrote:

# ensure every word is only once in memory

words = Hash.new {|h,k| k.freeze; h[k] = k}

AFAIK, Ruby hashes have (almost) always frozen their keys.

I know. That’s the reason why I do the freeze in the block.

Cheers

robert

nax · November 12, 2011, 12:22am

On Sat, Nov 12, 2011 at 12:20 AM, Robert K.
[email protected] wrote:

On Fri, Nov 11, 2011 at 10:07 PM, Ryan D. [email protected] wrote:

On Nov 11, 2011, at 04:50 , Robert K. wrote:

# ensure every word is only once in memory

words = Hash.new {|h,k| k.freeze; h[k] = k}

AFAIK, Ruby hashes have (almost) always frozen their keys.

I know. That’s the reason why I do the freeze in the block.

PS: It’s true for String keys only.
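
A quick way to see the difference between String keys and other keys; the variable names are only for illustration. The .dup calls make sure the String is mutable regardless of any frozen-string-literal setting.

```ruby
h = {}

str = "blah".dup   # a mutable String
arr = [1, 2]       # a mutable Array

h[str] = :string_key
h[arr] = :array_key

string_key, array_key = h.keys

p string_key.frozen?       # String key: stored as a frozen copy
p string_key.equal?(str)   # so it is not the object that was passed in
p array_key.frozen?        # Array key: not frozen
p array_key.equal?(arr)    # and it is the very object that was passed in
```

Because non-String keys are stored as-is, changing such a key after insertion changes its hash value, and lookups may fail until Hash#rehash is called.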

nax · November 11, 2011, 4:01pm

Great! Thanks for the advice.

Greetings.

nax · November 12, 2011, 1:03am

On Nov 11, 2011, at 15:20 , Robert K. wrote:

On Fri, Nov 11, 2011 at 10:07 PM, Ryan D. [email protected] wrote:

On Nov 11, 2011, at 04:50 , Robert K. wrote:

# ensure every word is only once in memory

words = Hash.new {|h,k| k.freeze; h[k] = k}

AFAIK, Ruby hashes have (almost) always frozen their keys.

I know. That’s the reason why I do the freeze in the block.

I’m confused. If you know that the key is going to be frozen anyways,
why freeze it?

nax · November 12, 2011, 12:35pm

On Sat, Nov 12, 2011 at 1:26 AM, Eric W. [email protected]
wrote:

I know. That’s the reason why I do the freeze in the block.

I’m confused. If you know that the key is going to be frozen anyways,
why freeze it?

The implicit freeze from Hash#[]= duplicates the string and freezes
the duplicate, not the same object given by the user.

Explicitly freezing the key before Hash#[]= prevents MRI[1] from
duplicating the string.

Exactly. And in that case we would end up with two objects in memory
where one is sufficient:

irb(main):008:0> s = "foo"
=> "foo"
irb(main):009:0> h = Hash.new {|ha,k| ha[k]=k}
=> {}
irb(main):010:0> h[s]
=> "foo"
irb(main):011:0> h.each {|k,v| puts k.object_id, v.object_id}
137705420
137645970
=> {"foo"=>"foo"}
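
For comparison, a small sketch that puts both variants of the default block side by side and checks object identity with equal? instead of reading object ids. The hash names plain and interned are made up for this example.

```ruby
plain    = Hash.new { |h, k| h[k] = k }             # no explicit freeze
interned = Hash.new { |h, k| k.freeze; h[k] = k }   # freeze before storing

s = "foo".dup   # .dup guarantees a mutable String
plain[s]
key, value = plain.first
# Without the freeze, the key is a frozen copy and the value is s itself.
p key.equal?(value)

t = "foo".dup
interned[t]
key, value = interned.first
# With the freeze, key and value are one and the same object.
p key.equal?(value)
# Side effect to be aware of: the caller's string is now frozen.
p t.frozen?
```

The explicit freeze saves one String per distinct word, at the price of freezing the object the caller passed in. In the word-scanning loop that is harmless, because each scanned word is a fresh String that is not modified afterwards.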

Kind regards

robert

nax · November 12, 2011, 1:27am

Ryan D. [email protected] wrote:

I’m confused. If you know that the key is going to be frozen anyways,
why freeze it?

The implicit freeze from Hash#[]= duplicates the string and freezes
the duplicate, not the same object given by the user.

Explicitly freezing the key before Hash#[]= prevents MRI[1] from
duplicating the string.

------------------------ freeze_example.rb --------------------
h = {}
frozen = "foo".freeze
h[frozen] = true

# explicitly freezing the key means the String object is stored as-is
p [ :frozen_original_key, frozen.object_id ]
p [ :frozen_key_after_aset, h.keys[0].object_id ]

h = {}
not_frozen = "foo"
h[not_frozen] = true

# Not freezing means the key stored in the hash is a different
# object than the one provided by the user.
p [ :not_frozen_original_key, not_frozen.object_id ]
p [ :not_frozen_key_after_aset, h.keys[0].object_id ]

------------------------ Output --------------------------------
[:frozen_original_key, 70096844693360]
[:frozen_key_after_aset, 70096844693360]
[:not_frozen_original_key, 70096844693120]
[:not_frozen_key_after_aset, 70096844694120]

[1] - Verified by reading rb_hash_aset() in hash.c which eventually
calls rb_str_new_frozen() in string.c (ruby/trunk):

rb_str_new_frozen(VALUE orig)
{
    VALUE klass, str;

    if (OBJ_FROZEN(orig)) return orig;

    ...

nax · December 29, 2011, 10:47am

-----Original message-----
From: Robert K. [mailto:[email protected]]
Sent: Saturday, November 12, 2011 12:35
To: ruby-talk ML
Subject: Re: The fastest way to read files

On Sat, Nov 12, 2011 at 1:26 AM, Eric W. [email protected]
wrote:

Ryan D. [email protected] wrote:

On Nov 11, 2011, at 15:20 , Robert K. wrote:

On Fri, Nov 11, 2011 at 10:07 PM, Ryan D. [email protected]
wrote:
why freeze it?

The implicit freeze from Hash#[]= duplicates the string and freezes
the duplicate, not the same object given by the user.

Explicitly freezing the key before Hash#[]= prevents MRI[1] from
duplicating the string.

Exactly. And in that case we would end up with two objects in memory
where one is sufficient:

irb(main):008:0> s = "foo"
=> "foo"
irb(main):009:0> h = Hash.new {|ha,k| ha[k]=k}
=> {}
irb(main):010:0> h[s]
=> "foo"
irb(main):011:0> h.each {|k,v| puts k.object_id, v.object_id}
137705420
137645970
=> {"foo"=>"foo"}

Kind regards

robert
