The fastest way to read files

Does anybody know the fastest way to read a file? Let’s say there are
1,000,000 files, none of them larger than 10 KB.

Thanks in advance.

What are you going to do with the files? Do you need to read all the
data before you start processing it, or can you process the files
sequentially? Can you do it in a distributed way? It really depends.

-Jingjing

Sequentially. I need to process the files in pairs, and I don’t know
which of the different ways Ruby offers for reading files is the
fastest.

Thank you.

For small files, there shouldn’t be much of a difference between reading
line by line and reading all the lines in a single sweep. Ruby’s IO
handles the buffering for you.

If you are really concerned about performance, why not do some
benchmarking?
http://www.ruby-doc.org/stdlib-1.9.2/libdoc/benchmark/rdoc/Benchmark.html
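A minimal sketch of such a benchmark, under some assumptions: the helper names (`make_sample_files`, `read_whole`, `read_lines`, `read_foreach`) are hypothetical, the sample files are generated in a temporary directory rather than taken from the real data set, and the numbers will depend on the disk, the OS cache and the Ruby version.

```ruby
require "benchmark"
require "tmpdir"

# Hypothetical helper: write `count` small text files of roughly `size` bytes.
def make_sample_files(dir, count: 200, size: 10_000)
  line = "The quick brown fox jumps over the lazy dog.\n"
  content = line * (size / line.bytesize)
  Array.new(count) do |i|
    path = File.join(dir, "sample_#{i}.txt")
    File.write(path, content)
    path
  end
end

# Three common ways to read a file; each returns the number of bytes read.
def read_whole(path)
  File.read(path).bytesize
end

def read_lines(path)
  File.readlines(path).sum(&:bytesize)
end

def read_foreach(path)
  total = 0
  File.foreach(path) { |line| total += line.bytesize }
  total
end

Dir.mktmpdir do |dir|
  paths = make_sample_files(dir)
  Benchmark.bm(16) do |bm|
    bm.report("File.read")      { paths.each { |path| read_whole(path) } }
    bm.report("File.readlines") { paths.each { |path| read_lines(path) } }
    bm.report("File.foreach")   { paths.each { |path| read_foreach(path) } }
  end
end
```

Running it against a sample of the real files, ideally more than once so the effect of the OS file cache becomes visible, should give a more trustworthy picture than synthetic data.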

-Jingjing

Perfect! Thank you Jing.

What kind of processing do you need to do on those files?

Kind regards

robert

Hi Robert.

Basically, I read their content in order to remove blanks and
punctuation marks and to lowercase the text. I don’t need to write
anything back; I only read the files and then close them.
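One way this clean-up step could look, as a rough sketch: `normalize` and `normalized_file` are hypothetical names, and the character classes used here are an assumption about what counts as a blank or a punctuation mark.

```ruby
# Lowercase the text and drop all whitespace and punctuation characters.
def normalize(text)
  text.downcase.gsub(/[[:space:][:punct:]]+/, "")
end

# Read a whole file (fine for files of a few KB) and normalize its content.
# File.read opens and closes the file itself.
def normalized_file(path)
  normalize(File.read(path))
end

puts normalize("Hello, World!  It's  Ruby.")
```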

Regards.

I don’t understand: you wrote earlier you need to process pairs of
files but all these operations mentioned above can be done on a single
file. Plus, if you do not write the modified content anywhere what’s
the point of the exercise? That would only burn CPU and disk IO for
nothing.

Kind regards

robert

I mean, the final processing compares the (preprocessed) content of
each pair of texts. So I open a file, remove blanks and so on, and
record the information in a data structure. Then I open another file,
remove blanks and so on, and record this new information in another data
structure. Now I have the preprocessed information for a pair of texts,
and I apply further processing to it.

I repeat the previous steps for each text.
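The workflow described above could be sketched roughly as follows. This is an illustration under assumptions, not the actual program: `preprocess` and `each_preprocessed_pair` are hypothetical names, "each pair" is taken to mean every combination of two files, and the comparison itself is left to the caller's block. Caching the preprocessed result means each file is read only once, however many pairs it appears in.

```ruby
# Read one file and reduce it to a list of lowercase words.
def preprocess(path)
  File.read(path).downcase.scan(/\w+/)
end

# Yield every pair of paths together with their preprocessed contents.
# Each file is read and preprocessed at most once.
def each_preprocessed_pair(paths)
  cache = {}
  paths.combination(2) do |a, b|
    cache[a] ||= preprocess(a)
    cache[b] ||= preprocess(b)
    yield a, b, cache[a], cache[b]
  end
end
```

A caller would then write something like `each_preprocessed_pair(paths) { |a, b, words_a, words_b| ... }` and do the comparison inside the block. Note that the cache keeps every preprocessed file in memory, which may or may not be acceptable for a million files.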

Aha. I assume you do your analysis based on words. In that case
something like this might be efficient:

# ensure every word is only once in memory
words = Hash.new {|h,k| k.freeze; h[k] = k}

words_in_file = []

File.foreach a_file_name do |line|
  line.scan(/\w+/) do |word|
    word.downcase!
    words_in_file << words[word]
  end
end
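To see what the word pool buys, here is a small self-contained sketch that works on strings instead of files. `WORD_POOL` and `words_of` are names made up for this illustration; the pool itself follows the `Hash.new` idiom shown above.

```ruby
# Pool of unique word objects: a missing word is frozen and stored as both
# key and value, so later lookups return the very same String object.
WORD_POOL = Hash.new { |pool, word| word.freeze; pool[word] = word }

# Split a text into lowercase words, replacing each by its pooled instance.
def words_of(text, pool = WORD_POOL)
  text.scan(/\w+/).map { |word| pool[word.downcase] }
end

first  = words_of("The cat saw the dog")
second = words_of("A dog saw THE cat")

# Both occurrences of "the" refer to one object, not to two equal copies.
puts first[0].equal?(second[3])
puts WORD_POOL.size
```

With many texts sharing a common vocabulary, the word arrays then hold references to a comparatively small set of String objects.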

Kind regards

robert

On Nov 11, 2011, at 04:50, Robert K. wrote:

# ensure every word is only once in memory
words = Hash.new {|h,k| k.freeze; h[k] = k}

AFAIK, Ruby hashes have (almost) always frozen their keys.

irb(main):001:0> h = {}
=> {}
irb(main):002:0> h["blah"] = 42
=> 42
irb(main):003:0> h.keys.map(&:frozen?)
=> [true]

On Fri, Nov 11, 2011 at 10:07 PM, Ryan D. wrote:

AFAIK, Ruby hashes have (almost) always frozen their keys.

I know. That’s the reason why I do the freeze in the block.

Cheers

robert

On Sat, Nov 12, 2011 at 12:20 AM, Robert K. wrote:

I know. That’s the reason why I do the freeze in the block.

PS: It’s true for String keys only.
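A short sketch of that difference, assuming MRI and using variable names made up for the illustration: an unfrozen String key is copied and the copy is frozen, while a key of another class (an Array here) is stored as the very object that was passed in.

```ruby
h = {}
str_key = String.new("blah")   # explicitly unfrozen String
arr_key = [1, 2]

h[str_key] = 1
h[arr_key] = 2

stored_str, stored_arr = h.keys

p [:string_key_frozen, stored_str.frozen?]
p [:string_key_same_object, stored_str.equal?(str_key)]
p [:array_key_frozen, stored_arr.frozen?]
p [:array_key_same_object, stored_arr.equal?(arr_key)]
```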

Great! Thanks for the advice.

Greetings.

On Nov 11, 2011, at 15:20 , Robert K. wrote:

I know. That’s the reason why I do the freeze in the block.

I’m confused. If you know that the key is going to be frozen anyways,
why freeze it?

On Sat, Nov 12, 2011 at 1:26 AM, Eric W. wrote:

I know. That’s the reason why I do the freeze in the block.

I’m confused. If you know that the key is going to be frozen anyways,
why freeze it?

The implicit freeze from Hash#[]= duplicates the string and freezes
the duplicate, not the same object given by the user.

Explicitly freezing the key before Hash#[]= prevents MRI[1] from
duplicating the string.

Exactly. And in that case we would end up with two objects in memory
where one is sufficient:

irb(main):008:0> s = "foo"
=> "foo"
irb(main):009:0> h = Hash.new {|ha,k| ha[k]=k}
=> {}
irb(main):010:0> h[s]
=> "foo"
irb(main):011:0> h.each {|k,v| puts k.object_id, v.object_id}
137705420
137645970
=> {"foo"=>"foo"}

Kind regards

robert

Ryan D. wrote:

I’m confused. If you know that the key is going to be frozen anyways,
why freeze it?

The implicit freeze from Hash#[]= duplicates the string and freezes
the duplicate, not the same object given by the user.

Explicitly freezing the key before Hash#[]= prevents MRI[1] from
duplicating the string.

------------------------ freeze_example.rb --------------------
h = {}
frozen = "foo".freeze
h[frozen] = true

# explicitly freezing the key means the String object is stored as-is
p [ :frozen_original_key, frozen.object_id ]
p [ :frozen_key_after_aset, h.keys[0].object_id ]

h = {}
not_frozen = "foo"
h[not_frozen] = true

# Not freezing means the key stored in the hash key is a different
# object than the one provided by the user.
p [ :not_frozen_original_key, not_frozen.object_id ]
p [ :not_frozen_key_after_aset, h.keys[0].object_id ]

------------------------ Output --------------------------------
[:frozen_original_key, 70096844693360]
[:frozen_key_after_aset, 70096844693360]
[:not_frozen_original_key, 70096844693120]
[:not_frozen_key_after_aset, 70096844694120]

[1] - Verified by reading rb_hash_aset() in hash.c which eventually
calls rb_str_new_frozen() in string.c (ruby/trunk):

rb_str_new_frozen(VALUE orig)
{
    VALUE klass, str;

    if (OBJ_FROZEN(orig)) return orig;

    ...
