Forum: Ruby efficient regex scanning

Trochalakis Christos (Guest)
on 2007-06-06 12:56
(Received via mailing list)
Hello there,

I want to extract all the words from a file, so I wrote the
following code:

file = ARGV[0]
File.open('output','w') {|f|
        IO.read(file).scan(/\w+/).each{|w| f.print w}
}

The problem with this code is that it stores all the words in an array
which is not so good in terms of efficiency.
Is there a better way to do it?
Something like IO.read(file).each_scan { foo }

Thanks
Christos
Ola Bini (Guest)
on 2007-06-06 13:01
(Received via mailing list)
Trochalakis Christos wrote:
> The problem with this code is that it stores all the words in an array
> which is not so good in terms of efficiency.
> Is there a better way to do it?
> Something like IO.read(file).each_scan { foo }
>
> Thanks
> Christos
Scan takes a block form:

ri String.scan


        IO.read(file).scan(/\w+/) {|w| f.print w}
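
To make the difference concrete, here is a minimal sketch comparing the two forms of String#scan on a small string. The variable names are illustrative only and do not come from the thread.

```ruby
text = "the quick brown fox"

# Array form: every match is collected into an array before anything
# else happens.
array_words = text.scan(/\w+/)

# Block form: each match is yielded as soon as it is found, so no
# array of matches is built by scan itself.
block_words = []
result = text.scan(/\w+/) { |w| block_words << w }

# With a block, scan returns the string it was called on, not an array.
puts result
```

Note that IO.read(file) still loads the whole file into memory as one string; the block form only avoids the intermediate array of words.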


Cheers

--
 Ola Bini (http://ola-bini.blogspot.com)
 JRuby Core Developer
 Developer, ThoughtWorks Studios (http://studios.thoughtworks.com)

 "Yields falsehood when quined" yields falsehood when quined.
Daniel Lucraft (lucraft)
on 2007-06-06 13:08
Trochalakis Christos wrote:
> Hello there,
>
> The problem with this code is that it stores all the words in an array
> which is not so good in terms of efficiency.
> Is there a better way to do it?
> Something like IO.read(file).each_scan { foo }
>
> Thanks
> Christos

Does just using a block with scan do what you need?

IO.read(file).scan(/\w+/) { |word| f.print word }

http://www.ruby-doc.org/core/classes/String.html#M000827

best,
Dan
unknown (Guest)
on 2007-06-06 13:10
(Received via mailing list)
Hi --

On Wed, 6 Jun 2007, Trochalakis Christos wrote:

> The problem with this code is that it stores all the words in an array
> which is not so good in terms of efficiency.
> Is there a better way to do it?
> Something like IO.read(file).each_scan { foo }

You could do something like this (untested, and reversing your logic
somewhat):

   File.open(file).each {|line| f.print(line.scan(/\w+/)) }

(You might want to join them with a space or something so they don't
all run together.)
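
A fuller version of that idea might look like the sketch below. It assumes File.foreach for the line-by-line read (which closes the input when it finishes) and joins each line's words with a space. The helper name copy_words and the temporary files are made up for the example.

```ruby
require "tempfile"

# Hypothetical helper (not from the thread): writes the words found in
# in_path to out_path, reading the input one line at a time.
def copy_words(in_path, out_path)
  File.open(out_path, "w") do |out|
    # File.foreach yields one line at a time and closes the file afterwards.
    File.foreach(in_path) do |line|
      words = line.scan(/\w+/)
      # Join with a space so the words do not run together.
      out.puts(words.join(" ")) unless words.empty?
    end
  end
end

# Small demonstration with temporary files.
input = Tempfile.new("input")
input.write("Hello, world!\n\nfoo-bar baz\n")
input.close
output = Tempfile.new("output")
output.close

copy_words(input.path, output.path)
result = File.read(output.path)
puts result
```

Only one line and its words are held in memory at a time, at the cost of never matching a pattern across a line break.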


David
Trochalakis Christos (Guest)
on 2007-06-06 13:25
(Received via mailing list)
On Jun 6, 2:00 pm, Ola Bini <ola.b...@gmail.com> wrote:
>
> ri String.scan
>
>         IO.read(file).scan(/\w+/) {|w| f.print w}
>
> Cheers

Thanks a lot!
I suppose I should have checked first :)
Robert Klemme (Guest)
on 2007-06-06 14:22
(Received via mailing list)
On 06.06.2007 13:08, dblack@wobblini.net wrote:
>> File.open('output','w') {|f|
>
>   File.open(file).each {|line| f.print(line.scan(/\w+/)) }
>
> (You might want to join them with a space or something so they don't
> all run together.)

You're not closing the IO.  I know it's not an issue for a small script
but...

I'd do this:

ARGF.each {|line| puts line.scan(/\w+/) }

:-)

Kind regards

  robert
unknown (Guest)
on 2007-06-06 14:28
(Received via mailing list)
Hi --

On Wed, 6 Jun 2007, Robert Klemme wrote:

>>> file = ARGV[0]
>> somewhat):
>>
>>   File.open(file).each {|line| f.print(line.scan(/\w+/)) }
>>
>> (You might want to join them with a space or something so they don't
>> all run together.)
>
> You're not closing the IO.  I know it's not an issue for a small script
> but...

It's not a complete script; I was only showing one line.  At the very
least it's not going to run unless you assign something to f :-)


David
Joel VanderWerf (Guest)
on 2007-06-06 19:52
(Received via mailing list)
Trochalakis Christos wrote:
> The problem with this code is that it stores all the words in an array
> which is not so good in terms of efficiency.
> Is there a better way to do it?
> Something like IO.read(file).each_scan { foo }

Here's a thought. Note that it doesn't handle //m regexen. Like David's
and Robert's solutions, it doesn't read the whole file at once. (I guess
one could check for pat.options&Regexp::MULTILINE, and read the whole IO
in that case.)

class IO
   def scan pat
     if block_given?
       each {|line| line.scan(pat) {|s| yield s} }
     else
       read.scan(pat)
     end
   end
end

File.open(filename) do |f|
   f.scan(/\w+/) {|word| puts word}
end
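
The MULTILINE check mentioned above could be sketched roughly as follows. This is an untested-in-production illustration, written as a standalone method rather than a patch to IO. The name scan_io and the sample strings are assumptions made for the example.

```ruby
require "stringio"

# Hypothetical helper (not from the thread): scans io for pat and
# yields each match to the block.
def scan_io(io, pat)
  return io.read.scan(pat) unless block_given?

  if (pat.options & Regexp::MULTILINE) != 0
    # A //m pattern may match across newlines, so read everything first.
    io.read.scan(pat) { |s| yield s }
  else
    # Otherwise scan one line at a time to keep memory use low.
    io.each_line { |line| line.scan(pat) { |s| yield s } }
  end
end

words = []
scan_io(StringIO.new("one two\nthree\n"), /\w+/) { |w| words << w }

spans = []
scan_io(StringIO.new("<a\nb> <c>"), /<.*?>/m) { |s| spans << s }
```

The flag is only a heuristic: a pattern without //m can still span lines (for example one that uses \s+ or a negated character class), and such a pattern would be scanned line by line here.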