Forum: Ruby Searching Files

Nathan O. (Guest)
on 2006-05-03 01:16
I'm trying to write a simple script to go through all the files in a
specific directory and search for search terms. You should,
hypothetically, be able to run:

search search_term

and find all the applicable results. Here's my code:

path = "/home/nathano/notes"
notes_dir = Dir.new(path).to_a - ['.', '..']
matches = []

notes_dir.each do |note|
	ARGV.each do |pattern|
		file_contents = open(path + "/" + note, "r")
		file_contents.each do |line|
			if line.include?(pattern.to_s)
				matches.push(note)
			end
		end
	end
end

matches.uniq.each do |x|
	puts x
end


It seems to actually work on rare occasion, but obviously it isn't
working as expected. I've also tried converting each file to a massive
string (readlines.to_s) and "include?"ing that string, but to no avail.

Insights?
Michael G. (Guest)
on 2006-05-03 16:27
Nathan O. wrote:
> I'm trying to write a simple script to go through all the files in a
> specific directory and search for search terms. You should,
> hypothetically, be able to run:
>
> search search_term
>


#
# drop this in a file and run it... it should be fairly easy to adapt this
# to what you want
#
require 'find'

search_term = /pear/

Find.find("./") do |path|
  next unless File.file?(path)
  File.open(path) do |file|
    lineno = 0
    file.each do |line|
      lineno += 1
      if line =~ search_term
        puts "#{path}:#{lineno} " + line
      end
    end
  end
end

__END__
pear
apple
orange
banana
kiwi
mango
Nathan O. (Guest)
on 2006-05-03 19:49
Michael G. wrote:
> require 'find'
>
> search_term = /pear/
>
> Find.find("./") do |path|
>   next unless File.file?(path)
>   File.open(path) do |file|
>     lineno = 0
>     file.each do |line|
>       lineno += 1
>       if line =~ search_term
>         puts "#{path}:#{lineno} " + line
>       end
>     end
>   end
> end

I've played with this and it seems to have the same problem as my
original code: it will only find maybe 10% of the results I expect (so
far only one result in either implementation, no matter the search). In
this case, if I hard-code the search_term, then the very script I'm
running is the only file that will ever produce a result (as it is in
the directory I'm traversing).

Sorry if I've ignored any other replies. I'm using the ruby-forum
interface, which seems to have been down for some time yesterday.
Logan C. (Guest)
on 2006-05-03 21:03
(Received via mailing list)
On May 2, 2006, at 5:16 PM, Nathan O. wrote:

> matches = []
> end
>
> Insights?

Have you seen Find.find?
http://www.ruby-doc.org/stdlib/libdoc/find/rdoc/cl...
Dave H. (Guest)
on 2006-05-04 03:45
(Received via mailing list)
On May 3, 2006, at 8:49, Nathan O. wrote:

>>       lineno += 1
>>       if line =~ search_term
>>         puts "#{path}:#{lineno} " + line
>>       end
>>     end
>>   end
>> end
>
> I've played with this and it seems to have the same problem as my
> original code: it will only find maybe 10% of the results I expect (so
> far only one result in either implementation, no matter the search).

That strongly implies that the problem is with your search terms, not
your code. Or to put it another way, that what you think is in those
files, isn't actually there.

For example, if you try to search for something like /this phrase/ in a
file that's been saved as UTF-16 and your searching program doesn't
know that, then you won't find that phrase, because it doesn't exist.
In the file, it would be more like ?t?h?i?s? ?p?h?r?a?s?e where the ?'s
represent an ASCII "NUL" character, 00.
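
To make that concrete, here is a small sketch in present-day Ruby (1.9 or later, where strings carry an encoding, which the Ruby of this thread did not have). The helper names `utf16_bytes` and `decode_utf16` and the sample phrase are made up for illustration, and big-endian UTF-16 is an assumption.

```ruby
# Encode text as UTF-16 (big-endian) and return the raw bytes.
def utf16_bytes(text)
  text.encode("UTF-16BE").b
end

# Decode raw UTF-16BE bytes back into a UTF-8 string.
def decode_utf16(bytes)
  bytes.dup.force_encoding("UTF-16BE").encode("UTF-8")
end

phrase = "this phrase"
raw = utf16_bytes(phrase)

p raw                                 # each letter is preceded by a NUL byte
p raw.include?(phrase.b)              # byte-level search for the plain phrase
p decode_utf16(raw).include?(phrase)  # the same search after decoding
```

The byte-level search comes back false and the decoded search comes back true, which is the same symptom as a script that only ever matches the files that happen to be plain ASCII.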

If you're trying to search Word files, or RTF files, then there's all
sorts of stuff that might be stored between the items of your search.

So what, exactly, are you trying to search for, and are you positive
that it's actually in the files?
Nathan O. (Guest)
on 2006-05-04 04:01
Dave H. wrote:
> That strongly implies that the problem is with your search terms, not
> your code. Or to put it another way, that what you think is in those
> files, isn't actually there.
>
> For example, if you try to search for something like /this phrase/ in a
> file that's been saved as UTF-16 and your searching program doesn't
> know that, then you won't find that phrase, because it doesn't exist.
> In the file, it would be more like ?t?h?i?s? ?p?h?r?a?s?e where the ?'s
> represent an ASCII "NUL" character, 00.
>
> If you're trying to search Word files, or RTF files, then there's all
> sorts of stuff that might be stored between the items of your search.
>
> So what, exactly, are you trying to search for, and are you positive
> that it's actually in the files?

Excellent, excellent question!

Some of these files definitely have non-ASCII encodings. This... is kind
of frustrating. cat will show the contents of the files, vim will edit
the files... what do I do in my code to deal with this? Not every file
is of the same encoding.
Logan C. (Guest)
on 2006-05-04 05:17
(Received via mailing list)
On May 3, 2006, at 8:01 PM, Nathan O. wrote:

> Excellent, excellent question!
>
> Some of these files definitely have non-ASCII encodings. This... is
> kind
> of frustrating. cat will show the contents of the files, vim will edit
> the files... what do I do in my code to deal with this? Not every file
> is of the same encoding.

Assuming you know the encodings of the files, you may want to look
into iconv.
Also check out $KCODE, or the -K command line option
ri Iconv
ri Iconv::iconv
ri Iconv#iconv

What system are you on that 'cat' is so smart? (Maybe cat isn't so
smart; the encodings may be supersets of ASCII, although that makes me
wonder why the Ruby code doesn't work.) Alternatively, what are the
encodings of the files?
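
A note for readers on current Ruby: Iconv was dropped from the standard library in Ruby 2.0, and $KCODE has had no effect since 1.9. The usual replacement is String#encode. What follows is a minimal sketch of that approach, assuming you already know the source encoding of the data. The helper name `to_utf8` and the sample bytes are hypothetical.

```ruby
# Convert raw bytes in a known encoding to a UTF-8 string.
# invalid:/undef: substitute a replacement character for anything
# that cannot be converted, instead of raising an exception.
def to_utf8(bytes, source_encoding)
  bytes.dup.force_encoding(source_encoding)
       .encode("UTF-8", invalid: :replace, undef: :replace)
end

latin1 = "caf\xE9".b       # "café" as ISO-8859-1 bytes
utf16  = "\x00h\x00i".b    # "hi" as UTF-16BE bytes, no BOM

puts to_utf8(latin1, "ISO-8859-1")
puts to_utf8(utf16, "UTF-16BE")
```

This only helps once the encoding is known; guessing it per file is a separate problem.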
Nathan O. (Guest)
on 2006-05-04 08:00
Logan C. wrote:
> Assuming you know the encodings of the files, you may want to look
> into iconv.
> Also check out $KCODE, or the -K command line option
> ri Iconv
> ri Iconv::iconv
> ri Iconv#iconv

I'll have to check that out. Thanks!

> What system are you on that 'cat' is so smart? (Maybe cat isn't so
> smart, the encodings may be supersets of ASCII although that makes me
> wonder why the ruby code doesn't work) Alternatively, what is the
> encodings of the files?

OpenBSD 3.8. The particular file I'm looking at as my main test subject
was saved with TextEdit on MacOS X 10.2. In a funny twist, here's what
the file shell command, run from BASH on OpenBSD 3.8, reports about the
file in question:

2.txt: MPEG 1.0 Layer I, 128 kbit/s, ? Hz stereo
Logan C. (Guest)
on 2006-05-04 09:20
(Received via mailing list)
On May 4, 2006, at 12:00 AM, Nathan O. wrote:

>> What system are you on that 'cat' is so smart? (Maybe cat isn't so
>
> 2.txt: MPEG 1.0 Layer I, 128 kbit/s, ? Hz stereo

Aha, then I'm going to guess that the encoding is probably Western
(Mac), which if I am not mistaken is a variation on ISO-8859-1.
Although the response from file is interesting. Is it possible it was
saved as UTF-16?
Nathan O. (Guest)
on 2006-05-04 20:52
Logan C. wrote:
> On May 4, 2006, at 12:00 AM, Nathan O. wrote:
>
>>> What system are you on that 'cat' is so smart? (Maybe cat isn't so
>>
>> 2.txt: MPEG 1.0 Layer I, 128 kbit/s, ? Hz stereo
>
> Aha, then I'm going to guess that the encoding is probably Western
> (Mac) which if I am not mistaken is a variation on ISO-8859-1.
> Although the response from file is interesting. Is it possible it was
> saved as UTF16?

Anything is possible :-) I just write things in text editors and have
this irrational expectation that the text will be immediately usable!

Running "head 2.txt" shows that there's a character or two of
gobbledygook at the start of the file, which I'm guessing is some
indication of the character set used.

Hmm. Maybe I'll instead grep through the output of "cat #{filename}".
Logan C. (Guest)
on 2006-05-04 21:27
(Received via mailing list)
On May 4, 2006, at 12:52 PM, Nathan O. wrote:

> running "head 2.txt" shows that there's a character or two of
> gobbledygook at the start of the file, which I'm guessing is some
> indication of the character set used.
>
> Hmm. Maybe I'll instead grep through the output of "cat #{filename}".

Yeah, this makes me think it's UTF-16 with a BOM (byte-order mark).

Here's an example:
% cat test.txt
þÿHello darkness my old friend,  I've come to talk to you again.
What's new pussy-cat?
Hello world!

As you can see, I saved this file as UTF-16. You can also see that my
cat isn't quite as smart as yours: we see the BOM at the beginning.
The next step is to write a Ruby script that can handle this:

% cat text_search.rb
require 'iconv'
$KCODE='u'
pattern = Regexp.new(ARGV.shift)
convertor = Iconv.new('utf-8', 'utf-16')
begin
   ARGF.each do |line|
     out = convertor.iconv(line)
     if pattern =~ out
       puts "#{ARGF.lineno}:#{out}"
     end
   end
ensure
   convertor.close
end

Sadly, this will _only_ handle UTF-16 encoded files; it can't even
handle UTF-8.

Here are some examples of it in use:
% ruby text_search.rb talk test.txt
1:Hello darkness my old friend,  I've come to talk to you again.

% ruby text_search.rb Hello test.txt
1:Hello darkness my old friend,  I've come to talk to you again.
3:Hello world!

Detecting UTF-16 or ASCII isn't so bad. If you know for sure the
UTF-16 will have a BOM, you just have to look for it. (It's going to
be either 0xFEFF or 0xFFFE.) On the other hand, if you have to handle
more than just UTF-16 and ASCII, things are going to get confusing
quickly. It's difficult to detect the proper encoding of a file,
especially since so many encodings are supersets of ASCII.
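
The BOM check described above could look roughly like this in current Ruby, without Iconv. This is a sketch under assumptions, not a general encoding detector: `decode_with_bom`, `matching_lines` and the file name `bom_search.rb` are hypothetical, and anything without a UTF-16 BOM is simply assumed to be UTF-8 or ASCII.

```ruby
# Decode raw file bytes to UTF-8, using a UTF-16 byte-order mark if present.
# Bytes without a UTF-16 BOM are ASSUMED to be UTF-8/ASCII (only a guess).
def decode_with_bom(bytes)
  bytes = bytes.b
  if bytes.start_with?("\xFE\xFF".b)        # big-endian BOM
    bytes[2..-1].force_encoding("UTF-16BE")
                .encode("UTF-8", invalid: :replace, undef: :replace)
  elsif bytes.start_with?("\xFF\xFE".b)     # little-endian BOM
    bytes[2..-1].force_encoding("UTF-16LE")
                .encode("UTF-8", invalid: :replace, undef: :replace)
  else
    bytes.force_encoding("UTF-8").scrub("?")
  end
end

# Return "lineno:line" strings for every decoded line matching the pattern.
def matching_lines(bytes, pattern)
  decode_with_bom(bytes).each_line.with_index(1)
                        .select { |line, _lineno| line =~ pattern }
                        .map { |line, lineno| "#{lineno}:#{line.chomp}" }
end

# Example usage: ruby bom_search.rb talk test.txt
if __FILE__ == $PROGRAM_NAME && ARGV.size >= 2
  pattern = Regexp.new(ARGV.shift)
  ARGV.each { |path| puts matching_lines(File.binread(path), pattern) }
end
```

Decoding the whole file before splitting it into lines sidesteps the problem of a line break landing in the middle of a two-byte character, which can happen when raw UTF-16 is split on the newline byte.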
Nathan O. (Guest)
on 2006-05-04 21:37
Logan C. wrote:
> Here's  an example
> % cat test.txt
> þÿHello darkness my old friend,  I've come to talk to you again.
> What's new pussy-cat?
> Hello world!

Mine comes out exactly the same, even through cat. I just never noticed
the first characters before.

> As you can see I saved this file as UTF-16. You can also see that my
> cat isn't quite as smart as yours, we see the BOM at the beginning.
> The next step is to write a ruby script that can handle this:

> Sadly, this will _only_ handle utf-16 encoded files, it can't even
> handle utf-8.

Here's the code I've decided I'm happy with:

#!/usr/bin/env ruby

search_term = /#{ARGV[0]}/
notes_dir = Dir.new(".").to_a - ['.', '..']
positive_results = []

notes_dir.each do |note|
        fl = `cat "#{note}"`
        if fl =~ search_term
                positive_results.push(note)
        end
end

positive_results.uniq.each do |x|
        puts "\"#{x}\""
end

The search script is in the directory I want to traverse (~/notes). I
just want to get the names of files that contain the search terms. From
there, I can pipe the output to another script.

Come to think of it, I'm still only checking against ARGV[0] as a search
term. I should be iterating through ARGV. Easy fix.
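
For what it's worth, the "iterate through ARGV" fix could be sketched like this, reading the files directly instead of shelling out to cat. The method name `notes_matching` is made up, and treating a file as a hit when it matches any one of the terms is an assumption (the thread does not say whether any or all terms should match). Files are read as UTF-8, so this still will not see through UTF-16.

```ruby
# Return the names of regular files in dir whose contents match
# at least one of the given search terms (each treated as a regex).
def notes_matching(dir, patterns)
  regex = Regexp.union(patterns.map { |p| Regexp.new(p) })
  Dir.children(dir).sort.select do |name|
    path = File.join(dir, name)
    next false unless File.file?(path)   # skip subdirectories and the like
    # scrub replaces invalid bytes so matching never raises on odd files
    File.read(path, encoding: "UTF-8").scrub.match?(regex)
  end
end

# Example usage: ruby search.rb term1 term2
if __FILE__ == $PROGRAM_NAME && !ARGV.empty?
  notes_matching(".", ARGV).each { |name| puts "\"#{name}\"" }
end
```

Each file is read once no matter how many terms are given, and file names never pass through a shell.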

>
> Detecting utf-16 or ascii isn't so bad, if you know for sure the
> utf-16 will have a BOM, you just have to look for it. (It's going to
> be either 0xFEFF or 0xFFFE). On the other hand  if you have to handle
> more than just  utf-16 and ascii, things are going to get confusing
> quick, it's difficult to detect the proper encoding of a file,
> especially since so many encodings are supersets of ascii.

I'll just let `cat` do that for me :-)
Logan C. (Guest)
on 2006-05-04 22:07
(Received via mailing list)
On May 4, 2006, at 1:37 PM, Nathan O. wrote:

> I'll just let `cat` do that for me :-)
>
The problem with that is that cat isn't really doing anything, and as
soon as someone saves a multi-byte character to that file, all hell
is going to break loose. cat is doing something along the lines of

while(line = getline() ) {
    for(i =  0; i < length(line); i++) {
      if isprint(line[i]) {
         print line[i]
      }
    }
}

which, in the case that it just happens to be single-byte characters,
will skip the nulls. If the source text contains non-English
characters, etc., those bytes won't just be nulls any more, and if one
is something printable (like the BOM at the beginning of the file, for
instance) it's going to create the wrong output.
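
A quick way to see the effect described here, sketched in current Ruby. It assumes the NUL bytes are simply dropped, as in the loop above, and makes no claim about what any particular cat or terminal really does. `strip_nuls` and the sample strings are hypothetical.

```ruby
# Drop every NUL byte from the raw data.
def strip_nuls(bytes)
  bytes.b.delete("\0")
end

ascii_only = "Hello world!".encode("UTF-16BE").b
accented   = "na\u00EFve caf\u00E9".encode("UTF-16BE").b   # "naïve café"

# ASCII-only text comes out readable once the NULs are gone...
p strip_nuls(ascii_only)

# ...but the accented text is left with stray high bytes that are
# not valid UTF-8, so it can no longer be displayed or matched reliably.
p strip_nuls(accented).force_encoding("UTF-8").valid_encoding?
```

So dropping NULs only looks like a fix while every character in the file happens to be ASCII.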
Nathan O. (Guest)
on 2006-05-04 22:31
Logan C. wrote:
> The problem with that is that cat isn't really doing anything, and as
> soon as someone saves a multi-byte character to that file, all hell
> is going to break loose. cat is doing something along the lines of
>
> while(line = getline() ) {
>     for(i =  0; i < length(line); i++) {
>       if isprint(line[i]) {
>          print line[i]
>       }
>     }
> }
>
> which in the case that it just happens to be single-byte characters
> it will skip the nulls. If the source text contains non-english
> characters, etc. those bytes won't just be nulls any more and if it
> is something printable (like the BOM at the beginning of the file for
> instance) it's going to create the wrong output.

Shoot, you're right... this is weird. Using cat straight from the
command line produces text I can read, but searching through that output
with my script is broken. How that works, I'll never know.

UTF16 certainly isn't the only encoding I expect to see if I'm going to
be flexible in which text editors I use. I don't like giving up, but
meh, it just isn't that big of a deal. Thank you very much for the help,
though! I didn't even know about iconv before!