Searching Files

nko · May 2, 2006, 11:16pm

I’m trying to write a simple script to go through all the files in a
specific directory and search for search terms. You should,
hypothetically, be able to run:

search search_term

and find all the applicable results. Here’s my code:

path = "/home/nathano/notes"
notes_dir = Dir.new(path).to_a - ['.', '..']
matches = []

matches.uniq.each do |x|
puts x
end

It seems to actually work on rare occasion, but obviously it isn’t
working as expected. I’ve also tried converting each file to a massive
string (readlines.to_s) and "include?"ing that string, but to no avail.

Insights?

nko · May 3, 2006, 2:27pm

Nathan O. wrote:

I’m trying to write a simple script to go through all the files in a
specific directory and search for search terms. You should,
hypothetically, be able to run:

search search_term

Drop this in a file and run it. It should be fairly easy to adapt this
to what you want.

require 'find'

search_term = /pear/

Find.find("./") do |path|
  next unless File.file?(path)
  File.open(path) do |file|
    lineno = 0
    file.each do |line|
      lineno += 1
      if line =~ search_term
        puts "#{path}:#{lineno} " + line
      end
    end
  end
end

__END__
pear
apple
orange
banana
kiwi
mango

nko · May 3, 2006, 5:49pm

Michael G. wrote:

require 'find'

search_term = /pear/

Find.find("./") do |path|
  next unless File.file?(path)
  File.open(path) do |file|
    lineno = 0
    file.each do |line|
      lineno += 1
      if line =~ search_term
        puts "#{path}:#{lineno} " + line
      end
    end
  end
end

I’ve played with this and it seems to have the same problem as my
original code: it will only find maybe 10% of the results I expect (so
far only one result in either implementation, no matter the search). In
this case, if I hard-code the search_term, then the very script I’m
running is the only file that will ever produce a result (as it is in
the directory I’m traversing).

Sorry if I’ve ignored any other replies. I’m using the ruby-forum
interface, which seems to have been down for some time yesterday.

nko · May 3, 2006, 7:03pm

On May 2, 2006, at 5:16 PM, Nathan O. wrote:

matches = []
end

Insights?

–
Posted via http://www.ruby-forum.com/.

Have you seen Find.find?
http://www.ruby-doc.org/stdlib/libdoc/find/rdoc/classes/Find.html

nko · May 4, 2006, 2:01am

Dave H. wrote:

That strongly implies that the problem is with your search terms, not
your code. Or to put it another way, that what you think is in those
files, isn’t actually there.

For example, if you try to search for something like /this phrase/ in a
file that’s been saved as UTF-16 and your searching program doesn’t
know that, then you won’t find that phrase, because it doesn’t exist.
In the file, it would be more like ?t?h?i?s? ?p?h?r?a?s?e where the ?'s
represent an ASCII “NUL” character, 00.

If you’re trying to search Word files, or RTF files, then there’s all
sorts of stuff that might be stored between the items of your search.

So what, exactly, are you trying to search for, and are you positive
that it’s actually in the files?

Excellent, excellent question!

Some of these files definitely have non-ASCII encodings. This… is kind
of frustrating. cat will show the contents of the files, vim will edit
the files… what do I do in my code to deal with this? Not every file
is of the same encoding.

nko · May 4, 2006, 1:45am

On May 3, 2006, at 8:49, Nathan O. wrote:

  lineno += 1
  if line =~ search_term
    puts "#{path}:#{lineno} " + line
  end
end
end
end
I’ve played with this and it seems to have the same problem as my
original code: it will only find maybe 10% of the results I expect (so
far only one result in either implementation, no matter the search).

That strongly implies that the problem is with your search terms, not
your code. Or to put it another way, that what you think is in those
files, isn’t actually there.

For example, if you try to search for something like /this phrase/ in a
file that’s been saved as UTF-16 and your searching program doesn’t
know that, then you won’t find that phrase, because it doesn’t exist.
In the file, it would be more like ?t?h?i?s? ?p?h?r?a?s?e where the ?'s
represent an ASCII “NUL” character, 00.

If you’re trying to search Word files, or RTF files, then there’s all
sorts of stuff that might be stored between the items of your search.

So what, exactly, are you trying to search for, and are you positive
that it’s actually in the files?
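
The effect is easy to reproduce. This is a minimal sketch, assuming a modern Ruby (2.0 or later, where String#encode and String#b exist; neither was available when this thread was written). The helper name utf16_bytes is hypothetical.

```ruby
# Return the raw bytes of `text` as they would be stored in a UTF-16
# (big-endian, no BOM) file. `utf16_bytes` is a hypothetical helper name.
def utf16_bytes(text)
  text.encode("UTF-16BE").b
end

stored = utf16_bytes("this phrase")

# Each ASCII character is preceded by a NUL byte.
p stored.bytes.first(4)                         # => [0, 116, 0, 104]

# A byte-oriented search for the plain phrase finds nothing...
p stored.include?("this phrase")                # => false

# ...but a search for the UTF-16 form of the phrase succeeds.
p stored.include?(utf16_bytes("this phrase"))   # => true
```

So a search that "only sometimes works" may simply be hitting the files that happen to be stored as plain ASCII.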

nko · May 4, 2006, 6:00am

Logan C. wrote:

Assuming you know the encodings of the files, you may want to look
into iconv.
Also check out $KCODE, or the -K command line option
ri Iconv
ri Iconv::iconv
ri Iconv#iconv

I’ll have to check that out. Thanks!

What system are you on that ‘cat’ is so smart? (Maybe cat isn’t so
smart, the encodings may be supersets of ASCII although that makes me
wonder why the ruby code doesn’t work) Alternatively, what are the
encodings of the files?

OpenBSD 3.8. The particular file I’m looking at as my main test subject
was saved with TextEdit on MacOS X 10.2. In a funny twist, here’s what
the file shell command, run from BASH on OpenBSD 3.8, reports about the
file in question:

2.txt: MPEG 1.0 Layer I, 128 kbit/s, ? Hz stereo

nko · May 4, 2006, 3:17am

On May 3, 2006, at 8:01 PM, Nathan O. wrote:

Excellent, excellent question!

Some of these files definitely have non-ASCII encodings. This… is kind
of frustrating. cat will show the contents of the files, vim will edit
the files… what do I do in my code to deal with this? Not every file
is of the same encoding.

Assuming you know the encodings of the files, you may want to look
into iconv.
Also check out $KCODE, or the -K command line option
ri Iconv
ri Iconv::iconv
ri Iconv#iconv

What system are you on that ‘cat’ is so smart? (Maybe cat isn’t so
smart, the encodings may be supersets of ASCII although that makes me
wonder why the ruby code doesn’t work) Alternatively, what are the
encodings of the files?
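
For readers on a current Ruby: Iconv was removed from the standard library in Ruby 2.0, and String#encode covers the same ground. The following is only a sketch under that assumption. The helper names to_utf8 and matching_lines are hypothetical, and the source encoding still has to be known and supplied by the caller.

```ruby
# Convert raw file bytes from a known source encoding to UTF-8 so that
# ordinary regexps can be matched against them. `to_utf8` is a
# hypothetical helper; the encoding must be supplied by the caller.
def to_utf8(raw_bytes, source_encoding)
  raw_bytes.dup.force_encoding(source_encoding).encode("UTF-8")
end

# Return [line_number, line] pairs for the lines matching `pattern`.
def matching_lines(raw_bytes, source_encoding, pattern)
  to_utf8(raw_bytes, source_encoding)
    .each_line
    .with_index(1)
    .select { |line, _n| line =~ pattern }
    .map { |line, n| [n, line.chomp] }
end

# Simulate the contents of a UTF-16 (big-endian) file in memory.
raw = "Hello darkness\nHello world!\n".encode("UTF-16BE").b
matching_lines(raw, "UTF-16BE", /world/).each { |n, line| puts "#{n}:#{line}" }
# prints "2:Hello world!"
```

For a real file, the bytes could come from File.binread(path) instead of the in-memory sample.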

nko · May 4, 2006, 7:20am

On May 4, 2006, at 12:00 AM, Nathan O. wrote:

What system are you on that ‘cat’ is so smart? (Maybe cat isn’t so

2.txt: MPEG 1.0 Layer I, 128 kbit/s, ? Hz stereo

Aha, then I’m going to guess that the encoding is probably Western
(Mac), which if I am not mistaken is a variation on ISO-8859-1.
Although the response from file is interesting. Is it possible it was
saved as UTF16?

nko · May 4, 2006, 6:52pm

Logan C. wrote:

On May 4, 2006, at 12:00 AM, Nathan O. wrote:

What system are you on that ‘cat’ is so smart? (Maybe cat isn’t so

2.txt: MPEG 1.0 Layer I, 128 kbit/s, ? Hz stereo

Aha, then I’m going to guess that the encoding is probably Western
(Mac), which if I am not mistaken is a variation on ISO-8859-1.
Although the response from file is interesting. Is it possible it was
saved as UTF16?

Anything is possible. I just write things in text editors and have
this irrational expectation that the text will be immediately usable!

running “head 2.txt” shows that there’s a character or two of
gobbledygook at the start of the file, which I’m guessing is some
indication of the character set used.

Hmm. Maybe I’ll instead grep through the output of “cat #{filename}”.

nko · May 4, 2006, 7:37pm

Logan C. wrote:

Here’s an example
% cat test.txt
þÿHello darkness my old friend, I’ve come to talk to you again.
What’s new pussy-cat?
Hello world!

Mine comes out exactly the same, even through cat. I just never noticed
the first characters before.

As you can see I saved this file as UTF-16. You can also see that my
cat isn’t quite as smart as yours, we see the BOM at the beginning.
The next step is to write a ruby script that can handle this:

Sadly, this will only handle utf-16 encoded files, it can’t even
handle utf-8.

Here’s the code I’ve decided I’m happy with:

#!/usr/bin/env ruby

search_term = /#{ARGV[0]}/
notes_dir = Dir.new(".").to_a - ['.', '..']
positive_results = []

notes_dir.each do |note|
  # Backticks capture the output of the shell command.
  fl = `cat "#{note}"`
  if fl =~ search_term
    positive_results.push(note)
  end
end

# Print each matching file name wrapped in double quotes.
positive_results.uniq.each do |x|
  puts "\"#{x}\""
end

The search script is in the directory I want to traverse (~/notes). I
just want to get the names of files that contain the search terms. From
there, I can pipe the output to another script.

Come to think of it, I’m still only checking against ARGV[0] as a search
term. I should be iterating through ARGV. Easy fix.
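
One way the multi-term check might look, as a rough sketch only. It assumes Ruby 2.4 or later for String#match?, and the helper names any_term_pattern and all_terms? are hypothetical. Whether a file should match any term or every term is a design choice the thread does not settle, so both are shown.

```ruby
# Build one pattern that matches any of the given terms literally.
# Regexp.union escapes regexp metacharacters in string arguments.
def any_term_pattern(terms)
  Regexp.union(terms)
end

# True only if every term occurs somewhere in `text`.
def all_terms?(text, terms)
  terms.all? { |term| text.include?(term) }
end

terms = ["pear", "kiwi"]          # stands in for ARGV
text  = "apple\npear\nbanana\n"   # stands in for one file's contents

p text.match?(any_term_pattern(terms))   # => true
p all_terms?(text, terms)                # => false
```

Unlike interpolating ARGV[0] into a regexp literal, Regexp.union treats each term as literal text, so a term such as "a.b" will not match "axb".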

Detecting utf-16 or ascii isn’t so bad, if you know for sure the
utf-16 will have a BOM, you just have to look for it. (It’s going to
be either 0xFEFF or 0xFFFE). On the other hand if you have to handle
more than just utf-16 and ascii, things are going to get confusing
quick, it’s difficult to detect the proper encoding of a file,
especially since so many encodings are supersets of ascii.

I’ll just let cat do that for me

nko · May 4, 2006, 7:27pm

On May 4, 2006, at 12:52 PM, Nathan O. wrote:

indication of the character set used.

Hmm. Maybe I’ll instead grep through the output of “cat #{filename}”.

Yeah, this makes me think it’s UTF-16 with a BOM (byte-order mark).

Here’s an example
% cat test.txt
þÿHello darkness my old friend, I’ve come to talk to you again.
What’s new pussy-cat?
Hello world!

As you can see I saved this file as UTF-16. You can also see that my
cat isn’t quite as smart as yours, we see the BOM at the beginning.
The next step is to write a ruby script that can handle this:

% cat text_search.rb
require 'iconv'
$KCODE = 'u'
pattern = Regexp.new(ARGV.shift)
convertor = Iconv.new('utf-8', 'utf-16')
begin
  ARGF.each do |line|
    out = convertor.iconv(line)
    if pattern =~ out
      puts "#{ARGF.lineno}:#{out}"
    end
  end
ensure
  convertor.close
end

Sadly, this will only handle utf-16 encoded files, it can’t even
handle utf-8.

Here’s some examples of it in use:
% ruby text_search.rb talk test.txt
1:Hello darkness my old friend, I’ve come to talk to you again.

% ruby text_search.rb Hello test.txt
1:Hello darkness my old friend, I’ve come to talk to you again.
3:Hello world!

Detecting utf-16 or ascii isn’t so bad, if you know for sure the
utf-16 will have a BOM, you just have to look for it. (It’s going to
be either 0xFEFF or 0xFFFE). On the other hand if you have to handle
more than just utf-16 and ascii, things are going to get confusing
quick, it’s difficult to detect the proper encoding of a file,
especially since so many encodings are supersets of ascii.
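
The BOM check described above could be sketched roughly like this. It assumes a modern Ruby (2.0 or later) rather than the 1.8-era tools used elsewhere in the thread, and the helper names sniff_utf16_bom and decode_text are hypothetical. Files without a BOM are simply assumed to be ASCII or UTF-8, which is exactly the part that gets hard in practice.

```ruby
# Guess an encoding from a leading UTF-16 byte-order mark.
# Returns nil when no BOM is present; the caller decides what to assume.
def sniff_utf16_bom(raw_bytes)
  case raw_bytes.b.byteslice(0, 2)
  when "\xFE\xFF".b then "UTF-16BE"
  when "\xFF\xFE".b then "UTF-16LE"
  end
end

# Decode to UTF-8, dropping the BOM if there was one. Files without a BOM
# are assumed (not verified) to be ASCII/UTF-8 already.
def decode_text(raw_bytes)
  encoding = sniff_utf16_bom(raw_bytes)
  return raw_bytes.dup.force_encoding("UTF-8") unless encoding

  raw_bytes.b.byteslice(2..-1).force_encoding(encoding).encode("UTF-8")
end

# Simulate a big-endian UTF-16 file with a BOM.
sample = "\xFE\xFF".b + "Hello world!".encode("UTF-16BE").b
puts decode_text(sample)   # prints "Hello world!"
```

The decoded string can then be searched with an ordinary regexp, whichever of the two encodings the file started in.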

nko · May 4, 2006, 8:07pm

On May 4, 2006, at 1:37 PM, Nathan O. wrote:

I’ll just let cat do that for me

The problem with that is that cat isn’t really doing anything, and as
soon as someone saves a multi-byte character to that file, all hell
is going to break loose. cat is doing something along the lines of

while(line = getline() ) {
  for(i = 0; i < length(line); i++) {
    if isprint(line[i]) {
      print line[i]
    }
  }
}

In the case where the text happens to be all single-byte characters,
it will skip the nulls. If the source text contains non-English
characters, etc., those bytes won’t just be nulls any more, and if one
of them is something printable (like the BOM at the beginning of the
file, for instance) it’s going to create the wrong output.
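
To illustrate the difference, here is a small sketch that only mimics the visible effect of discarding NUL bytes. It makes no claim about how cat is actually implemented, it assumes Ruby 2.0 or later, and drop_nuls is a hypothetical helper name.

```ruby
# Mimic the visible effect of discarding NUL bytes from UTF-16 data.
def drop_nuls(raw_bytes)
  raw_bytes.b.delete("\0")
end

english  = "talk".encode("UTF-16BE").b
japanese = "\u65E5\u672C".encode("UTF-16BE").b   # two non-ASCII characters

# ASCII-range text happens to come out readable...
p drop_nuls(english)    # => "talk"

# ...but other characters have no NUL bytes to drop and turn into garbage.
p drop_nuls(japanese)   # => "e\xE5g,"
```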

nko · May 4, 2006, 8:31pm

Logan C. wrote:

The problem with that is that cat isn’t really doing anything, and as
soon as someone saves a multi-byte character to that file, all hell
is going to break loose. cat is doing something along the lines of

while(line = getline() ) {
  for(i = 0; i < length(line); i++) {
    if isprint(line[i]) {
      print line[i]
    }
  }
}

In the case where the text happens to be all single-byte characters,
it will skip the nulls. If the source text contains non-English
characters, etc., those bytes won’t just be nulls any more, and if one
of them is something printable (like the BOM at the beginning of the
file, for instance) it’s going to create the wrong output.

Shoot, you’re right… this is weird. Using cat straight from the
command line produces text I can read, but searching through that output
with my script is broken. How that works, I’ll never know.

UTF16 certainly isn’t the only encoding I expect to see if I’m going to
be flexible in which text editors I use. I don’t like giving up, but
meh, it just isn’t that big of a deal. Thank you very much for the help,
though! I didn’t even know about iconv before!