On May 4, 2006, at 12:52 PM, Nathan O. wrote:
indication of the character set used.
Hmm. Maybe I’ll instead grep through the output of “cat #{filename}”.
–
Posted via http://www.ruby-forum.com/.
Yeah, this makes me think it’s UTF-16 with a BOM (byte order mark).
Here’s an example:
% cat test.txt
þÿHello darkness my old friend, I’ve come to talk to you again.
What’s new pussy-cat?
Hello world!
As you can see, I saved this file as UTF-16. You can also see that my
cat isn’t quite as smart as yours: it shows the BOM at the beginning.
The next step is to write a Ruby script that can handle this:
% cat text_search.rb
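require 'iconv'
$KCODE = 'u'
# The first command-line argument is the pattern; the rest are files for ARGF.
pattern = Regexp.new(ARGV.shift)
# Iconv.new takes (to, from): convert UTF-16 input to UTF-8.
convertor = Iconv.new('utf-8', 'utf-16')
begin
  ARGF.each do |line|
    out = convertor.iconv(line)
    if pattern =~ out
      puts "#{ARGF.lineno}:#{out}"
    end
  end
ensure
  # Always release the converter, even if the conversion raises.
  convertor.close
end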
Sadly, this will only handle UTF-16 encoded files; it can’t even
handle UTF-8.
Here are some examples of it in use:
% ruby text_search.rb talk test.txt
1:Hello darkness my old friend, I’ve come to talk to you again.
% ruby text_search.rb Hello test.txt
1:Hello darkness my old friend, I’ve come to talk to you again.
3:Hello world!
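The script above depends on the iconv library, which was removed from Ruby’s standard library in 2.0. On Ruby 1.9 or later, roughly the same search could be sketched with String#encode. This is only a sketch under assumptions: the name search_utf16 is hypothetical, the file must start with a BOM, and the whole file is read into memory.

```ruby
# Hypothetical helper, not part of the original script.
# Returns the matching lines as "lineno:text" strings, converted to UTF-8.
def search_utf16(path, pattern)
  raw = File.open(path, 'rb') { |f| f.read }

  # Pick the byte order from the first two bytes (the BOM).
  source = case raw.unpack('C2')
           when [0xFE, 0xFF] then 'UTF-16BE'
           when [0xFF, 0xFE] then 'UTF-16LE'
           else raise ArgumentError, "#{path}: no UTF-16 BOM found"
           end

  # Drop the two BOM bytes, then transcode the remainder to UTF-8.
  text = raw[2..-1].force_encoding(source).encode('UTF-8')

  hits = []
  text.each_line.with_index(1) do |line, number|
    hits << "#{number}:#{line.chomp}" if line =~ pattern
  end
  hits
end
```

Something like puts search_utf16('test.txt', /Hello/) would then print the matching lines in the same lineno:text form as the runs above.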
Detecting UTF-16 versus ASCII isn’t so bad: if you know for sure the
UTF-16 will have a BOM, you just have to look for it. (The first two
bytes of the file will be either 0xFE 0xFF for big-endian or 0xFF 0xFE
for little-endian.) On the other hand, if you have to handle more than
just UTF-16 and ASCII, things are going to get confusing quickly. It’s
difficult to detect the proper encoding of a file, especially since so
many encodings are supersets of ASCII.
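To make the easy case concrete, here is a minimal sketch of the BOM check. The name guess_encoding is made up for illustration, and the sketch only tells UTF-16 with a BOM apart from plain ASCII. It does not look for UTF-32, whose little-endian BOM starts with the same two bytes, and it gives up (returns nil) on anything else.

```ruby
# Hypothetical helper: returns 'UTF-16BE', 'UTF-16LE', 'US-ASCII', or nil.
def guess_encoding(path)
  raw = File.open(path, 'rb') { |f| f.read }

  # A UTF-16 BOM shows up as the first two bytes of the file.
  case raw.unpack('C2')
  when [0xFE, 0xFF] then return 'UTF-16BE'
  when [0xFF, 0xFE] then return 'UTF-16LE'
  end

  # No BOM: call it ASCII only if every byte is below 0x80.
  # (An empty file also lands here and is reported as ASCII.)
  raw.unpack('C*').all? { |byte| byte < 0x80 } ? 'US-ASCII' : nil
end
```

A nil result is exactly the confusing case: the bytes could be UTF-8, Latin-1, or any other superset of ASCII, and a two-byte check cannot tell them apart.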