Logan Capaldo wrote:
> Here's an example
> % cat test.txt
> ��Hello darkness my old friend,  I've come to talk to you again.
> What's new pussy-cat?
> Hello world!

Mine comes out exactly the same, even through cat. I just never noticed 
the first characters before.

> As you can see I saved this file as UTF-16. You can also see that my
> cat isn't quite as smart as yours, we see the BOM at the beginning.
> The next step is to write a ruby script that can handle this:

> Sadly, this will _only_ handle utf-16 encoded files, it can't even
> handle utf-8.

Here's the code I've decided I'm happy with:

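#!/usr/bin/env ruby

# Treat the first command-line argument as a regular expression.
search_term = /#{ARGV[0]}/
# Every entry in the current directory except '.' and '..'.
notes_dir = Dir.new(".").to_a - ['.', '..']
positive_results = []

notes_dir.each do |note|
  # Skip subdirectories; cat would only complain about them.
  next unless File.file?(note)
  # Read the file's contents by shelling out to cat.
  fl = `cat "#{note}"`
  positive_results.push(note) if fl =~ search_term
end

# Print each matching file name wrapped in double quotes.
positive_results.uniq.each do |x|
  puts "\"#{x}\""
end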

The search script is in the directory I want to traverse (~/notes). I 
just want to get the names of files that contain the search terms. From 
there, I can pipe the output to another script.

Come to think of it, I'm still only checking against ARGV[0] as a search 
term. I should be iterating through ARGV. Easy fix.
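
A rough sketch of that fix, assuming a file should count as a hit when it matches any one of the terms (requiring all of them would mean swapping any? for all?). The helper names build_patterns and matches_any? are made up for illustration:

```ruby
# Hypothetical helper: turn every command-line argument into a Regexp.
def build_patterns(args)
  args.map { |arg| Regexp.new(arg) }
end

# True if the text matches at least one pattern.
# Assumption: "any term" semantics; use all? instead to require every term.
def matches_any?(text, patterns)
  patterns.any? { |pattern| text =~ pattern }
end
```

In the script, search_term would become build_patterns(ARGV), and the test inside the loop would become matches_any?(fl, patterns).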

> 
> Detecting utf-16 or ascii isn't so bad, if you know for sure the
> utf-16 will have a BOM, you just have to look for it. (It's going to
> be either 0xFEFF or 0xFFFE). On the other hand, if you have to handle
> more than just utf-16 and ascii, things are going to get confusing
> quick. It's difficult to detect the proper encoding of a file,
> especially since so many encodings are supersets of ascii.

I'll just let `cat` do that for me :-)
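
For anyone who does want the BOM check in Ruby, here is a minimal sketch of the test Logan describes. It only looks at the first two bytes, so it assumes the choice is between UTF-16 with a BOM and everything else; it will not tell UTF-32 apart, for instance. The name detect_bom and the returned symbols are my own invention:

```ruby
# Hypothetical helper: classify a file's contents by its first two bytes.
def detect_bom(bytes)
  head = bytes.unpack("C2")  # first two bytes as integers
  case head
  when [0xFE, 0xFF] then :utf16_be  # BOM 0xFEFF stored big-endian
  when [0xFF, 0xFE] then :utf16_le  # BOM 0xFEFF stored little-endian
  else :no_bom                      # ascii, utf-8, or anything else
  end
end
```

It could be fed with something like detect_bom(File.open(path, "rb") { |f| f.read(2) }.to_s), where path is whichever note is being checked.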

-- 
Posted via http://www.ruby-forum.com/.