On May 4, 2006, at 12:52 PM, Nathan Olberding wrote:

> Logan Capaldo wrote:
>> On May 4, 2006, at 12:00 AM, Nathan Olberding wrote:
>>
>>>> What system are you on that 'cat' is so smart? (Maybe cat isn't so
>>>
>>> 2.txt: MPEG 1.0 Layer I, 128 kbit/s, ? Hz stereo
>>
>> Aha, then I'm going to guess that the encoding is probably Western
>> (Mac), which if I am not mistaken is a variation on ISO-8859-1.
>> Although the response from file is interesting. Is it possible it was
>> saved as UTF-16?
>
> Anything is possible :-) I just write things in text editors and have
> this irrational expectation that the text will be immediately usable!
>
> running "head 2.txt" shows that there's a character or two of
> gobbledygook at the start of the file, which I'm guessing is some
> indication of the character set used.
>
> Hmm. Maybe I'll instead grep through the output of "cat #{filename}".

Yeah, this makes me think it's UTF-16 with a BOM (byte order mark).

Here's an example:
% cat test.txt
ÇùÇùÇù·ÅÞÑWhat's new pussy-cat?
Hello world!

As you can see, I saved this file as UTF-16. You can also see that my
cat isn't quite as smart as yours: we see the BOM at the beginning.
The next step is to write a Ruby script that can handle this:

% cat text_search.rb
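require 'iconv'
$KCODE='u'  # treat strings and regexps as UTF-8
# The first argument is the pattern; remaining arguments are files for ARGF.
pattern = Regexp.new(ARGV.shift)
# Iconv.new takes the target encoding first, then the source encoding.
convertor = Iconv.new('utf-8', 'utf-16')
begin
   ARGF.each do |line|
     # Convert each line to UTF-8 before matching against it.
     out = convertor.iconv(line)
     if pattern =~ out
       puts "#{ARGF.lineno}:#{out}"
     end
   end
ensure
   # Release the converter even if the conversion raises.
   convertor.close
end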

Sadly, this will _only_ handle UTF-16 encoded files; it can't even
handle UTF-8.

Here are some examples of it in use:
% ruby text_search.rb talk test.txt
1:Hello darkness my old friend,  I've come to talk to you again.

% ruby text_search.rb Hello test.txt
1:Hello darkness my old friend,  I've come to talk to you again.
3:Hello world!

Detecting UTF-16 versus ASCII isn't so bad. If you know for sure the
UTF-16 will have a BOM, you just have to look for it: the first two
bytes will be either 0xFE 0xFF (big-endian) or 0xFF 0xFE
(little-endian). On the other hand, if you have to handle more than
just UTF-16 and ASCII, things are going to get confusing quickly. It's
difficult to detect the proper encoding of a file, especially since so
many encodings are supersets of ASCII.
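
For the simple case, the BOM check described above might look
something like the sketch below. This is only a rough illustration,
not a tested tool: the helper names utf16_bom_encoding and to_utf8 are
made up for this example, and it assumes a Ruby with the built-in
String encoding methods (1.9 or later) rather than the Iconv library
used in the script above. It also assumes that anything without a
UTF-16 BOM is plain ASCII or UTF-8, which is exactly the assumption
that breaks down once more encodings are involved.

```ruby
# Hypothetical helper: inspect the first two bytes of the data.
# Returns the Ruby encoding name matching a UTF-16 BOM, or nil if
# no UTF-16 BOM is present.
def utf16_bom_encoding(bytes)
  head = bytes.byteslice(0, 2).to_s.b
  case head
  when "\xFE\xFF".b then "UTF-16BE"
  when "\xFF\xFE".b then "UTF-16LE"
  end
end

# Hypothetical helper: return the data as UTF-8 with the BOM stripped.
# Data without a BOM is assumed (not verified) to be ASCII or UTF-8
# and is passed through unchanged.
def to_utf8(bytes)
  enc = utf16_bom_encoding(bytes)
  return bytes.dup.force_encoding("UTF-8") unless enc
  bytes.byteslice(2..-1).force_encoding(enc).encode("UTF-8")
end

# Usage sketch: search a file given on the command line, e.g.
#   ruby bom_search.rb Hello test.txt
if __FILE__ == $0 && ARGV.size >= 2
  pattern = Regexp.new(ARGV.shift)
  text = to_utf8(File.binread(ARGV.shift))
  text.each_line.with_index(1) do |line, number|
    puts "#{number}:#{line}" if pattern =~ line
  end
end
```

Reading the whole file as bytes and converting it once sidesteps the
question of where the line breaks fall in the raw UTF-16 data, at the
cost of holding the file in memory.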