On May 4, 2006, at 12:52 PM, Nathan Olberding wrote:

> Logan Capaldo wrote:
>> On May 4, 2006, at 12:00 AM, Nathan Olberding wrote:
>>
>>>> What system are you on that 'cat' is so smart? (Maybe cat isn't so
>>>
>>> 2.txt: MPEG 1.0 Layer I, 128 kbit/s, ? Hz stereo
>>>
>>> --
>>> Posted via http://www.ruby-forum.com/.
>>>
>>
>> Aha, then I'm going to guess that the encoding is probably Western
>> (Mac) which if I am not mistaken is a variation on ISO-8859-1.
>> Although the response from file is interesting. Is it possible it was
>> saved as UTF16?
>
> Anything is possible :-) I just write things in text editors and have
> this irrational expectation that the text will be immediately usable!
>
> running "head 2.txt" shows that there's a character or two of
> gobbledygook at the start of the file, which I'm guessing is some
> indication of the character set used.
>
> Hmm. Maybe I'll instead grep through the output of "cat #{filename}".
>
> --
> Posted via http://www.ruby-forum.com/.
>

Yeah, this makes me think it's UTF-16 with a BOM (byte order mark).
Here's an example:

    % cat test.txt
    ÇùÇùÇù·ÅÞÑWhat's new pussy-cat?
    Hello world!

As you can see, I saved this file as UTF-16. You can also see that my
cat isn't quite as smart as yours: we see the BOM at the beginning. The
next step is to write a Ruby script that can handle this:

    % cat text_search.rb
    require 'iconv'
    $KCODE = 'u'

    # The first argument is the pattern; the remaining arguments are
    # the files, which ARGF reads in order.
    pattern = Regexp.new(ARGV.shift)

    # Iconv.new takes (to, from): convert UTF-16 input to UTF-8.
    convertor = Iconv.new('utf-8', 'utf-16')
    begin
      ARGF.each do |line|
        out = convertor.iconv(line)
        if pattern =~ out
          puts "#{ARGF.lineno}:#{out}"
        end
      end
    ensure
      convertor.close
    end

Sadly, this will _only_ handle UTF-16 encoded files; it can't even
handle UTF-8. Here are some examples of it in use:

    % ruby text_search.rb talk test.txt
    1:Hello darkness my old friend, I've come to talk to you again.
    % ruby text_search.rb Hello test.txt
    1:Hello darkness my old friend, I've come to talk to you again.
    3:Hello world!

Telling UTF-16 apart from ASCII isn't so bad: if you know for sure the
UTF-16 will have a BOM, you just have to look for it at the start of
the file. (The BOM is the character U+FEFF, so the first two bytes will
be either FE FF for big-endian or FF FE for little-endian.) On the
other hand, if you have to handle more than just UTF-16 and ASCII,
things are going to get confusing quickly. It's difficult to detect the
proper encoding of a file, especially since so many encodings are
supersets of ASCII.
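
For what it's worth, the BOM check could be sketched roughly like this. It's only an illustration, not part of text_search.rb: sniff_bom is a name I made up, and treating "no BOM" as ASCII is an assumption that only holds if those really are the only two possibilities.

```ruby
# Hypothetical helper (not part of text_search.rb): guess the encoding
# from the first two bytes of some data.
def sniff_bom(data)
  # unpack('C2') returns the first two bytes as integers
  # (nil for any byte that is missing).
  case data.unpack('C2')
  when [0xFE, 0xFF] then 'UTF-16BE'   # big-endian BOM
  when [0xFF, 0xFE] then 'UTF-16LE'   # little-endian BOM
  else nil                            # no UTF-16 BOM found
  end
end

# Build a small sample by hand: big-endian BOM followed by "H".
sample = [0xFE, 0xFF, 0x00, 0x48].pack('C*')
puts sniff_bom(sample) || 'no BOM (assuming ASCII)'
puts sniff_bom('Hello') || 'no BOM (assuming ASCII)'
```

A nil result only tells you there was no UTF-16 BOM; it says nothing about which of the many ASCII-compatible encodings the file actually uses.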