On May 4, 2006, at 1:37 PM, Nathan Olberding wrote:

> Logan Capaldo wrote:
>> Here's an example
>> % cat test.txt
>> ��Hello darkness my old friend, I've come to talk to you again.
>> What's new pussy-cat?
>> Hello world!
>
> Mine comes out exactly the same, even through cat. I just never noticed
> the first characters before.
>
>> As you can see I saved this file as UTF-16. You can also see that my
>> cat isn't quite as smart as yours, we see the BOM at the beginning.
>> The next step is to write a ruby script that can handle this:
>
>> Sadly, this will _only_ handle utf-16 encoded files, it can't even
>> handle utf-8.
>
> Here's the code I've decided I'm happy with:
>
> #!/usr/bin/env ruby
>
> search_term = /#{ARGV[0]}/
> notes_dir = Dir.new(".").to_a - ['.', '..']
> positive_results = []
>
> notes_dir.each do |note|
>   fl = `cat "#{note}"`
>   if fl =~ search_term
>     positive_results.push(note)
>   end
> end
>
> positive_results.uniq.each do |x|
>   puts "\"#{x}\""
> end
>
> The search script is in the directory I want to traverse (~/notes). I
> just want to get the names of files that contain the search terms. From
> there, I can pipe the output to another script.
>
> Come to think of it, I'm still only checking against ARGV[0] as a search
> term. I should be iterating through ARGV. Easy fix.
>
>>
>> Detecting utf-16 or ascii isn't so bad, if you know for sure the
>> utf-16 will have a BOM, you just have to look for it. (It's going to
>> be either 0xFEFF or 0xFFFE). On the other hand if you have to handle
>> more than just utf-16 and ascii, things are going to get confusing
>> quick, it's difficult to detect the proper encoding of a file,
>> especially since so many encodings are supersets of ascii.
>
> I'll just let `cat` do that for me :-)
>

The problem with that is that cat isn't really doing anything, and as soon as someone saves a multi-byte character to that file, all hell is going to break loose.
cat is doing something along the lines of

    while (line = getline()) {
        for (i = 0; i < length(line); i++) {
            if (isprint(line[i])) {
                print line[i]
            }
        }
    }

which, as long as the text happens to be all single-byte characters, just skips the nulls. If the source text contains non-English characters, etc., those bytes won't just be nulls any more, and if one of them is something printable (like the BOM at the beginning of the file, for instance) it's going to create the wrong output.

> --
> Posted via http://www.ruby-forum.com/.
>
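
If you'd rather not lean on cat at all, here is a rough sketch of the "look for the BOM" idea in Ruby. It assumes a Ruby with encoding support (1.9 or later), and it assumes that any file without a UTF-16 BOM is plain ASCII or UTF-8, which won't hold for every file out there. The names UTF16_BOMS, decode_note and note_matches? are made up for this example; they aren't from your script or from any library.

```ruby
# Hypothetical helper names; a sketch, not a drop-in replacement.

# The two byte orders a UTF-16 BOM can appear in on disk.
UTF16_BOMS = {
  "\xFF\xFE".b => Encoding::UTF_16LE,
  "\xFE\xFF".b => Encoding::UTF_16BE
}.freeze

# Takes the raw bytes of a note and returns UTF-8 text.
def decode_note(bytes)
  bytes = bytes.b                      # work on a binary copy of the input
  encoding = UTF16_BOMS[bytes[0, 2]]   # nil when there is no UTF-16 BOM
  if encoding
    # Drop the 2-byte BOM, then transcode the rest to UTF-8.
    bytes[2..-1].force_encoding(encoding).encode(Encoding::UTF_8)
  else
    # Assumption: no BOM means ASCII or UTF-8. scrub replaces any
    # invalid bytes so a later regexp match does not raise.
    bytes.force_encoding(Encoding::UTF_8).scrub
  end
end

# True if the file at +path+ contains a match for +pattern+.
def note_matches?(path, pattern)
  pattern.match?(decode_note(File.binread(path)))
end
```

In your loop that would replace the backtick call, something like `positive_results.push(note) if note_matches?(note, search_term)`. Reading with File.binread keeps Ruby from guessing an encoding before the BOM has been checked.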