Sven Johansson <sven_u_johansson / spray.se> wrote:
> Tim Hammerquist wrote:
> > Just for our edification, would you run the following code
> > on those same files?
> >
> >   require 'digest/md5'
> >
> >   files = Dir['*'].select { |f| File.file?(f) }
> >
> >   files.each { |filename|
> >     fs_size = File.size(filename)       # get size of file from OS
> >
> >     data = File.read(filename)          # read the file
> >     data_size = data.length             # get the size of the data read
> >
> >     hash = Digest::MD5.hexdigest(data)  # calculate hash
> >
> >     # compare amount of data on filesystem
> >     # with amount of data read
> >     puts "#{hash} - #{filename}: #{data_size}/#{fs_size}"
> >   }
> >
>
> Sure. Here it is:
>
> 6ce4ad47bfa79b6c0e48636040c1dfb9 - 001.mp3: 52/50344
> 6ce4ad47bfa79b6c0e48636040c1dfb9 - 002.mp3: 52/52468
> 4cac5ea5e666942920aff937aa9b3ee5 - 0022-042.ogg: 335/141226
> 5947035093bbfa22a9e7cf6e69b82a4e - 0022-043.ogg: 335/118208
> 4cac5ea5e666942920aff937aa9b3ee5 - 0022-044.ogg: 335/178869
> 4cac5ea5e666942920aff937aa9b3ee5 - 0022-045.ogg: 335/181622
> 4cac5ea5e666942920aff937aa9b3ee5 - 0022-046.ogg: 335/154218
> 4cac5ea5e666942920aff937aa9b3ee5 - 0022-047.ogg: 335/161483
> 4cac5ea5e666942920aff937aa9b3ee5 - 0022-048.oog: 335/147162
> 4cac5ea5e666942920aff937aa9b3ee5 - 0022-049.ogg: 335/145142
> 5947035093bbfa22a9e7cf6e69b82a4e - 0022-050.ogg: 335/149968
> 4cac5ea5e666942920aff937aa9b3ee5 - 0022-057.ogg: 335/161358
> 4cac5ea5e666942920aff937aa9b3ee5 - 0022-058.ogg: 335/156026
> 4cac5ea5e666942920aff937aa9b3ee5 - 0022-059.ogg: 335/176575
> a7d6f03e275d69b363b9771c9d88e681 - 0022-061.ogg: 335/148704
> 4cac5ea5e666942920aff937aa9b3ee5 - 0022-062.ogg: 335/186715
> 4cac5ea5e666942920aff937aa9b3ee5 - 0022-069.ogg: 335/173036
> 4cac5ea5e666942920aff937aa9b3ee5 - 0022-070.ogg: 335/173752
> 4cac5ea5e666942920aff937aa9b3ee5 - 0022-071.ogg: 335/173581
> [snip]
>
> Which... hmm... does this mean that File.read(filename) will
> only read as far as the first perceived end of line in the
> binary file?
> Here I thought that would slurp up the entire
> file no matter what, even if it played havoc with the "lines"
> of the file.

You were right.  It read the whole file, right up until the EOF.
But in DOS/Windows text mode, the ASCII 26 character (^Z) is
treated as the EOF marker, so the read stops at the first ^Z byte
rather than at the real end of the file.

> Given that it seems to read as much per file for each file
> type, it would seem it just reads and hashes the file header
> before it encounters something that it considers to be an end
> of line.

I'm not an mp3/ogg file format specialist, but it looks like both
your mp3 and ogg files contain that EOF marker in their headers,
and that the first several hundred bytes of many of these ogg
files are the same, hence the identical hashes.

This is a prime example of why binary read mode is necessary on a
DOS/Win platform.  If you add the 'b' flag to that File read
operation and re-run the script, you should see matching file
sizes and differing hashes.  (I don't have a Windows box at the
moment.)

Cheers!
Tim Hammerquist
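
P.S.  For reference, here is a minimal sketch of what the
binary-mode version of that script might look like.  The helper
names read_binary and size_report are made up for illustration,
and this has not been run on Windows, so treat it as a sketch
rather than a verified fix.  The only substantive change is
opening each file with the 'rb' mode string:

```ruby
require 'digest/md5'

# Hypothetical helper: read a whole file in binary mode ('rb'), so that
# no byte (including ASCII 26, ^Z) is treated as an end-of-file marker
# and no newline translation is applied on DOS/Windows.
def read_binary(filename)
  File.open(filename, 'rb') { |f| f.read }
end

# Hypothetical helper: return one report line per regular file in dir,
# in the same "<md5> - <name>: <bytes read>/<bytes on disk>" format
# as the original script.
def size_report(dir = '.')
  Dir.entries(dir).sort
     .map { |name| File.join(dir, name) }
     .select { |path| File.file?(path) }
     .map do |path|
       fs_size = File.size(path)           # size according to the OS
       data    = read_binary(path)         # full contents, binary mode
       hash    = Digest::MD5.hexdigest(data)
       "#{hash} - #{File.basename(path)}: #{data.bytesize}/#{fs_size}"
     end
end
```

Running `puts size_report` in the directory holding the audio
files should then print two equal numbers on every line, if text
mode was indeed the culprit.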