Tim Hammerquist wrote:
> Just for our edification, would you run this following code on
> those same files?
>
>   require 'digest/md5'
>
>   files = Dir['*'].select { |f| File.file?(f) }
>
>   files.each { |filename|
>     fs_size = File.size(filename)       # get size of file from OS
>
>     data = File.read(filename)          # read the file
>     data_size = data.length             # get the size of the data read
>
>     hash = Digest::MD5.hexdigest(data)  # calculate hash
>
>     # compare amount of data on filesystem
>     # with amount of data read
>     puts "#{hash} - #{filename}: #{data_size}/#{fs_size}"
>   }
>

Sure. Here it is:

6ce4ad47bfa79b6c0e48636040c1dfb9 - 001.mp3: 52/50344
6ce4ad47bfa79b6c0e48636040c1dfb9 - 002.mp3: 52/52468
4cac5ea5e666942920aff937aa9b3ee5 - 0022-042.ogg: 335/141226
5947035093bbfa22a9e7cf6e69b82a4e - 0022-043.ogg: 335/118208
4cac5ea5e666942920aff937aa9b3ee5 - 0022-044.ogg: 335/178869
4cac5ea5e666942920aff937aa9b3ee5 - 0022-045.ogg: 335/181622
4cac5ea5e666942920aff937aa9b3ee5 - 0022-046.ogg: 335/154218
4cac5ea5e666942920aff937aa9b3ee5 - 0022-047.ogg: 335/161483
4cac5ea5e666942920aff937aa9b3ee5 - 0022-048.oog: 335/147162
4cac5ea5e666942920aff937aa9b3ee5 - 0022-049.ogg: 335/145142
5947035093bbfa22a9e7cf6e69b82a4e - 0022-050.ogg: 335/149968
4cac5ea5e666942920aff937aa9b3ee5 - 0022-057.ogg: 335/161358
4cac5ea5e666942920aff937aa9b3ee5 - 0022-058.ogg: 335/156026
4cac5ea5e666942920aff937aa9b3ee5 - 0022-059.ogg: 335/176575
a7d6f03e275d69b363b9771c9d88e681 - 0022-061.ogg: 335/148704
4cac5ea5e666942920aff937aa9b3ee5 - 0022-062.ogg: 335/186715
4cac5ea5e666942920aff937aa9b3ee5 - 0022-069.ogg: 335/173036
4cac5ea5e666942920aff937aa9b3ee5 - 0022-070.ogg: 335/173752
4cac5ea5e666942920aff937aa9b3ee5 - 0022-071.ogg: 335/173581
[snip]

Which... hmm... does this mean that File.read(filename) will only read
as far as the first perceived end of line in the binary file? Here I
thought it would slurp up the entire file no matter what, even if it
played havoc with the "lines" of the file.
Given that it seems to read the same number of bytes from every file of a given type (52 for the MP3s, 335 for the Oggs), it would seem it just reads and hashes the file header before it encounters something that it considers to be an end of line. But then again, shouldn't all the hashes be identical for the same header? If they are not, you'd think it would have read somewhat more or less of the file.
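
One thing worth ruling out, assuming these files live on Windows (nothing above says so): there, a file opened in text mode is typically treated as ended at the first Ctrl-Z byte (0x1A), so a plain File.read can come back with only a short prefix of a binary file. Below is a minimal sketch that reads in binary mode instead. The names read_binary and checksum_report are made up for illustration, and the temporary file only stands in for a real audio file.

```ruby
require 'digest/md5'
require 'tempfile'

# Read the whole file in binary mode ('rb'), so no newline translation
# or end-of-file character handling is applied to the data.
# (Hypothetical helper name, not part of any library.)
def read_binary(filename)
  File.open(filename, 'rb') { |f| f.read }
end

# Returns [md5_hex, bytes_read, size_on_disk] for one file.
# (Hypothetical helper name as well.)
def checksum_report(filename)
  data = read_binary(filename)
  [Digest::MD5.hexdigest(data), data.bytesize, File.size(filename)]
end

# Stand-in for a real audio file: binary data containing a Ctrl-Z byte
# and a CR/LF pair, the two things text mode may treat specially.
Tempfile.create('sample') do |tmp|
  tmp.binmode
  tmp.write("header\x1Abody\r\nmore".b)
  tmp.flush

  hash, read, size = checksum_report(tmp.path)
  puts "#{hash} - #{File.basename(tmp.path)}: #{read}/#{size}"
end
```

If the two numbers after the colon match once the files are read this way, the short reads came from text-mode handling rather than from anything in File.read itself.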