Tim Hammerquist wrote:

> Just for our edification, would you run this following code on
> those same files?
>
> require 'digest/md5'
>
> files = Dir['*'].select { |f| File.file?(f) }
>
> files.each { |filename|
>     fs_size = File.size(filename) # get size of file from OS
>
>     data = File.read(filename)    # read the file
>     data_size = data.length       # get the size of the data read
>
>     hash = Digest::MD5.hexdigest(data)  # calculate hash
>
>     # compare amount of data on filesystem
>     #  with amount of data read
>     puts "#{hash} - #{filename}: #{data_size}/#{fs_size}"
> }
>

Sure. Here it is:

6ce4ad47bfa79b6c0e48636040c1dfb9 - 001.mp3: 52/50344
6ce4ad47bfa79b6c0e48636040c1dfb9 - 002.mp3: 52/52468
4cac5ea5e666942920aff937aa9b3ee5 - 0022-042.ogg: 335/141226
5947035093bbfa22a9e7cf6e69b82a4e - 0022-043.ogg: 335/118208
4cac5ea5e666942920aff937aa9b3ee5 - 0022-044.ogg: 335/178869
4cac5ea5e666942920aff937aa9b3ee5 - 0022-045.ogg: 335/181622
4cac5ea5e666942920aff937aa9b3ee5 - 0022-046.ogg: 335/154218
4cac5ea5e666942920aff937aa9b3ee5 - 0022-047.ogg: 335/161483
4cac5ea5e666942920aff937aa9b3ee5 - 0022-048.oog: 335/147162
4cac5ea5e666942920aff937aa9b3ee5 - 0022-049.ogg: 335/145142
5947035093bbfa22a9e7cf6e69b82a4e - 0022-050.ogg: 335/149968
4cac5ea5e666942920aff937aa9b3ee5 - 0022-057.ogg: 335/161358
4cac5ea5e666942920aff937aa9b3ee5 - 0022-058.ogg: 335/156026
4cac5ea5e666942920aff937aa9b3ee5 - 0022-059.ogg: 335/176575
a7d6f03e275d69b363b9771c9d88e681 - 0022-061.ogg: 335/148704
4cac5ea5e666942920aff937aa9b3ee5 - 0022-062.ogg: 335/186715
4cac5ea5e666942920aff937aa9b3ee5 - 0022-069.ogg: 335/173036
4cac5ea5e666942920aff937aa9b3ee5 - 0022-070.ogg: 335/173752
4cac5ea5e666942920aff937aa9b3ee5 - 0022-071.ogg: 335/173581
[snip]

Which... hmm... does this mean that File.read(filename) only reads as
far as the first perceived end of line in the binary file? Here I
thought it would slurp up the entire file no matter what, even if that
played havoc with the "lines" of the file. Given that it reads the same
number of bytes for every file of a given type (52 for the .mp3 files,
335 for the .ogg files), it would seem that it just reads and hashes
the file header until it hits something it considers an end of line.
But then again, shouldn't all the hashes be identical for the same
header? And if the headers differ, you'd think it would read somewhat
more or less of each file.
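
One way to test that theory is to read the files in binary mode and see
whether the two sizes then agree. This is only a sketch, and it rests on
an assumption the thread has not confirmed: that the script runs on a
platform that distinguishes text mode from binary mode (Windows, for
example), where a text-mode read can stop early at a byte it treats as
an end-of-file marker. The helper names read_binary and hash_report are
made up for illustration.

```ruby
require 'digest/md5'

# Read the whole file in binary mode ('rb'), so that no newline
# translation or text-mode end-of-file handling is applied.
# (Hypothetical helper, not part of the original script.)
def read_binary(filename)
  File.open(filename, 'rb') { |f| f.read }
end

# Returns [md5 hex digest, bytes actually read, size reported by the OS].
# (Hypothetical helper, not part of the original script.)
def hash_report(filename)
  data = read_binary(filename)
  [Digest::MD5.hexdigest(data), data.bytesize, File.size(filename)]
end
```

Calling hash_report(filename) for each file and printing the three
values gives the same report as the script above. If the assumption
holds, the two sizes should now match and each file should get its own
hash.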