Sven Johansson <sven_u_johansson / spray.se> wrote:
> Tim Hammerquist wrote:
> > Just for our edification, would you run the following code
> > on those same files?
> >
> > require 'digest/md5'
> >
> > files = Dir['*'].select { |f| File.file?(f) }
> >
> > files.each { |filename|
> >     fs_size = File.size(filename) # get size of file from OS
> >
> >     data = File.read(filename)    # read the file
> >     data_size = data.length       # get the size of the data read
> >
> >     hash = Digest::MD5.hexdigest(data)  # calculate hash
> >
> >     # compare amount of data on filesystem
> >     #  with amount of data read
> >     puts "#{hash} - #{filename}: #{data_size}/#{fs_size}"
> > }
> >
>
> Sure. Here it is:
>
> 6ce4ad47bfa79b6c0e48636040c1dfb9 - 001.mp3: 52/50344
> 6ce4ad47bfa79b6c0e48636040c1dfb9 - 002.mp3: 52/52468
> 4cac5ea5e666942920aff937aa9b3ee5 - 0022-042.ogg: 335/141226
> 5947035093bbfa22a9e7cf6e69b82a4e - 0022-043.ogg: 335/118208
> 4cac5ea5e666942920aff937aa9b3ee5 - 0022-044.ogg: 335/178869
> 4cac5ea5e666942920aff937aa9b3ee5 - 0022-045.ogg: 335/181622
> 4cac5ea5e666942920aff937aa9b3ee5 - 0022-046.ogg: 335/154218
> 4cac5ea5e666942920aff937aa9b3ee5 - 0022-047.ogg: 335/161483
> 4cac5ea5e666942920aff937aa9b3ee5 - 0022-048.oog: 335/147162
> 4cac5ea5e666942920aff937aa9b3ee5 - 0022-049.ogg: 335/145142
> 5947035093bbfa22a9e7cf6e69b82a4e - 0022-050.ogg: 335/149968
> 4cac5ea5e666942920aff937aa9b3ee5 - 0022-057.ogg: 335/161358
> 4cac5ea5e666942920aff937aa9b3ee5 - 0022-058.ogg: 335/156026
> 4cac5ea5e666942920aff937aa9b3ee5 - 0022-059.ogg: 335/176575
> a7d6f03e275d69b363b9771c9d88e681 - 0022-061.ogg: 335/148704
> 4cac5ea5e666942920aff937aa9b3ee5 - 0022-062.ogg: 335/186715
> 4cac5ea5e666942920aff937aa9b3ee5 - 0022-069.ogg: 335/173036
> 4cac5ea5e666942920aff937aa9b3ee5 - 0022-070.ogg: 335/173752
> 4cac5ea5e666942920aff937aa9b3ee5 - 0022-071.ogg: 335/173581
> [snip]
>
> Which... hmm... does this mean that File.read(filename) will
> only read as far as the first perceived end of line in the
> binary file? Here I thought that would slurp up the entire
> file no matter what, even if it played havoc with the "lines"
> of the file.

You were right: File.read does slurp the whole file, right up
until the EOF.  But in DOS/Windows text mode, the ASCII 26
character (^Z) is treated as the EOF marker, so the read stops
at the first ^Z byte it meets.
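
If you want to check that theory, here is a minimal sketch that
reports where the first ^Z byte sits in a file.  The helper name
first_ctrl_z_offset is my own invention, not anything built in,
and I haven't run it on Windows.  If the explanation holds, the
offsets should land close to the 52 and 335 byte figures in your
output.

```ruby
# Hypothetical helper (not part of Ruby): returns the offset of the
# first ASCII 26 (^Z, 0x1A) byte in a file, or nil if there is none.
def first_ctrl_z_offset(filename)
  # 'rb' opens the file in binary mode, so nothing is translated
  # or cut short on DOS/Windows.
  data = File.open(filename, 'rb') { |f| f.read }
  data.index("\x1A")
end
```

Something like first_ctrl_z_offset('001.mp3') would then give the
position at which a text-mode read gives up.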

> Given that it seems to read as much per file for each file
> type, it would seem it just reads and hashes the file header
> before it encounters something that it considers to be an end
> of line.

I'm not an mp3/ogg file format specialist, but it
looks like both your mp3 and ogg files contain that EOF marker
in their headers, and that the first several hundred bytes of
many of these ogg files are the same, hence the identical
hashes.

This is a prime example of why binary read mode is necessary on
a DOS/Win platform.  If you open each file in binary mode (the
'b' flag, as in File.open(filename, 'rb')) instead of calling
File.read(filename), and re-run the script, you should see
matching file sizes and differing hashes.  (I don't have a
Windows box at the moment, so I can't confirm this myself.)
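
A binary-mode version of the script might look roughly like this.
It's only a sketch, untested on Windows; read_binary, report_line
and report_all are names I made up to keep the pieces separate.

```ruby
require 'digest/md5'

# Read a whole file in binary mode.  On DOS/Windows this avoids the
# text-mode behaviour of stopping at the first ^Z byte.
def read_binary(filename)
  File.open(filename, 'rb') { |f| f.read }
end

# Build the same "hash - name: read/on-disk" line as the earlier script.
def report_line(filename)
  fs_size = File.size(filename)       # size of file according to the OS
  data = read_binary(filename)        # every byte of the file
  hash = Digest::MD5.hexdigest(data)  # hash of the full contents
  "#{hash} - #{filename}: #{data.length}/#{fs_size}"
end

# Print one report line per regular file matching the glob pattern.
def report_all(pattern = '*')
  Dir[pattern].select { |f| File.file?(f) }.each { |f| puts report_line(f) }
end
```

Calling report_all with no arguments covers the same files as
before.  The two numbers on each line should now agree.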

Cheers!
Tim Hammerquist