Bill Kelly <billk / cts.com> wrote:
> From: "rtilley" <rtilley / vt.edu>
>>
>> I'm calculating md5 checksums on very large files (2 GB). This is a
>> safe way to do so, right? Also... is the file closed when the block
>> exits? I'm using 'rb' as this is used on Windows and Linux computers.
>>
>> md5 = Digest::MD5.new()
>> File.open(file, 'rb').each {|line| md5.update(line)}
>
> Hi - does the file really contain text lines?  Or is it a file
> full of binary data.  If it's a binary file, there may be no
> guarantee the whole thing isn't one very long "line".  In that
> case I'd recommend reading it in chunks.
>
> Untested:
>
> md5 = Digest::MD5.new()
> File.open(file, 'rb') do |io|
>  while (buf = io.read(4096)) && buf.length > 0
>    md5.update(buf)
>  end
> end

io.read(4096) returns nil at EOF, so your test for positive length is
redundant.  Also, for error handling I'd create the digest inside the block,
because then the digest is never created if the file cannot be opened:

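require 'digest/md5'

md5 = File.open(file, 'rb') do |io|
  dig = Digest::MD5.new
  # read returns nil at EOF, which ends the loop
  while (buf = io.read(4096))
    dig.update(buf)
  end
  dig   # the block's value becomes the return value of File.open
end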
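
On the original question of whether the file gets closed: in the one-liner
the block is attached to each, not to File.open, so nothing closes the file
explicitly.  The block form of File.open does close it when the block exits.
A minimal sketch to check this, assuming a scratch file made with Tempfile
(the names sample and handle are only for illustration):

```ruby
require 'digest/md5'
require 'tempfile'

# Hypothetical scratch file, created only for this demonstration.
content = 'hello world' * 1000
sample = Tempfile.new('md5-demo')
sample.binmode
sample.write(content)
sample.close

handle = nil
md5 = File.open(sample.path, 'rb') do |io|
  handle = io                 # keep a reference so it can be inspected later
  dig = Digest::MD5.new
  while (buf = io.read(4096))
    dig.update(buf)
  end
  dig                         # block value is returned by File.open
end

puts handle.closed?           # the block form closed the file on exit
puts md5.hexdigest == Digest::MD5.hexdigest(content)
```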

If you want to increase efficiency, you can pass a buffer to read, which
avoids allocating a new string for every chunk:

md5 = File.open(file, 'rb') do |io|
  dig = Digest::MD5.new
  buf = ""
  while io.read(4096, buf)
    dig.update(buf)
  end
  dig
end
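
To see the buffer reuse at work, here is a small sketch, assuming a tiny
scratch file and a 4-byte chunk size chosen only to keep the output short.
It uses String.new rather than a "" literal because a literal would be
frozen in a file that sets frozen_string_literal: true, and read needs to
modify the buffer:

```ruby
require 'tempfile'

# Hypothetical scratch file with ten known bytes.
sample = Tempfile.new('buf-demo')
sample.binmode
sample.write('abcdefghij')
sample.close

buf = String.new              # unfrozen, reusable buffer
same_object = []
chunks = []
File.open(sample.path, 'rb') do |io|
  while (ret = io.read(4, buf))
    same_object << ret.equal?(buf)  # read hands back the buffer it was given
    chunks << buf.dup               # copy, since buf is overwritten next round
  end
end

p chunks
p same_object
```

Note the dup: anything that keeps a chunk around must copy it, because the
next read overwrites the buffer.  Digest#update consumes the data at once,
so the digest loop needs no copy.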

Here's another nice variant, using while as a statement modifier:

md5 = File.open(file, 'rb') do |io|
  dig = Digest::MD5.new
  buf = ""
  dig.update(buf) while io.read(4096, buf)
  dig
end
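
Still untested against a real 2 GB file, but the chunked loop is easy to
check against a one-shot digest of known data.  A sketch, assuming a helper
called md5_of_file (my name, not part of Digest) and a small binary test
file:

```ruby
require 'digest/md5'
require 'tempfile'

# Hypothetical helper; chunk_size defaults to the 4096 used above.
def md5_of_file(path, chunk_size = 4096)
  File.open(path, 'rb') do |io|
    dig = Digest::MD5.new
    buf = String.new          # reusable, unfrozen read buffer
    dig.update(buf) while io.read(chunk_size, buf)
    dig.hexdigest
  end
end

# 25,600 bytes covering every byte value, so this is not line-oriented text.
data = (0..255).to_a.pack('C*') * 100
sample = Tempfile.new('md5-demo')
sample.binmode
sample.write(data)
sample.close

chunked  = md5_of_file(sample.path)
one_shot = Digest::MD5.hexdigest(data)
puts chunked == one_shot
```

Memory use stays at one chunk no matter how large the file is, which is the
point for the 2 GB case.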

Kind regards

    robert