* Robert Klemme <shortcutter / googlemail.com> (09:04) wrote:

> If I'd really need it I'd probably do a heuristic based on
> distribution of byte values across an initial portion of the file.

That only shows how many non-ASCII characters are used. It won't
recognise Russian script in UTF-8 as text, or uuencoded data as binary.
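
To illustrate, here is a rough sketch. non_ascii_ratio is a hypothetical
stand-in for such a byte-distribution heuristic (it is not Robert's code,
and the choice of "printable" bytes is my assumption). It rates Russian
UTF-8 text as almost entirely non-ASCII, and uuencoded binary data as
pure ASCII:

```ruby
# Hypothetical heuristic: fraction of bytes that are not printable ASCII.
# Tab, LF, CR and 0x20..0x7E are treated as printable (an assumption).
def non_ascii_ratio(data)
  return 0.0 if data.empty?
  odd = data.each_byte.count do |b|
    !(b == 9 || b == 10 || b == 13 || (32..126).include?(b))
  end
  odd.to_f / data.bytesize
end

# "Привет, мир" written with escapes, so the source file stays ASCII
russian = "\u041f\u0440\u0438\u0432\u0435\u0442, \u043c\u0438\u0440"
binary  = (0..255).to_a.pack("C*")  # every possible byte value once
uu      = [binary].pack("u")        # Array#pack("u") uuencodes a string

puts non_ascii_ratio(russian)  # => 0.9, looks "binary" although it is text
puts non_ascii_ratio(uu)       # => 0.0, looks "text" although it is binary
```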

What diff (and thus rcs, cvs, svn ...) cares about is lines. Something
is text if it's logically organized in short lines, and EOL characters
are used only for ending lines.

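class File
  # Heuristic: binary if a CR is not followed by LF, or if a line
  # within the first 1024 bytes is longer than 1000 bytes.
  def self.binary?(name)
    cr, len, mlen = false, 0, 0
    # io.read returns nil for an empty file, so fall back to ""
    head = File.open(name, "rb") {|io| io.read(1024)} || ""
    head.each_byte do |bt|
      # a CR that is not followed by LF does not end a line
      return true if cr and bt != 10
      cr = false
      case bt
        when 13
          cr = true
        when 10
          mlen = len if len > mlen
          len = 0
        else
          len += 1
      end
    end
    mlen = len if len > mlen  # count the last, unterminated line
    mlen > 1000
  end
end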

I chose 1000 as the maximum line length, to fit whole paragraphs in one
line. But of course the maximum of the processing tool is what matters
here. That tool is the right place to do the check anyway.
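
If the limit should come from the tool, one option is to measure the
longest line and let the caller compare. A minimal sketch: longest_line
and its 64 KiB default are hypothetical names and values of mine, not
part of any tool:

```ruby
require "tempfile"

# Hypothetical helper: length in bytes of the longest line within the
# first `limit` bytes of a file. LF and CRLF line endings are not counted.
def longest_line(name, limit = 64 * 1024)
  head = File.open(name, "rb") { |io| io.read(limit) } || ""
  head.split(/\r?\n/).map(&:bytesize).max || 0
end

# Usage: a throwaway file with one short and one 1500-byte line
sample = Tempfile.new("sample")
sample.write("short line\n" + "x" * 1500 + "\n")
sample.close

puts longest_line(sample.path)         # => 1500
puts longest_line(sample.path) > 1000  # => true, too long for a 1000-byte limit
puts longest_line(sample.path) > 2000  # => false, fine for a 2000-byte limit
```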

Regards,                 simon .... l