Trans wrote:
> what's the best way to determine if a file is yaml?

In light of the other responses, which show how hard it is to do this in 
general, what about a pragmatic approach that might work in most of the 
cases you are interested in?

Look at the first N lines.

If any line has _any_ non-printing characters, it's not correct YAML and 
wasn't generated by YAML.dump.[1]

If any line is longer than M chars, or other binary-file heuristics 
apply[2], it's probably not a manually written YAML file.

If it passes at least _one_ of these two checks, then check whether at 
least 80% of the (first N) lines match the following:

/^\s*(-|\?|[\w\s]*:)\s/
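
To get a feel for what that pattern accepts, here is a small sketch that 
runs it over a few made-up sample lines (the constant name and the samples 
are mine, purely for illustration). Note that the trailing \s means a bare 
"key:" only matches while the line still carries its newline, so don't 
chomp before testing.

```ruby
# The pattern from the post, applied to a few made-up sample lines.
PATTERN = /^\s*(-|\?|[\w\s]*:)\s/

samples = [
  "name: demo\n",      # mapping entry
  "  - item\n",        # sequence entry
  "? complex key\n",   # explicit key
  "key:\n",            # the newline satisfies the trailing \s
  "key:",              # chomped: nothing after the colon, so no match
  "---\n",             # document marker: not covered by the pattern
  "plain prose\n",     # no indicator at all
  "http://x.org/\n"    # colon not followed by whitespace
]

samples.each do |line|
  verdict = line.match?(PATTERN) ? "match" : "no match"
  puts format("%-18s %s", line.inspect, verdict)
end
```

The "---" case is worth keeping in mind: document markers count against 
the 80% unless you decide to skip them too.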

Maybe add some logic to skip blocks of text like this (so they don't 
count against the 80%):

a: |
   skip
   me

Also, check for > in place of |.

And also skip blanks and comments /^\s*(#|$)/.

And then finally load it and rescue any ArgumentError (on newer, 
Psych-based Rubies a parse failure raises Psych::SyntaxError instead).
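
Putting the steps above together, a rough sketch might look like this. 
The method name, the constant names, the defaults for N and M, and the 
block-header pattern are all my own guesses, not anything standard; it 
also assumes a Psych-based Ruby and takes the file contents as a string. 
Treat it as a starting point, not a tested detector.

```ruby
require 'yaml'

YAMLISH_LINE = /^\s*(-|\?|[\w\s]*:)\s/   # the pattern from the post
SKIPPABLE    = /^\s*(#|$)/               # blank lines and comments
BLOCK_HEADER = /:\s*[|>][-+\d]*\s*$/     # "a: |" or "a: >" (assumed form)

# Hypothetical helper. n, m and ratio are the N, M and 80% from the post;
# the default values for n and m are arbitrary.
def looks_like_yaml?(text, n: 50, m: 200, ratio: 0.8)
  return false unless text.valid_encoding?   # my addition: avoid regexp errors
  lines = text.each_line.first(n)            # keep newlines; the pattern needs them
  return false if lines.empty?

  # The two cheap checks; carry on if at least one of them passes.
  printable = lines.none? { |l| l =~ /[^[:print:]\s]/ }
  short     = lines.all?  { |l| l.chomp.length <= m }
  return false unless printable || short

  counted = matched = 0
  block_indent = nil                         # indent of an open "|" or ">" block
  lines.each do |line|
    next if line =~ SKIPPABLE
    indent = line[/\A[ \t]*/].length
    if block_indent
      next if indent > block_indent          # still inside the block text
      block_indent = nil
    end
    counted += 1
    matched += 1 if line =~ YAMLISH_LINE
    block_indent = indent if line =~ BLOCK_HEADER
  end
  return false if counted.zero? || matched < ratio * counted

  YAML.load(text)                            # final step: actually load it
  true
rescue ArgumentError, Psych::SyntaxError     # other Psych errors still propagate
  false
end
```

For example, looks_like_yaml?(File.read(path)) for some path of yours. 
Plain prose is rejected by the 80% test even though YAML.load alone would 
happily return it as one long string, which is the reason for doing the 
line check before the load.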

There are probably a lot of corner cases that kill this approach if you 
cannot tolerate false negatives (i.e., legit yaml that gets rejected by 
the above).

---

[1] The YAML spec, http://yaml.org/spec/current.html, says nonprinting 
chars are encoded (see 4.1.1. Character Set), and it seems to be true, 
at least in the dump output:

irb(main):023:0> puts({"a"=>"\002"}.to_yaml)
---
a: !binary |
   Ag==

However, YAML can load unescaped binary data, as Devin showed:

irb(main):025:0> YAML.load "a: \002"
=> {"a"=>"\002"}

[2] For example, 
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/52548

-- 
       vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407