File.yaml?(fname)

what’s the best way to determine if a file is yaml?

thanks,
t.

Trans wrote:

what’s the best way to determine if a file is yaml?

Naive answer:

def File.yaml?(fname)
  YAML.load(IO.read(fname))
  true
rescue ArgumentError
  false
end

Though, open up irb -ryaml and keep running this line:
YAML.load Array.new(60){rand 256}.pack('c*')

I’m not sure that’s what you’re after. :)
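
A minimal sketch of why the naive check is so permissive, assuming a current Ruby where the bundled parser is Psych (it raises Psych::SyntaxError rather than ArgumentError on malformed input). Almost any plain text parses as a single string scalar, so a successful load proves little. The helper name `yaml_collection?` and the extra rule that the top-level value must be a mapping or a sequence are illustrative assumptions, not something proposed in this thread.

```ruby
require "yaml"

# Hypothetical helper (not from the thread): treat text as YAML only if it
# parses AND the top-level value is a mapping (Hash) or a sequence (Array).
def yaml_collection?(text)
  doc = YAML.safe_load(text)
  doc.is_a?(Hash) || doc.is_a?(Array)
rescue Psych::Exception
  # Covers syntax errors as well as values that safe_load refuses to build.
  false
end

puts yaml_collection?("a: 1\nb: 2\n")          # a mapping
puts yaml_collection?("just some plain text")  # parses, but only as a scalar
```

This still rejects legitimate documents whose top-level value is a scalar, so it trades false positives for false negatives.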

And I’m guessing you didn’t mean:
def File.yaml?(fname)
  # extname returns the extension with its leading dot, e.g. ".yaml"
  extname(fname) =~ /^\.ya?ml$/
end

Devin

Trans wrote:

what’s the best way to determine if a file is yaml?

Process the file using a parser meant to process YAML. If the parse fails,
it means:

  1. The file isn’t YAML.

  2. The chosen parser is not robust enough to process this specific,
     valid YAML file.

  3. The YAML file, although more or less valid YAML, has syntax errors not
     consistent with the formal YAML specification.

  4. The YAML specification contains ambiguities that allow a valid parser
     to fail on valid YAML syntax.

  5. Other.

In other words, you cannot really say, absolutely and unambiguously, that a
particular file is a YAML file.

Trans wrote:

what’s the best way to determine if a file is yaml?

In light of the other responses, which show how hard it is to do this in
general, what about a pragmatic approach that might work in most of the
cases you are interested in?

Look at the first N lines.

If any line has any non-printing characters, it’s not correct YAML and
wasn’t generated by YAML#dump.[1]

If any are longer than M chars or other binary file heuristics apply[2],
it’s probably not a manually written YAML file.

If it passes at least one of these two checks, then check to see if
80% of the (first N) lines match the following:

/^\s*(-|\?|[\w\s]*:)\s/

Maybe add some logic to skip blocks of text like this (so they don’t
count against the 80%):

a: |
  skip
  me

Also, check for > in place of |.

And also skip blanks and comments /^\s*(#|$)/.

And then finally load it and rescue any ArgumentError.

There are probably a lot of corner cases that kill this approach if you
cannot tolerate false negatives (i.e., legit yaml that gets rejected by
the above).
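
The steps above can be sketched roughly as follows. This is only one reading of the heuristic, assuming a current Ruby with Psych (hence Psych::Exception instead of ArgumentError). The helper name `looks_like_yaml?`, the defaults for N, M and the 80% ratio, and the `BLOCK_OPEN` pattern used to skip `|` and `>` blocks are all assumptions made for illustration.

```ruby
require "yaml"

# Line shapes from the post: a YAML-looking line and a blank/comment line.
YAMLISH       = /^\s*(-|\?|[\w\s]*:)\s/
BLANK_COMMENT = /^\s*(#|$)/
# Assumed pattern for a line that opens a block scalar (ends in | or >).
BLOCK_OPEN    = /[:\-]\s*[|>][\-+0-9]*\s*$/

# Hypothetical helper; n, max_len and ratio are illustrative defaults.
def looks_like_yaml?(text, n: 50, max_len: 200, ratio: 0.8)
  return false unless text.valid_encoding?

  lines = text.lines.first(n)
  # Binary-file heuristics: non-printing characters or very long lines.
  return false if lines.any? { |l| l.chomp.match?(/[^[:print:]\t]/) || l.length > max_len }

  counted = matched = 0
  block_indent = nil
  lines.each do |line|
    next if line.match?(BLANK_COMMENT)
    indent = line[/^\s*/].length
    if block_indent
      next if indent > block_indent   # still inside the block scalar body
      block_indent = nil
    end
    counted += 1
    matched += 1 if line.match?(YAMLISH)
    block_indent = indent if line.match?(BLOCK_OPEN)
  end
  return false if counted.zero? || matched.fdiv(counted) < ratio

  # Finally, load it and rescue any parser error.
  YAML.safe_load(text)
  true
rescue Psych::Exception
  false
end
```

The cheap line checks run first so that obviously non-YAML input is rejected without invoking the parser at all.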


[1] The YAML spec, http://yaml.org/spec/current.html, says nonprinting
chars are encoded (see 4.1.1. Character Set), and it seems to be true,
at least in the dump output:

irb(main):023:0> puts({"a"=>"\002"}.to_yaml)

a: !binary |
  Ag==

However, YAML can load unescaped binary data, as Devin showed:

irb(main):025:0> YAML.load "a: \002"
=> {"a"=>"\002"}

[2] For example,
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/52548

Joel VanderWerf wrote:

There are probably a lot of corner cases that kill this approach if you
cannot tolerate false negatives (i.e., legit yaml that gets rejected by
the above).

yikes! if that’s what it takes then i must run away! :) i need
something snappy. actually it just occurred to me that as of YAML 1.1
the document declaration is mandatory. I had forgotten about that. So
checking for an initial line starting with %YAML would do the trick as
long as docs were 1.1 compliant, at least in this regard.
Unfortunately Syck itself isn’t 1.1 compliant in this respect
whatsoever :(

In the meantime I’m just going to go with ara’s suggestion. the use of
an initial '---' is an acceptable requirement for my needs.
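
A snappy check along those lines might look like the sketch below. The helper name `yaml_marker?` is hypothetical, and accepting a `%YAML` directive in addition to the `---` marker is an assumption that combines the two ideas mentioned above. The file is read in binary mode so that arbitrary bytes cannot trip the regexp match.

```ruby
# Hypothetical helper: true when the first non-blank line of the file is a
# document start marker (---) or a %YAML directive.
def yaml_marker?(fname)
  File.foreach(fname, mode: "rb") do |line|
    next if line.strip.empty?
    # Decide on the first non-blank line and stop reading.
    return line.match?(/\A(---(\s|\z)|%YAML\s)/)
  end
  false # empty or all-blank file
end
```

Only the start of the file is read, so this stays fast on large inputs, at the cost of rejecting valid YAML that omits the marker.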

t.

On Sun, 10 Dec 2006, Trans wrote:

what’s the best way to determine if a file is yaml?

thanks,
t.

in ruby queue i detect whether stdin input is a normal list or yaml in this
way:

if first_non_blank_line =~ %r/^\s*---\s*$/
  load_yaml_from_stdin
else
  process_line first_non_blank_line
  while((line = next_line[stdin]))
    process_line line
  end
end

not perfect, but it’s worked well enough so far

cheers.

also, from the command line i’ve taken to this approach

list_input_on_stdin = ARGV.delete '-'
yaml_input_on_stdin = ARGV.delete '---'

so, for example

cat.rb - # dump stdin

cat.rb --- # load the yaml doc on stdin and dump that

note that '--' is used to indicate the end of options so it is not a good
flag.

regards.

-a