Any command line tools for querying yaml files

luislavena · November 18, 2011, 2:29pm

(Sorry, this is not exactly a ruby question).

Today some Ruby command-line apps use YAML to store data. The
advantage of SQLite, or say a delimited file, was that one could query
the data from outside the application using SQL, or awk/grep/sed/cut for
delimited files.

With YAML, are there any generic tools to query data?

sentinel · November 18, 2011, 3:58pm

On 18 Nov 2011, at 13:29, Rahul K. wrote:

(Sorry, this is not exactly a ruby question).

Today some ruby command-line apps are using YAML to store data. The
advantage of using sqlite or say a delimited file was that outside one’s
application one could query the data using sql, or awk/grep/sed/cut for
delimited files.

With yaml, are there any generic tools to query data ?

It is just a text file so any of your suggestions (awk/grep/sed/cut)
will work.

Dave.

sentinel · November 18, 2011, 4:25pm

On Fri, Nov 18, 2011 at 3:58 PM, Dave B. wrote:

With yaml, are there any generic tools to query data ?

It is just a text file so any of your suggestions (awk/grep/sed/cut) will work.

Why not just read the file in an IRB session and traverse the object
graph?
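
A minimal sketch of what that could look like, assuming a small hypothetical document (the `library`/`books` layout and the `each_leaf` helper are made up for illustration; a real session would start from `YAML.load_file` on the actual file). It prints every path that leads to a scalar, which is one way to discover the structure before querying it:

```ruby
require "yaml"

# Hypothetical sample data standing in for a real YAML file.
LIBRARY_YAML = <<~EOY
  library:
    name: Example
    books:
      - title: Unix Basics
        author: A. Writer
        category: unix
      - title: Ruby Notes
        author: C. Coder
        category: ruby
EOY

# Recursively walk Hashes and Arrays, yielding the path and value of each leaf.
def each_leaf(node, path = [], &block)
  case node
  when Hash  then node.each { |key, value| each_leaf(value, path + [key], &block) }
  when Array then node.each_with_index { |value, i| each_leaf(value, path + [i], &block) }
  else block.call(path, node)
  end
end

data = YAML.load(LIBRARY_YAML)
each_leaf(data) { |path, value| puts "#{path.join('/')} = #{value}" }

# Once the layout is known, a single value can be reached directly.
puts data.dig("library", "books", 0, "title")
```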

Kind regards

robert

sentinel · November 18, 2011, 5:17pm

Robert K. wrote in post #1032544:

On Fri, Nov 18, 2011 at 3:58 PM, Dave B. wrote:

With yaml, are there any generic tools to query data ?

It is just a text file so any of your suggestions (awk/grep/sed/cut) will work.

Why not just read the file in an IRB session and traverse the object
graph?

Kind regards

robert

Some people store data in multiple YAML files. grep only gets me the
matching line, not the other fields of that file. “Greater than” or
similar queries are even more difficult.

I once did a Ruby command-line project in which I stored data in a
multi-row format. Great for inserts and for schema changes, but not great
to query, and equally bad for updating. I then changed it to a
delimited/fixed format. Great for query, insert and update, but horrible
when adding columns in the middle.
Finally, I did another version in SQLite.

So I have apprehensions about using YAML. I’d like to know about
generic tools for fast querying of YAML data.

sentinel · November 18, 2011, 8:01pm

On Fri, Nov 18, 2011 at 5:17 PM, R. Kumar wrote:

Robert K. wrote in post #1032544:

On Fri, Nov 18, 2011 at 3:58 PM, Dave B. wrote:

With yaml, are there any generic tools to query data ?

It is just a text file so any of your suggestions (awk/grep/sed/cut) will
work.

Why not just read the file in an IRB session and traverse the object
graph?

Some people store data in multiple yaml files. grep only gets me a
matching line. Not other fields of that file. Even more difficult for
“greater than” or similar queries.

What does that have to do with my posting?

I once did a ruby commandline project in which i stored data in a
multi-row format. Great for inserts and if schema changed. But not great
to query. Equally bad for updating. I then changed it to a
delimited/fixed format. Great for query, insert and update, but horrible
when adding columns inside.
Finally, I did another version in sqlite.

Again, what does it have to do with what I wrote?

Thus i have apprehensions about using yaml. I’d like to know about
generic tools for fast querying of yml data.

YAML stores object graphs, much the same way as an OO database. The
speed of querying depends dramatically on how the data is laid out. If
there is no indexing (think: Hash) then you might need full
traversals. Anyway, if the file’s contents fit into memory then
#select etc. should do pretty well. Maybe you can help us by stating what
specific kinds of queries you want to do.

In the meantime I am still inclined to believe that this can be done
pretty slick with IRB and regular Ruby code - maybe with some custom
additional methods.
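
To illustrate the difference between a full traversal and a Hash index, here is a rough sketch. The records, the field names and both helper methods are assumptions for the example, not something from an actual application:

```ruby
require "yaml"

# Hypothetical records; the field names are assumptions for illustration.
RECORDS = YAML.load(<<~EOY)
  - title: Unix Basics
    category: unix
    year: 2004
  - title: Shell Tricks
    category: unix
    year: 2009
  - title: Ruby Notes
    category: ruby
    year: 2010
EOY

# Full traversal: every record is inspected on every query.
def newer_than(records, year)
  records.select { |rec| rec["year"] > year }
end

# Hash index: built once, after which lookups by category skip the scan.
def index_by_category(records)
  records.group_by { |rec| rec["category"] }
end

puts newer_than(RECORDS, 2005).map { |rec| rec["title"] }.inspect

by_category = index_by_category(RECORDS)
puts by_category["unix"].map { |rec| rec["title"] }.inspect
```

The index only pays off when the same kind of lookup is repeated; for a one-off query on data that fits in memory, the plain `select` is usually enough.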

Kind regards

robert

sentinel · November 18, 2011, 5:09pm

Dave B. wrote in post #1032538:

It is just a text file so any of your suggestions (awk/grep/sed/cut)
will work.

Dave.

I mean YAML is multi-line. It can contain objects within objects. I would
have to write a large program using awk and others to parse it, which
makes the job more difficult. I would need to know YAML syntax to write that.

In an SQL table, I can get the table schema and then just run a query.

For YAML, is there a way to say “give me title and author where category
= ‘unix’”, etc., in some format? Are there tools that can read the
structure of any YAML file and let us say what we want, without having
to understand the YAML format and how to retrieve it?
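
For what it is worth, a query of that shape takes only a few lines of plain Ruby. The sketch below assumes the file holds a flat list of mappings; the `query` helper and the sample records are hypothetical. The caller still has to know the field names, so it does not remove the need to understand the layout:

```ruby
require "yaml"

# Hypothetical helper: keep the records whose fields equal the values in
# +where+, then project only the requested +fields+.
def query(records, fields, where = {})
  records
    .select { |rec| where.all? { |key, wanted| rec[key] == wanted } }
    .map    { |rec| rec.slice(*fields) }
end

# Hypothetical sample data standing in for a real YAML file.
books = YAML.load(<<~EOY)
  - title: Unix Basics
    author: A. Writer
    category: unix
  - title: Ruby Notes
    author: C. Coder
    category: ruby
EOY

# "give me title and author where category = 'unix'"
query(books, %w[title author], "category" => "unix").each do |row|
  puts row.values.join("\t")
end
```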

sentinel · November 18, 2011, 8:43pm

On 11/18/2011 08:09 AM, R. Kumar wrote:

For YAML, is there a way to say “give me title and author where category
= ‘unix’”, etc., in some format? Are there tools that can read the
structure of any YAML file and let us say what we want, without having
to understand the YAML format and how to retrieve it?

Does ypath help?

Never used it myself…

http://yaml4r.sourceforge.net/doc/page/parsing_yaml_documents.htm
https://www.ruby-toolbox.com/gems/ytools

sentinel · November 18, 2011, 8:10pm

What does that have to do with my posting?

Again, what does it have to do with what I wrote?

Sorry, you had quoted Dave who mentioned grep/awk, and I got mixed up on
the quoting. I was just explaining that grep and awk won’t work well in
such a case.

Since each record is often stored in a separate file, I felt that
reading it all into IRB might not be simple. I’ve seen one such application
perform very slowly when searching its own data.

sentinel · November 19, 2011, 11:18am

On Fri, Nov 18, 2011 at 8:42 PM, Joel VanderWerf wrote:

http://yaml4r.sourceforge.net/doc/page/parsing_yaml_documents.htm
https://www.ruby-toolbox.com/gems/ytools

Oh, interesting: XPath for YAML. Apparently it’s part of the std lib;
at least with 1.9.2 I did not need to install anything:

robert@fussel:~/projects/rbp-blog$ irb19 -r yaml
irb(main):001:0> players = YAML::parse( <<EOY )
irb(main):002:1" player:
irb(main):003:1"   - given: Sammy
irb(main):004:1"     family: Sosa
irb(main):005:1"   - given: Ken
irb(main):006:1"     family: Griffey
irb(main):007:1"   - given: Mark
irb(main):008:1"     family: McGwire
irb(main):009:1" EOY
=> #<Syck::Map:0x95943c8>
irb(main):010:0> players
=> #<Syck::Map:0x95943c8>
irb(main):011:0> players.class
=> Syck::Map
irb(main):012:0> players.select '//given'
=> [#<Syck::Scalar:0x95953f4>, #<Syck::Scalar:0x9594fe4>,
#<Syck::Scalar:0x9594738>]
irb(main):013:0> players.select('//given').transform
NoMethodError: undefined method `transform' for #<Array:0x98f5058>
        from (irb):13
        from /usr/local/bin/irb19:12:in `<main>'
irb(main):014:0> players.select('//given').map &:class
=> [Syck::Scalar, Syck::Scalar, Syck::Scalar]
irb(main):019:0> players.select('//given').map &:value
=> ["Sammy", "Ken", "Mark"]
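
The session above relies on Syck, which was removed from the standard library in Ruby 2.0; Psych, the current YAML engine, has no YPath-style `select`. A rough equivalent for the same document, assuming a Psych-based Ruby, is to load it into plain Hashes and Arrays and use ordinary Enumerable methods (the `given_names` helper is made up for this example):

```ruby
require "yaml"

PLAYERS_YAML = <<~EOY
  player:
    - given: Sammy
      family: Sosa
    - given: Ken
      family: Griffey
    - given: Mark
      family: McGwire
EOY

# Plain-Ruby stand-in for the '//given' YPath query: load the document and
# collect the "given" field of every entry under "player".
def given_names(yaml_text)
  YAML.load(yaml_text).fetch("player").map { |player| player["given"] }
end

p given_names(PLAYERS_YAML)  # prints ["Sammy", "Ken", "Mark"]
```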

Still, the issue with multiple files would hold. Rahul, I believe this
is your major problem: you have to define relationships between data
from different sources within the query, because there is no inherent
relationship between those files. It’s a bit like a federated
database approach for RDBMS: if you combine different databases which
know nothing of each other into a single database, you need a federation
layer which does all the SQL interpretation and joining between
different sources.

Bottom line: you’re better off having it all in one YAML file.
You could write a transformation routine which reads in all the
different sources, makes the connections and then allows you to query a
single object graph.
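
A minimal sketch of such a routine, assuming one record (a single mapping) per `*.yml` file in one directory. The `load_records` name, the `source` tag and the sample fields are hypothetical, and the sketch only concatenates records; any real relationships between files would still have to be wired up by hand:

```ruby
require "yaml"
require "tmpdir"

# Read every *.yml file in +dir+ and return one Array of records, tagging
# each record with the file it came from.
def load_records(dir)
  Dir.glob(File.join(dir, "*.yml")).sort.map do |path|
    YAML.load_file(path).merge("source" => File.basename(path))
  end
end

# Demo with two throwaway files in a temporary directory.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, "a.yml"), { "title" => "Unix Basics", "year" => 2004 }.to_yaml)
  File.write(File.join(dir, "b.yml"), { "title" => "Ruby Notes",  "year" => 2010 }.to_yaml)

  records = load_records(dir)
  recent  = records.select { |rec| rec["year"] > 2005 }  # a "greater than" query
  puts recent.map { |rec| rec["title"] }.inspect
end
```

Writing the combined Array back out with `to_yaml` would give the single file suggested above.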

Kind regards

robert

sentinel · November 19, 2011, 1:33pm

Thanks Robert.
The issue here is not just me. Some people are using a record per yaml
file approach. My issue is not just with me pulling up the file or files
in IRB and doing some transformations. Users of my application will find
it hard to query the data as they could if it were SQL or CSV.

At this moment, I think SQLite is the simplest way of storing the data. I
suppose if I really need to version the data, I could take a dump using
“.mode line” for the important columns and keep that in version control.

Still, I’ll check out the ypath and other tools mentioned. Thanks for
the inputs.

sentinel · November 19, 2011, 2:35pm

On Sat, Nov 19, 2011 at 1:34 PM, R. Kumar wrote:

Thanks Robert.
The issue here is not just me. Some people are using a record per yaml
file approach. My issue is not just with me pulling up the file or files
in IRB and doing some transformations. Users of my application will find
it hard to query the data as they could if it were SQL or CSV.

At this moment, I think SQLite is the simplest way of storing the data. I
suppose if I really need to version the data, I could take a dump using
“.mode line” for the important columns and keep that in version control.

If you are free to choose the storage (as it now seems) you could as
well add version information to your data (e.g. a modified timestamp
column). It’s not easy because there are some pitfalls but then you
can keep all the data in one location. If you use sqlite you can even
add proper indexes and views.

Of course, with CSV or other text based file formats a version control
system would be an easy way to hold the versions.

Still, I’ll check out the ypath and other tools mentioned. Thanks for
the inputs.

You’re welcome!

Kind regards

robert