Looking for patterns in a collection

aris · September 3, 2012, 12:52am

Greetings, i have the following (fragment olny) :

devices = [{d1 => [0,1,0,1,0,0,…]},
{d2 => [0,0,1,1,0,1,…]},
{d3 => [1,0,1,1,0,1,…]},
…]

Each hash represents device, the array of zeroes and ones representing
the presence or absence of a specific software executable on that
device. (array index signifies a unique identifier to the executable)

Scenario:
A list of desktop computers (~600) that includes all software
executables deployed on each machine. There more than 1,000 distinct
executables deployed across these devices.

I am trying to identify common patterns of software deployments, looking
for an algoritm that will arrange the current state into a limited,
manageable set of discrete application profiles.

No two devices are exactly alike, even after filtering out all the
‘noise’ - different versions, common operating system and driver files
etc. So the comparison must proceed with some element of configurable
tolerance.

I have had a look at the Vector and Matrix classes but haven’t found any
suitable built in method that would allow me to submit a suitably
prepared list(an array of arrays?, an array of vectors?, a matrix?)
which is then rationalised into a few recognisably discrete groups.

Are there any libraries out ther i could leverage? I am not a maths /
statistics guy?

Regards, Rainer

rainer · September 3, 2012, 10:57am

On Mon, Sep 3, 2012 at 12:52 AM, Rainer T. [email protected]
wrote:

So the comparison must proceed with some element of configurable
tolerance.

This seems to be the crucial point: you need to explicate what
“configurable tolerance” means. Only you can define that requirement.
Then someone might be able to answer the question:

Are there any libraries out ther i could leverage? I am not a maths /
statistics guy?

Kind regards

robert

rainer · September 3, 2012, 12:04pm

Robert K. wrote in post #1074405:

On Mon, Sep 3, 2012 at 12:52 AM, Rainer T. [email protected]
wrote:

So the comparison must proceed with some element of configurable
tolerance.

This seems to be the crucial point: you need to explicate what
“configurable tolerance” means. Only you can define that requirement.
Then someone might be able to answer the question:

Thanks Robert. I am trying to say that i don’t expect any perfect
matches, so the comparison should accommodate some (but not too much)
variation, i.e. identify ‘approximate’ matches. I expect i will have to
use a trial and error approach (hence configurable) to find an
appropriate “tolerance level”.

Say for example i have the following four entries:

{d1 => [‘zero’,‘one’,‘two’,‘three’,‘four’,‘five’,‘six’]}
{d2 => [‘zero’,‘one’,‘‘,‘three’,‘four’,‘five’,‘six’]}
{d3 => [’‘,‘one’,‘two’,‘three’,‘four’,’‘,’’]}
{d4 => [‘zero’,‘one’,‘two’,‘three’,‘four’,‘five’,‘six’]}

Say i define “tolerance level” as the maximum number of differences
allowed before the comparison returns a mismatch, then:

tolerance level 0 will match d1 and d4,
tolerance level 1 will match d1, d2 and d4,
tolerance level 2 will match d1, d2 and d4,
tolerance level 3 will match all of them.

Regards, Rainer

rainer · September 3, 2012, 12:25pm

Alex G. wrote in post #1074412:

On 02.09.2012 23:52, Rainer T. wrote:
A common distance metric for sets is the Jacquard distance
(Jaccard index - Wikipedia). In your case each
collection of executables is a set. Implementing it in Ruby is quite
easy using the built in Array union and intersection operators:

Alex, many thanks for this. The Jaccard index looks like an answer.
Thank you also for the worked example…

Once you have calculated the distance between all machines in this way
you can cluster them however you like. See
(http://colinfdrake.com/2011/05/28/clustering-in-ruby.html) for an
example of k-means clustering in Ruby. Another possibility is to
generate a graph representation of your data and visualise with
something like GraphViz (GitHub - glejeune/Ruby-Graphviz: [MIRROR] Ruby interface to the GraphViz graphing tool).

… and these further references.

Cheers, Rainer

rainer · September 3, 2012, 11:58am

On 02.09.2012 23:52, Rainer T. wrote:

device. (array index signifies a unique identifier to the executable)

which is then rationalised into a few recognisably discrete groups.

Are there any libraries out ther i could leverage? I am not a maths /
statistics guy?

Regards, Rainer

A common distance metric for sets is the Jacquard distance
(Jaccard index - Wikipedia). In your case each
collection of executables is a set. Implementing it in Ruby is quite
easy using the built in Array union and intersection operators:

a = [1,0,0,1,0,1]
b = [1,0,1,1,0,1]
c = [0,1,0,0,1,0]

a_idx = []
a.each_with_index{|x,i| x == 1 ? a_idx << i : nil}
b_idx = []
b.each_with_index{|x,i| x == 1 ? b_idx << i : nil}
c_idx = []
c.each_with_index{|x,i| x == 1 ? c_idx << i : nil}

#Distance from a to b is 0.25
((a_idx | b_idx).size.to_f - (a_idx & b_idx).size) / (a_idx |
b_idx).size
#Distance from a to c is 1
((a_idx | c_idx).size.to_f - (a_idx & c_idx).size) / (a_idx |
c_idx).size

Once you have calculated the distance between all machines in this way
you can cluster them however you like. See
(http://colinfdrake.com/2011/05/28/clustering-in-ruby.html) for an
example of k-means clustering in Ruby. Another possibility is to
generate a graph representation of your data and visualise with
something like GraphViz (GitHub - glejeune/Ruby-Graphviz: [MIRROR] Ruby interface to the GraphViz graphing tool).