Grouping results by similar names

I’ve got a question that’s more to do with good old fashioned
language-agnostic computer science as it is to do with ruby (however,
i’m using ruby on rails so it’s being asked on this forum).

I have several groups of results, with group size varying from 10ish to
250ish. Within each group, i want to make subgroups according to
similar names: for example, one of my groups looks like this: the
first part, “viola”, is the name of the group, the second the name of
the item:

viola Viola - Goija playing Bach
viola Viola - Goija playing Handel - olga
viola Viola - Goija playing Handel - olgagoija
viola Viola - Goija playing Handel Sonata VI (allegro) - olga
viola Viola - Goija playing Handel Sonata VI (largo) - olga
viola Viola - Goija playing Handel Sonata VI (last movement) - olga
viola Viola - Goija playing Schnittke (part 1 - Largo)
viola Viola - Goija playing Schnittke (part 2 - Allegro Molto) - olga
viola Viola - Goija playing Schnittke (part 3 - Largo) olga
viola Viola - Goija playing Schumann (Lebhaft) - olga
viola Viola - Goija playing Schumann (Nicht schnell) - olga
viola Viola - Goija playing Shostakovich - bit 1
viola Viola - Olga Goija playing Shulman (1)
viola Viola - Olga Goija playing Shulman (2)
viola Viola - Olga Goija playing Shulman (final)

In this case, i would want to make 13 groups, based on similar names,
like so:

viola Viola - Goija playing Bach

viola Viola - Goija playing Handel - olga
viola Viola - Goija playing Handel - olgagoija

viola Viola - Goija playing Handel Sonata VI (allegro) - olga
viola Viola - Goija playing Handel Sonata VI (largo) - olga
viola Viola - Goija playing Handel Sonata VI (last movement) - olga

viola Viola - Goija playing Schnittke (part 1 - Largo)
viola Viola - Goija playing Schnittke (part 2 - Allegro Molto) - olga
viola Viola - Goija playing Schnittke (part 3 - Largo) olga

viola Viola - Goija playing Schumann (Lebhaft) - olga

viola Viola - Goija playing Schumann (Nicht schnell) - olga

viola Viola - Goija playing Shostakovich - bit 1

viola Viola - Olga Goija playing Shulman (1)
viola Viola - Olga Goija playing Shulman (2)
viola Viola - Olga Goija playing Shulman (final)

My first question is, what’s the best way to divide this set into
groups? They are already all ferret-indexed, so i could do fuzzy ferret
searches. One thing i was thinking was as follows (pseudocode):

for each item
for every member of every group
if the string matches according to some fixed similarity criteria
put the item in that group
matched = true
end
end
if not matched
put the item in a new group
end
end

The problem with this is deciding the fixed similarity criteria: it
might be better to do something flexible, like

for every pair of items in the group (ie size * size-1 times)
get similarity_rating and put it in a 2d array
end

then, analyze the array and group the highest scoring elements for each
row together (somehow).

Any thoughts?

Max:

Your second approach is not only more flexible, it has also
the advantage of clearly structuring your problems into
two different steps.

Step 1, defining a similarity_rating: Given the complexity of
your input data, you should start with something as simple
as a count of matching words, and then gradually refine.

As result, step 1 gives you a quadratic, symmetric matrix
that can be treated as if containing covariances.

Step 2, cluster variables according to their covariance matrix:
While there is no unique solution, there are several well
established procdures. Scan books on multivariate data analysis
for “cluster analysis” or similar.

Below, I add some code which I wrote a while ago for my own use -
take it for what it’s worth. A result of such a cluster analysis
is shown in messen-und-deuten.de - messen und deuten Resources and Information.,
page 73.

Good luck, Joachim

calculate ng*(ng-1)/2 correlation coefficients:

mg = gv.size-1
xcorr = (0…mg).collect{ (0…mg).collect { 0.0 } }
(0…mg).each do |jg|
print “correlations for #{gv[jg]}:” if verbose
(0…jg-1).each do |ig|
xcorr[jg][ig] = calc_xcorr jg, ig, vv, nq
printf " %4.2f", xcorr[jg][ig]
end
print “\n”
end

determine closest correlation and confound the two owners:

while mg>0
# determine closest correlation:
xmin = 0
imin = nil
jmin = nil
(0…mg).each do |jg|
(0…jg-1).each do |ig|
if xcorr[jg][ig]>xmin
xmin = xcorr[jg][ig]; imin = ig; jmin = jg
end
end
end
printf “%6.4f join #{gv[imin]} and #{gv[jmin]}\n”, xmin

# confound into imin:
(0..nq-1).each do |kq|
    y0 = vv[imin][kq]
    y1 = vv[jmin][kq]
    vv[imin][kq] =
        if y0<0
            y1
        elsif y1<0
            y0
        else
            (nv[imin]*y0+nv[jmin]*y1)/(nv[imin]+nv[jmin])
        end
end
nv[imin] = nv[imin]+nv[jmin]
gv[imin] = gv[imin]+"&"+gv[jmin]

# eliminate jmin:
gv.delete_at jmin
nv.delete_at jmin
vv.delete_at jmin
xcorr.delete_at jmin
xcorr.each do |xv|
    xv.delete_at jmin
end

# recalculate correlations with imin:
mg -= 1
( 0..imin-1).each{ |ig| xcorr[imin][ig] = calc_xcorr imin, ig, vv,

nq }
(imin+1…mg).each{ |jg| xcorr[jg][imin] = calc_xcorr jg, imin, vv,
nq }
end