I’ve got a question that has more to do with good old-fashioned,
language-agnostic computer science than with Ruby (however, I’m using
Ruby on Rails, so I’m asking it on this forum).
I have several groups of results, with group sizes varying from about 10
to about 250. Within each group, I want to make subgroups according to
similar names. For example, one of my groups looks like this (the first
part, “viola”, is the name of the group; the second is the name of the
item):
viola Viola - Goija playing Bach
viola Viola - Goija playing Handel - olga
viola Viola - Goija playing Handel - olgagoija
viola Viola - Goija playing Handel Sonata VI (allegro) - olga
viola Viola - Goija playing Handel Sonata VI (largo) - olga
viola Viola - Goija playing Handel Sonata VI (last movement) - olga
viola Viola - Goija playing Schnittke (part 1 - Largo)
viola Viola - Goija playing Schnittke (part 2 - Allegro Molto) - olga
viola Viola - Goija playing Schnittke (part 3 - Largo) olga
viola Viola - Goija playing Schumann (Lebhaft) - olga
viola Viola - Goija playing Schumann (Nicht schnell) - olga
viola Viola - Goija playing Shostakovich - bit 1
viola Viola - Olga Goija playing Shulman (1)
viola Viola - Olga Goija playing Shulman (2)
viola Viola - Olga Goija playing Shulman (final)
In this case, I would want to make 13 groups, based on similar names,
like so:
viola Viola - Goija playing Bach
viola Viola - Goija playing Handel - olga
viola Viola - Goija playing Handel - olgagoija
viola Viola - Goija playing Handel Sonata VI (allegro) - olga
viola Viola - Goija playing Handel Sonata VI (largo) - olga
viola Viola - Goija playing Handel Sonata VI (last movement) - olga
viola Viola - Goija playing Schnittke (part 1 - Largo)
viola Viola - Goija playing Schnittke (part 2 - Allegro Molto) - olga
viola Viola - Goija playing Schnittke (part 3 - Largo) olga
viola Viola - Goija playing Schumann (Lebhaft) - olga
viola Viola - Goija playing Schumann (Nicht schnell) - olga
viola Viola - Goija playing Shostakovich - bit 1
viola Viola - Olga Goija playing Shulman (1)
viola Viola - Olga Goija playing Shulman (2)
viola Viola - Olga Goija playing Shulman (final)
My first question is: what’s the best way to divide this set into
groups? The items are already all ferret-indexed, so I could do fuzzy
ferret searches. One thing I was thinking of was as follows (pseudocode):
for each item
  matched = false
  for every member of every group
    if the item's name matches the member's name according to some fixed similarity criterion
      put the item in that group
      matched = true
      stop checking the remaining members
    end
  end
  if not matched
    put the item in a new group
  end
end
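To make the loop above concrete, here is a rough Ruby sketch of it. The similarity measure (a Dice coefficient on character bigrams) and the names bigrams, similarity and group_by_similarity are my own placeholders, not anything Ferret provides, and the threshold would need tuning against real data.

```ruby
require 'set'

# Set of adjacent character pairs in a string, ignoring case.
def bigrams(str)
  str.downcase.each_char.each_cons(2).map(&:join).to_set
end

# Dice coefficient on bigram sets: 0.0 (nothing shared) to 1.0 (same bigrams).
# This measure is an assumption for illustration only.
def similarity(a, b)
  x = bigrams(a)
  y = bigrams(b)
  return 1.0 if x.empty? && y.empty?
  2.0 * (x & y).size / (x.size + y.size)
end

# Greedy single pass: an item joins the first group that has a member
# similar enough to it, otherwise it starts a new group.
def group_by_similarity(items, threshold)
  groups = []
  items.each do |item|
    home = groups.find do |group|
      group.any? { |member| similarity(item, member) >= threshold }
    end
    if home
      home << item
    else
      groups << [item]
    end
  end
  groups
end
```

Calling group_by_similarity(names, 0.8) would return an array of arrays of names. Note that the result depends on the order of the items, since each item only sees the groups built so far.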
The problem with this is choosing the fixed similarity criterion. It
might be better to do something more flexible, like:
for every ordered pair of items in the group (ie size * (size - 1) pairs)
  get similarity_rating and put it in a 2d array
end
Then, analyze the array and group together the highest-scoring elements
in each row (somehow).
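One possible reading of the "somehow", as a rough Ruby sketch: fill the 2d array, link each item to its single best-scoring partner if that score clears a minimum, and treat the resulting chains of links as groups. The names similarity_matrix, group_by_best_match and word_overlap are hypothetical, and the word-overlap scorer is only a stand-in for whatever rating is actually used. Since the rating here is symmetric, each unordered pair is scored once and mirrored.

```ruby
# Score every unordered pair once with the given block and mirror the
# result into a 2D array (the diagonal is left at 0.0).
def similarity_matrix(items)
  n = items.size
  matrix = Array.new(n) { Array.new(n, 0.0) }
  (0...n).to_a.combination(2) do |i, j|
    matrix[i][j] = matrix[j][i] = yield(items[i], items[j])
  end
  matrix
end

# Link each item to its highest-scoring partner (if that score reaches
# min_score), then return the connected sets of linked items as groups.
def group_by_best_match(items, matrix, min_score)
  parent = (0...items.size).to_a
  find = lambda do |i|
    i = parent[i] while parent[i] != i
    i
  end
  matrix.each_with_index do |row, i|
    best = (0...row.size).reject { |j| j == i }.max_by { |j| row[j] }
    next if best.nil? || row[best] < min_score
    parent[find.call(i)] = find.call(best)
  end
  by_root = items.each_index.group_by { |i| find.call(i) }
  by_root.values.map { |indexes| indexes.map { |i| items[i] } }
end

# Hypothetical scorer: Jaccard overlap of lowercase words.
def word_overlap(a, b)
  x = a.downcase.split
  y = b.downcase.split
  union = (x | y).size
  union.zero? ? 1.0 : (x & y).size.to_f / union
end
```

Usage would be something like matrix = similarity_matrix(names) { |a, b| word_overlap(a, b) } followed by group_by_best_match(names, matrix, 0.5). Unlike the single-pass loop, this does not depend on item order, but it still needs a minimum score so that an item with no good match stays on its own.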
Any thoughts?