I've got a question that has more to do with good old-fashioned
language-agnostic computer science than with Ruby (however,
I'm using Ruby on Rails, so I'm asking it on this forum).

I have several groups of results, with group size varying from 10ish to
250ish.  Within each group, I want to make subgroups according to
similar names.  For example, one of my groups looks like this (the
first part, "viola", is the name of the group; the second is the name of
the item):

viola   Viola - Goija playing Bach
viola   Viola - Goija playing Handel - olga
viola   Viola - Goija playing Handel - olgagoija
viola   Viola - Goija playing Handel Sonata VI (allegro) - olga
viola   Viola - Goija playing Handel Sonata VI (largo) - olga
viola   Viola - Goija playing Handel Sonata VI (last movement) - olga
viola   Viola - Goija playing Schnittke (part 1 - Largo)
viola   Viola - Goija playing Schnittke (part 2 - Allegro Molto) - olga
viola   Viola - Goija playing Schnittke (part 3 - Largo) olga
viola   Viola - Goija playing Schumann (Lebhaft) - olga
viola   Viola - Goija playing Schumann (Nicht schnell) - olga
viola   Viola - Goija playing Shostakovich - bit 1
viola   Viola - Olga Goija playing Shulman (1)
viola   Viola - Olga Goija playing Shulman (2)
viola   Viola - Olga Goija playing Shulman (final)

In this case, I would want to make 8 groups, based on similar names,
like so:

viola   Viola - Goija playing Bach

viola   Viola - Goija playing Handel - olga
viola   Viola - Goija playing Handel - olgagoija

viola   Viola - Goija playing Handel Sonata VI (allegro) - olga
viola   Viola - Goija playing Handel Sonata VI (largo) - olga
viola   Viola - Goija playing Handel Sonata VI (last movement) - olga

viola   Viola - Goija playing Schnittke (part 1 - Largo)
viola   Viola - Goija playing Schnittke (part 2 - Allegro Molto) - olga
viola   Viola - Goija playing Schnittke (part 3 - Largo) olga

viola   Viola - Goija playing Schumann (Lebhaft) - olga

viola   Viola - Goija playing Schumann (Nicht schnell) - olga

viola   Viola - Goija playing Shostakovich - bit 1

viola   Viola - Olga Goija playing Shulman (1)
viola   Viola - Olga Goija playing Shulman (2)
viola   Viola - Olga Goija playing Shulman (final)

My first question is, what's the best way to divide this set into
groups?  They are already all Ferret-indexed, so I could do fuzzy Ferret
searches.  One thing I was thinking of was as follows (pseudocode):

for each item
  matched = false
  for every member of every group
    if the string matches according to some fixed similarity criteria
      put the item in that group
      matched = true
      stop searching (so the item lands in only one group)
    end
  end
  if not matched
    put the item in a new group
  end
end
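
To make the loop above concrete, here is a minimal Ruby sketch. The
similarity measure is purely an assumption on my part: word_similarity
is a hypothetical helper that scores two names by the overlap of their
word tokens (Jaccard), and the 0.6 threshold is an arbitrary starting
point, not a recommended value. Any other measure (edit distance, a
Ferret fuzzy score) could be dropped in instead.

```ruby
require 'set'

# Hypothetical helper (an assumption, not from any library):
# Jaccard overlap of lowercase word tokens, from 0.0 to 1.0.
def word_similarity(a, b)
  ta = a.downcase.scan(/\w+/).to_set
  tb = b.downcase.scan(/\w+/).to_set
  union = (ta | tb).size
  union.zero? ? 1.0 : (ta & tb).size.to_f / union
end

# Greedy grouping: each item joins the first existing group that
# contains a sufficiently similar member, otherwise starts a new group.
def greedy_groups(items, threshold: 0.6)
  groups = []
  items.each do |item|
    home = groups.find do |group|
      group.any? { |member| word_similarity(item, member) >= threshold }
    end
    if home
      home << item
    else
      groups << [item]
    end
  end
  groups
end
```

Calling greedy_groups(names) returns an array of arrays of names. Note
that the result depends on the order of the items and on the threshold,
which is exactly the weakness of a fixed criterion.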

The problem with this is deciding the fixed similarity criteria: it
might be better to do something flexible, like

for every pair of items in the group (ie size * (size - 1) ordered pairs, or half that if similarity is symmetric)
  get similarity_rating and put it in a 2d array
end

then, analyze the array and group the highest scoring elements for each
row together (somehow).
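
One possible reading of the "somehow", as a rough sketch only: score
every pair once, then link each item to its best-scoring neighbour and
treat the linked clusters as groups. Everything named here is an
assumption for illustration: jaccard is a hypothetical word-overlap
score, and the floor parameter is an arbitrary cut-off that stops
completely unrelated items from being linked.

```ruby
require 'set'

# Hypothetical scoring function (an assumption): word-token overlap.
def jaccard(a, b)
  ta = a.downcase.scan(/\w+/).to_set
  tb = b.downcase.scan(/\w+/).to_set
  union = (ta | tb).size
  union.zero? ? 1.0 : (ta & tb).size.to_f / union
end

# Fill a symmetric 2d array, scoring each unordered pair only once.
def similarity_matrix(items)
  n = items.size
  matrix = Array.new(n) { Array.new(n, 0.0) }
  (0...n).to_a.combination(2).each do |i, j|
    matrix[i][j] = matrix[j][i] = jaccard(items[i], items[j])
  end
  matrix
end

# Link each item to the highest-scoring other item in its row (if the
# score reaches the floor), then return the linked clusters as groups.
def group_by_best_match(items, floor: 0.5)
  matrix = similarity_matrix(items)
  parent = (0...items.size).to_a # simple union-find forest
  find = lambda do |i|
    i = parent[i] while parent[i] != i
    i
  end
  matrix.each_with_index do |row, i|
    best = (0...row.size).reject { |j| j == i }.max_by { |j| row[j] }
    next if best.nil? || row[best] < floor
    parent[find.call(i)] = find.call(best)
  end
  (0...items.size).group_by { |i| find.call(i) }
                  .values
                  .map { |indexes| indexes.map { |i| items[i] } }
end
```

With 250 items this is about 31,000 comparisons per group, which should
be manageable, though I have not measured it.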

Any thoughts?
-- 
Posted via http://www.ruby-forum.com/.