On Mon, Oct 6, 2008 at 1:03 AM, Jan Fischer <janfischer / mail.org> wrote:
> At the moment I compare two strings by making them lowercase, deleting dots
> etc., deleting the substrings 'Inc', 'Ltd' etc. and then building the
> Levenshtein-Distance of the metaphone-key of the two strings.
> Works not really good and is damn slow, but it's okay and best I could
> figure out. (Nevertheless your hints on that are welcome too.)
>
> My problem is, that I don't know how to apply my compare-method in an
> efficient way. What I'm doing now is selecting the first row where Group is
> NULL and then selecting each row (where Group is NULL) in my database in a
> loop again, comparing with the first selected string and setting Group to a
> certain number if comapare method says they match.

The problem with relying on a compare function for grouping is that it
leads naturally to an O(n^2) solution. Ideally, what you want is some
sort ofo "characteristic" function, so that two entries with the same
characteristic are likely to be in the same grouping. Of course,
coming up with the characteristic is not always easy, and may not even
be possible, but it's a useful direction to think in. Look up fuzzy
searching techniques, too, and see if they have anything adaptable.

martin