2008/10/6 Martin DeMello <martindemello / gmail.com>:
> On Mon, Oct 6, 2008 at 1:03 AM, Jan Fischer <janfischer / mail.org> wrote:
>> At the moment I compare two strings by making them lowercase, deleting dots
>> etc., deleting the substrings 'Inc', 'Ltd' etc. and then building the
>> Levenshtein-Distance of the metaphone-key of the two strings.
>> Works not really good and is damn slow, but it's okay and best I could
>> figure out. (Nevertheless your hints on that are welcome too.)
>>
>> My problem is, that I don't know how to apply my compare-method in an
>> efficient way. What I'm doing now is selecting the first row where Group is
>> NULL and then selecting each row (where Group is NULL) in my database in a
>> loop again, comparing with the first selected string and setting Group to a
>> certain number if comapare method says they match.
>
> The problem with relying on a compare function for grouping is that it
> leads naturally to an O(n^2) solution. Ideally, what you want is some
> sort ofo "characteristic" function, so that two entries with the same
> characteristic are likely to be in the same grouping. Of course,
> coming up with the characteristic is not always easy, and may not even
> be possible, but it's a useful direction to think in. Look up fuzzy
> searching techniques, too, and see if they have anything adaptable.

Also, look into your database's feature set. Most modern RDBMS come
with text searching capabilities and dealing with large volumes of
data is where those beasts excel.  If such a characteristic function
can be found chances are that you can implement it as function in the
DB and resolve the issue via database queries / updates.

Kind regards

robert

-- 
remember.guy do |as, often| as.you_can - without end