Martin DeMello wrote:
> On Mon, Oct 6, 2008 at 12:12 PM, Ragav Satish <ragavsatish / gmail.com> 
> wrote:
>> 4. There are more complicated blocking schemes - like n-gram(chunk by N
>> common continuous characters) or even sampling based ones.
> 
> n-grams were my first thought - indeed, I'd started writing out a
> bigram-based scheme, then I realised that it'd fail badly if there was
> a common word like "systems" or "computers" that a lot of the entries
> had. Maybe some sort of multipass scheme to first reduce each entry to
> a characteristic word, then do n-gram frequency analysis on those
> words (my idea was this: pass 1: make up a frequency table of bigrams,
> pass 2: characterise each entry by the presence/multiplicity of the
> six or so most common ones)
> 
> martin

Yes indeed. The choice of what blocking scheme to use is largely 
dependent on what kinds of key variations are expected and the size of 
the database.
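
Just to make sure I'm reading your two-pass idea right, in rough Ruby 
it would be something like this (the "six most common" cutoff is from 
your description; blocking on identical signatures is my guess at how 
you'd use it):

  # pass 1: bigram frequency table over all entries
  def bigram_counts(entries)
    counts = Hash.new(0)
    entries.each do |e|
      s = e.downcase.gsub(/\s+/, '')
      (0..s.length - 2).each { |i| counts[s[i, 2]] += 1 }
    end
    counts
  end

  # pass 2: signature = multiplicity of the six most common bigrams,
  # then group entries with identical signatures into one block
  def signature_blocks(entries)
    top = bigram_counts(entries).sort_by { |_, n| -n }.first(6)
    top = top.map { |bg, _| bg }
    entries.group_by do |e|
      s = e.downcase.gsub(/\s+/, '')
      top.map { |bg| s.scan(bg).size }
    end
  end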

1. As a first stage of normalization, the OP mentioned he was stripping 
off INC, LLC, Limited and the like. As you mentioned earlier, it might 
be possible to block on a single token, so generic second tokens like 
Systems, Computers etc. can be dropped during blocking and only come 
into play in Phase II, when records within a block are compared with a 
more exact scheme like Levenshtein distance.
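
Something along these lines (the suffix list, the generic-word list 
and the use of the text gem for Levenshtein are just my assumptions):

  require 'text'   # gem install text, for Text::Levenshtein

  SUFFIXES = /\b(inc|llc|ltd|limited|corp)\.?\s*\z/i
  GENERIC  = %w[systems computers solutions technologies]

  # strip legal suffixes, lowercase, squeeze whitespace
  def normalize(name)
    name.downcase.gsub(SUFFIXES, '').strip.squeeze(' ')
  end

  # Phase I blocking key: first non-generic token
  def block_key(name)
    (normalize(name).split - GENERIC).first.to_s
  end

  # Phase II: exact comparison of records inside one block
  def similar?(a, b, max_dist = 2)
    Text::Levenshtein.distance(normalize(a), normalize(b)) <= max_dist
  end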

2. Before we even go to a bigram scheme, simple keys like Soundex, 
first-two-character truncation, NYSIIS etc. should be tried.
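
For example, reusing the normalize helper above (Soundex comes from 
the text gem; NYSIIS would need its own implementation, I don't know 
of a Ruby gem for it):

  require 'text'

  # Soundex of the first token
  def soundex_key(name)
    Text::Soundex.soundex(normalize(name).split.first.to_s)
  end

  # first-two-character truncation
  def prefix_key(name)
    normalize(name)[0, 2]
  end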

3. Better yet, two or more of these can be applied in multiple passes, 
with the blocks built from the union of the blocks produced by each 
method.
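
Roughly, building on the key functions above (the union is taken over 
candidate pairs, so a pair only needs to land in the same block in one 
of the passes):

  require 'set'

  KEY_FNS = [method(:soundex_key), method(:prefix_key)]

  # one blocking pass: group records by key
  def blocks_for(records, key_fn)
    records.group_by { |r| key_fn.call(r) }.values
  end

  # union of candidate pairs over all passes
  def candidate_pairs(records)
    pairs = Set.new
    KEY_FNS.each do |key_fn|
      blocks_for(records, key_fn).each do |block|
        block.combination(2) { |a, b| pairs << [a, b].sort }
      end
    end
    pairs
  end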

4. If bigrams are used, then I would go with full bigram indexing (not 
just the six most common bigrams), with a threshold value and an 
inverted index:
http://datamining.anu.edu.au/publications/2003/kdd03-3pages.pdf (sec 
2.1). The paper describes Python code, so it could probably be 
converted to Ruby without much trouble.
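
A simplified sketch of the inverted-index idea (not the full sub-list 
scheme from sec 2.1, just record pairs that share at least a threshold 
fraction of their bigrams):

  require 'set'

  # unique bigrams of a name, whitespace removed
  def bigrams(s)
    s = s.downcase.gsub(/\s+/, '')
    (0..s.length - 2).map { |i| s[i, 2] }.uniq
  end

  # inverted index: bigram -> set of record ids
  def bigram_index(records)
    index = Hash.new { |h, k| h[k] = Set.new }
    records.each_with_index do |rec, id|
      bigrams(rec).each { |bg| index[bg] << id }
    end
    index
  end

  # candidate pairs: share >= threshold of the smaller bigram set
  def bigram_candidates(records, threshold = 0.8)
    shared = Hash.new(0)
    bigram_index(records).each_value do |ids|
      ids.to_a.sort.combination(2) { |a, b| shared[[a, b]] += 1 }
    end
    shared.select do |(a, b), count|
      min = [bigrams(records[a]).size, bigrams(records[b]).size].min
      count >= (threshold * min).ceil
    end.map { |pair, _| pair }
  end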

--Cheers
--Ragav
