Martin DeMello wrote:
> On Mon, Oct 6, 2008 at 12:12 PM, Ragav Satish <ragavsatish / gmail.com>
> wrote:
>> 4. There are more complicated blocking schemes - like n-gram(chunk by N
>> common continuous characters) or even sampling based ones.
>
> n-gram were my first thought - indeed, I'd started writing out a
> bigram-based scheme, then I realised that it'd fail badly if there was
> a common word like "systems" or "computers" that a lot of the entries
> had. Maybe some sort of multipass scheme to first reduce each entry to
> a characteristic word, then do n-gram frequency analysis on those
> words (my idea was this: pass 1: make up a frequency table of bigrams,
> pass 2: characterise each entry by the presence/multiplicity of the
> six or so most common ones)
>
> martin

Yes indeed. The choice of blocking scheme depends largely on the kinds of key variation expected and on the size of the database.

1. As a first stage of normalization, the OP mentioned he was stripping off INC, LLC, Limited and the like. As you mentioned earlier, it might be possible to block on a single token, so a second token like Systems or Computers can be stripped off during blocking and applied only in Phase II, when records within a block are compared using a more exact scheme like Levenshtein distance.

2. Before we even go to a bigram scheme, simple schemes like Soundex, first-two-character truncation, NYSIIS etc. should be tried.

3. Better yet, two or more of these should be applied in multiple passes, with the final blocks built from the union of the blocks produced by each method.

4. If bigrams are used, I would go with full bigram indexing (not just the six most common), with a threshold value and reverse indexes:
http://datamining.anu.edu.au/publications/2003/kdd03-3pages.pdf (sec 2.1)
(The paper talks about Python code, so it could possibly be converted to Ruby fairly easily.)

--Cheers
--Ragav

--
Posted via http://www.ruby-forum.com/.
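A rough Ruby sketch of points 1 to 3 above, in case it helps make the idea concrete. This is only an illustration under assumptions: the suffix and generic-word lists, the method names (normalize, blocking_token, build_blocks, candidate_pairs, likely_duplicates) and the distance cutoff of 2 are all made up for the example, and the Soundex here is a plain hand-rolled American Soundex rather than anything taken from the paper. It blocks on the first non-generic token using both Soundex and two-character truncation, takes the union of the resulting blocks, and only then compares the full normalized names within a block by Levenshtein distance.

```ruby
# Illustrative sketch only; all names and word lists below are assumptions.
SUFFIXES = %w[inc llc ltd limited corp co].freeze # legal suffixes dropped in normalization
GENERIC  = %w[systems computers].freeze           # common tokens ignored for blocking only

SOUNDEX_CODES = {
  "b" => "1", "f" => "1", "p" => "1", "v" => "1",
  "c" => "2", "g" => "2", "j" => "2", "k" => "2",
  "q" => "2", "s" => "2", "x" => "2", "z" => "2",
  "d" => "3", "t" => "3", "l" => "4",
  "m" => "5", "n" => "5", "r" => "6"
}.freeze

# Lowercase, drop punctuation, remove legal suffixes; returns an array of tokens.
def normalize(name)
  name.downcase.gsub(/[^a-z0-9\s]/, "").split.reject { |t| SUFFIXES.include?(t) }
end

# First token that is not a generic word; used only to build blocking keys.
def blocking_token(name)
  tokens = normalize(name)
  tokens.find { |t| !GENERIC.include?(t) } || tokens.first || ""
end

# Plain American Soundex: first letter plus three digits.
def soundex(word)
  letters = word.downcase.gsub(/[^a-z]/, "").chars
  return "" if letters.empty?
  code = letters.first.upcase
  prev = SOUNDEX_CODES[letters.first]
  letters.drop(1).each do |ch|
    digit = SOUNDEX_CODES[ch]
    code << digit if digit && digit != prev
    prev = digit unless %w[h w].include?(ch) # h and w do not separate equal codes
  end
  code[0, 4].ljust(4, "0")
end

# Classic dynamic-programming edit distance, keeping one row at a time.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Each record goes into one block per scheme; key prefixes keep the schemes apart.
def build_blocks(names)
  blocks = Hash.new { |h, k| h[k] = [] }
  names.each_with_index do |name, idx|
    token = blocking_token(name)
    next if token.empty?
    blocks["sdx:#{soundex(token)}"] << idx
    blocks["pre:#{token[0, 2]}"] << idx
  end
  blocks
end

# Union of the pairs produced by every block, as index pairs into names.
def candidate_pairs(names)
  build_blocks(names).values.flat_map { |ids| ids.combination(2).to_a }.uniq
end

# Phase II: compare full normalized names, but only within candidate pairs.
def likely_duplicates(names, max_distance = 2)
  candidate_pairs(names).select do |i, j|
    levenshtein(normalize(names[i]).join(" "), normalize(names[j]).join(" ")) <= max_distance
  end
end

names = ["Acme Systems Inc", "Akme Systems LLC", "Acme Computers Ltd", "Zenith Computers"]
p likely_duplicates(names) # => [[0, 1]]
```

Note that "Acme Systems" and "Acme Computers" land in the same block (the generic second token is ignored while blocking) but are rejected in Phase II, since the full names are compared there. NYSIIS or a bigram index could be added as further passes by writing more keys into the same hash.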