At the risk of being off-topic or uninformed, I guess this

> --------------------------------------------
> edits2 = (edits1(word).collect {|i| edits1(i) & NWORDS }).flatten
> --------------------------------------------

is overkill. The thing is that there are WAY too many off-by-two
mistakes but too few words to be worth that computation.
Moreover, most of the "corrections" are nonsense: "xx" "zx"
"hh" "jk" "bb" almost surely. This is where dictionary metrics
must have come into being, I think. (I have not the original post,
sorry, so this may be out of place).

Take into account that for a 4-letter word like "riot" you
are letting the user have up to 2 mistakes, that means from:

"at" to (at least) "tricot"

which ... seems too loose to me. I am utterly unaware of the way
dictionaries work, but I bet they use hashes (in the sense of
"a (key, value) list" and in the proper classical sense of
"a good function for distinguishing and identifying items"),
as well as ad-hoc metrics A LOT. I'd rather stick with words of
the same length and at most off-by-one mistakes (or in obvious
cases, for longer words, other possibilities). There must be
a kind of metric-hash pair to play with the duality: number of
mistakes "allowed" / number of "similar" real words / length
of the word. Obviously,

"termodynamics"

gives rise to few doubts, but what about

"lykes"

"bykes" "Mykes" "Bites" "Bytes" "Likes" "Like" "Myke" "Mike"
"Byke"... and this without thinking.

I guess the hash/metric problem is the real game in Dictionary
software. I'd be glad to help if I find the time, though. But
it seems a trodden path to me (apart from the doing it for the joy
of it, of course).


2c

Pedro

El 12/04/2007, a las 18:04, Nicolas Buet escribióº

> I am so glad I find this topic here! I was on the plane yesterday, and
> just like Peter Norwig I managed to write a spelling corrector. An
> obvious difference is that he started from nothing, while I started
> from his code... Anyway, I did it in Ruby, it works, but... it's dead
> slow. If I need to compute the second array (the one with 2 mistakes),
> it takes about 1 second per suggestion.
>
> My edit1 function is like this:
> ----------------------------
> def edits1(word)
>    n = word.length
>
>    return ( (0..n-1).collect {|i| word[0,i] + word[i+1, n-1]}  +  
> #deletion
>    (0..n-2).collect {|i| word[0,i] + word[i+1,1] +word[i,1] +
> word[i+2, n-1]} + #transposition
>    ('a'..'z').collect {|c| (0..n-1).collect { |i| word[0,i] + c +
> word[i+1, n-1]} } + #alteration
>    ('a'..'z').collect {|c| (0..n).collect { |i| word[0,i] + c +
> word[i, n]} } ).flatten #insertion
>
> end
> ---------------------------------------------
>
> I believe this is where I lose most of the time? when I call it, it
> looks like that:
>
> --------------------------------------------
> edits2 = (edits1(word).collect {|i| edits1(i) & NWORDS }).flatten
> --------------------------------------------
>
> Can anyone see here a flaw in logic, or some function that should be
> avoided because it's known as slow?
>
> On 4/12/07, Drew Olson <olsonas / gmail.com> wrote:
>> This may be a bit OT, but I am really struck as to how elegantly  
>> both of
>> these languages (and the programmers as well) handle this seemingly
>> difficult problem. I've been working with ruby mainly for fun over  
>> the
>> past 8-9 months and seeing this solution (in Python and ruby) really
>> heightens my respect for the elegance and expressiveness of both
>> languages.
>>
>> </rave>
>>
>>
>> --
>> Posted via http://www.ruby-forum.com/.
>>
>>
>

Pedro Fortuny Ayuso
pfortuny / gmail.com      http://pfortuny.sdf-eu.org
C/Capuchinos 14, 1-S. 47006 Valladolid. SPAIN