Hi Ben,
In message "[ruby-talk:8468] Re: speedup of anagram finder"
on 01/01/02, "Ben Tilly" <ben_tilly / hotmail.com> writes:
>At some point in optimization you always reach the point where
>you make trade-offs. There isn't necessarily better in general.
>Merely better for my situation.
I agree 100%.
>Has anyone tried using the frequency distribution of
>characters in English? Have the most common letters
>assigned to the smallest primes. This should keep the
>size of the index down, and I think would significantly
>improve performance...
For anyone who would like to try it: I have only measured the distribution
of characters in /usr/share/dict/words. Does it match the well-known
statistics derived from English corpora?
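A count like this can be reproduced with a few lines of Ruby. This is only a minimal sketch, not necessarily the script used for the numbers in this message: the names letter_counts and print_counts are made up for illustration, and it assumes that case is folded and that only the letters a-z are counted.

```ruby
# Count how often each letter a-z occurs in the given lines.
# `letter_counts` is a hypothetical helper name, not from the original thread.
def letter_counts(lines)
  counts = Hash.new(0)
  lines.each do |line|
    line.downcase.each_char do |ch|
      # Assumption: ignore anything that is not a plain ASCII letter.
      counts[ch] += 1 if ch >= "a" && ch <= "z"
    end
  end
  counts
end

# Print letters from most to least frequent, one "char count" pair per line.
def print_counts(counts, out = $stdout)
  counts.sort_by { |ch, n| [-n, ch] }.each do |ch, n|
    out.puts "#{ch} #{n}"
  end
end

# Possible usage, assuming the word list exists at this path:
#   File.open("/usr/share/dict/words") { |f| print_counts(letter_counts(f.each_line)) }
```

The totals depend on the word list installed on each system, so another machine will probably not give exactly the same counts.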
-- Gotoken
char  occurrences
e 234814
i 200619
a 198957
o 170392
r 160496
n 158281
t 152574
s 139244
l 130178
c 103307
u 87213
p 78075
m 70505
d 68008
h 64165
y 51527
g 47011
b 40357
f 24112
v 20104
k 16022
w 13826
z 8441
x 6926
q 3730
j 3075
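
With the order in the table above, Ben's suggestion could be sketched as follows. This assumes the index key of a word is the product of one prime per letter, so that anagrams get the same key. The names first_primes, LETTERS_BY_FREQUENCY, PRIME_FOR and anagram_key are made up for illustration and are not from the anagram finder discussed in this thread.

```ruby
# Return the first n primes by simple trial division (enough for 26 letters).
def first_primes(n)
  primes = []
  candidate = 2
  while primes.size < n
    primes << candidate if primes.none? { |p| candidate % p == 0 }
    candidate += 1
  end
  primes
end

# Letters ordered from most to least frequent, copied from the table above.
LETTERS_BY_FREQUENCY = "eiaorntslcupmdhygbfvkwzxqj".chars

# The most common letter gets 2, the next gets 3, and so on up to 101.
PRIME_FOR = LETTERS_BY_FREQUENCY.zip(first_primes(26)).to_h

# Hypothetical key function: multiply the primes of all letters in the word.
# Characters outside a-z are skipped (an assumption of this sketch).
def anagram_key(word)
  word.downcase.each_char.inject(1) do |product, ch|
    prime = PRIME_FOR[ch]
    prime ? product * prime : product
  end
end
```

Two words are anagrams of each other exactly when their keys are equal, because prime factorization is unique. Giving the small primes to frequent letters should keep typical keys smaller, but whether that makes a measurable speed difference would have to be benchmarked.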