On Tue, 2 Jan 2001, Ben Tilly wrote:

> gotoken / math.sci.hokudai.ac.jp (GOTO Kentaro) wrote:
> >
> >In message "[ruby-talk:8413] Re: speedup of anagram finder"
> >     on 00/12/31, David Alan Black <dblack / candle.superlink.net> writes:
> >
> > >So.... I think for general anagram-finding, the unpack one is still
> > >the one to beat.  (Of course, it is not impossible that one would want
> > >to find anagrams of 5-letter words.  My first Ruby program of any size
> > >was a Jotto implementation.  Maybe it's time to have another look at
> > >that :-)
> >
> >Thank you for interesting measurements.  Looking that, I'm feeling
> >difficulty of the average cost analysis in pragmatic sense.  As Guy
> >said [ruby-talk:8414], we should consider about also frequency of GC.
> 
> At some point in optimization you always reach the point where
> you make trade-offs.  There isn't necessarily better in general.
> Merely better for my situation.

My situation is that I want to do general anagram-finding :-)  Assuming
you're responding to my use of the word "general" -- I *think* it was
reasonably clear that I was talking about finding anagrams among a
mixed, non-fixed-length, list of words (such as a sizable chunk of
/usr/dict/words), rather than taking some overall position against the
idea that optimization involves trade-offs.  (See the next sentence,
for example.)  But maybe I'm misreading your response. 

> Has anyone tried using the frequency distribution of characters in
> English?  Have the most common letters assigned to the smallest
> primes.  This should keep the size of the index down, and I think
> would significantly improve performance...

Interesting idea.  A first go at this (with, I confess, papered-over
treatment of uppercase letters, i.e., everything is getting downcased
pending some work on that part) is the first threat to the Reign of
Unpack.  The following is based on all of /usr/dict/words, using
letter frequencies derived directly from that file:

			 user     system      total        real
   [first reports removed]
   frequency        14.470000   0.000000  14.470000 ( 14.483800)
   orig_primes      24.450000   0.010000  24.460000 ( 24.458457)
   unpack           15.550000   0.050000  15.600000 ( 15.613195)

OK, time to dust off the argument that hard-coding to ASCII is a
limited approach :-)


David

-- 
David Alan Black
home: dblack / candle.superlink.net
work: blackdav / shu.edu
Web:  http://pirate.shu.edu/~blackdav