On Fri, 28 Apr 2006, Jake McArthur wrote:

> I've already been working on this. Right now, I'm making a simple algorithm
> that works on arbitrary text and returns a number reflecting how similar two
> strings are. Even this alone has been giving fairly good results on code,
> even code that was written rather differently, but my plan is to use this
> algorithm to compare symbols and literals. A similar algorithm, working on a
> slightly larger scale, would compare entire lines of code for similar
> syntax, augmented by data from the first algorithm.
>
> I'm still thinking about this. Suggestions, anybody?

the site seems to be down at the moment, but this is close to perfect for your needs

   http://complearn.org/

google cache (until the site is back up)

   http://72.14.207.104/search?q=cache:bmlzYI4W39sJ:www.complearn.org/+complearn&hl=en&gl=us&ct=clnk&cd=1

more links

   http://www.newscientist.com/article.ns?id=dn3602
   http://homepages.cwi.nl/~cilibrar/musicart/trnmag.com/Stories/2003/042303/Software_sorts_tunes_042303.html


i've played with it and, since there are command line tools and a ruby api, i
would think you could categorize text quite easily.
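
to give a feel for the idea, here is a minimal sketch of a normalized
compression distance, the kind of compression-based similarity measure
complearn is built around. note that this does NOT use the complearn ruby api:
it only uses zlib from the standard library, and the method names (csize, ncd)
are made up for illustration. values near 0 mean "very similar", values near 1
mean "unrelated".

```ruby
require 'zlib'

# compressed size in bytes, used as a rough stand-in for how much
# information a string carries (hypothetical helper, not from complearn)
def csize(str)
  Zlib::Deflate.deflate(str, Zlib::BEST_COMPRESSION).bytesize
end

# normalized compression distance between two strings:
#   (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
# if y adds little new information to x, compressing them together costs
# barely more than compressing one alone, and the result is small.
def ncd(x, y)
  cx  = csize(x)
  cy  = csize(y)
  cxy = csize(x + y)
  (cxy - [cx, cy].min).to_f / [cx, cy].max
end
```

something like ncd(File.read('a.rb'), File.read('b.rb')) would then give a
rough similarity score for two source files (the file names are placeholders).
very short strings give noisy results, since the compressor's fixed overhead
dominates.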

we are actually playing with this to identify spatial/temporal trends in
nighttime lights satellite imagery.

cheers.

-a
-- 
be kind whenever possible... it is always possible.
- h.h. the 14th dalai lama