I decided to do a somewhat more ambitious test.  After training on
a thousand arbitrary .doc files and a thousand arbitrary .html files
(and tweaking it to return the top 15 words instead of just the top 5) I
fed it Why the lucky stiff's latest opus:


loading wordcount.dat...
reading...
analyzing...
most characteristic words:
he, his, cham, dr, said, ruby, goat, method, 
irb, ree, paij, sentence, him, had, end

Not bad at all.  Although I haven't read it yet myself, this looks like
a quite reasonable summary.  I'm a little surprised at the absence of
flugel and trisomatic, but perhaps WTLS has gotten less predictable in
his vocabulary since the last time I read him.
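
For anyone curious how this kind of thing can work at all: the scoring step presumably compares each word's frequency in the input against its frequency in the training corpus. Here is a rough Ruby sketch of my guess at the idea (martinus hasn't described his actual formula, and BACKGROUND below is just a hypothetical stand-in for wordcount.dat):

```ruby
# A guess at the kind of scoring textanalyze.rb might do: rank words
# by how much more frequent they are in the input than in the training
# corpus. BACKGROUND is a hypothetical stand-in for wordcount.dat.
BACKGROUND = { "the" => 1000, "said" => 120, "ruby" => 2, "goat" => 1 }
BACKGROUND_TOTAL = BACKGROUND.values.inject(0) { |s, n| s + n }

def characteristic_words(text, top_n = 5)
  counts = Hash.new(0)
  text.downcase.scan(/[a-z']+/) { |w| counts[w] += 1 }
  total = counts.values.inject(0) { |s, n| s + n }.to_f
  scored = counts.map do |word, n|
    # relative frequency in the input over relative frequency in the
    # corpus; add-one smoothing so unseen words don't divide by zero
    bg = (BACKGROUND.fetch(word, 0) + 1) / (BACKGROUND_TOTAL + 1).to_f
    [word, (n / total) / bg]
  end
  scored.sort_by { |_, s| -s }.first(top_n).map { |w, _| w }
end

puts characteristic_words("the goat said ruby ruby the the").join(", ")
```

A ratio like that would at least explain how rare-in-the-corpus words such as "goat" and "cham" can outrank everyday words despite lower raw counts.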

 -- Markus


On Tue, 2004-09-21 at 21:18, jm wrote:
> Thought I'd give this simple program a go and review it for those 
> curious as to how well it works. Two tests were carried out: in the 
> first I used only the following texts, and the second repeated the 
> first with additional training texts. The texts are from Project 
> Gutenberg (except openbsd35.readme.txt, which for some reason was in 
> the same directory).
> 
> $ ls *.txt
> 8ldvc10.txt             openbsd35.readme.txt
> grimm10.txt             sunzu10.txt
> 
> $ cat *.txt |ruby textanalyze.rb c
> reading...
> Indexed 49916 words in 1.001535 seconds, 49839.4963730673 words per second
> Indexed 117545 words in 2.005191 seconds, 58620.3508792928 words per second
> Indexed 184142 words in 3.013597 seconds, 61103.7242205909 words per second
> Indexed 245471 words in 4.035581 seconds, 60826.6814617276 words per second
> Indexed 300307 words in 5.045199 seconds, 59523.3210820822 words per second
> Indexed 351646 words in 6.052536 seconds, 58098.9522408458 words per second
> Indexed 414601 words in 7.055078 seconds, 58766.3240576504 words per second
> Indexed 416108 words in 8.056517 seconds, 51648.6218548288 words per second
> storing into wordcount.dat...
> Indexed 416108 words in 8.159865 seconds, 50994.4711095098 words per second
> 
> I then fed it a text version of my marketing essay, which should have 
> very little, if anything, in common with the training texts.
> 
> $ cat ../assignment1.txt|ruby textanalyze.rb a
> loading wordcount.dat...
> reading...
> analyzing...
> most characteristic words:
> marketing, customers, customer, purchase, interaction
> 
> I then added more texts:
> 
> $ ls *.txt
> 8ldvc10.txt             openbsd35.readme.txt    tprnc11.txt
> dracu13.txt             repub11.txt             warw12.txt
> grimm10.txt             sunzu10.txt
> 
> and reran the above creation and analysis steps to get
> 
> most characteristic words:
> marketing, customers, customer, 4ps, interaction
> 
> So, not bad for such a simple algorithm, as I would have picked the 
> keywords as relationship, marketing, 4Ps, and customer retention. I'm 
> surprised coffee didn't show up, as I kept using it in examples. It 
> doesn't do too badly in this simple test, especially considering that 
> the training texts were chosen at random and not related to the text 
> analyzed. A dictionary of plurals, or some other means of dealing 
> with plurals, would be my only suggestion.
> 
> Jeff.
> 
> On 22/09/2004, at 7:54 AM, martinus wrote:
> 
> > I have created a little text analysis tool that tries to extract
> > words that are important in a given text. I have implemented one of
> > my strange ideas, and to my own surprise, it works. I have no idea
> > if any similar tool exists, so I do not know where to post this. It
> > is written in Ruby, so I just post it here :-)
> >
> > To use this tool, you first have to index a large number of text
> > files. This generates an index, which is later used when analyzing
> > text.
> >
> > For example, I have indexed several fairy tales, and used this
> > index to extract important words. Here are some results:
> >
> > Little Red Riding Hood.txt: hood, grandma, riding, hunter, red
> > Little Mermaid.txt: sirenetta, mermaid, sea, waves, sisters
> > Alladin.txt: aladdin, lamp, genie, sultan, wizard
> >
> > The algorithm works with HTML files, and probably any other format
> > that contains text. Here is an example of analysis results when
> > HTML files are indexed:
> >
> > SSL-RedHat-HOWTO.htm: certificate, ssl, private, key, openssl
> > META-FAQ.html: newsgroup, comp, sunsite, questions, announce
> > TeTeX-HOWTO.html: tetex, tex, ctan, latex, archive
> >
> > And now my question: Does anyone know where to find such tools or
> > algorithms?
> >
> > You can get it from here, it's public domain:
> > http://martinus.geekisp.com/rublog.cgi/Projects/TextAnalyzer
> >
> > martinus
> >
> >
>