Thought I'd give this simple program a go and review it for those 
curious as to how well it works. Two tests were carried out: in the 
first I used only the following texts; the second repeated the first 
with additional training texts. The texts are from Project Gutenberg 
(except openbsd35.readme.txt, which for some reason was in the same 
directory).

$ ls *.txt
8ldvc10.txt             openbsd35.readme.txt
grimm10.txt             sunzu10.txt

$ cat *.txt |ruby textanalyze.rb c
reading...
Indexed 49916 words in 1.001535 seconds, 49839.4963730673 words per 
second
Indexed 117545 words in 2.005191 seconds, 58620.3508792928 words per 
second
Indexed 184142 words in 3.013597 seconds, 61103.7242205909 words per 
second
Indexed 245471 words in 4.035581 seconds, 60826.6814617276 words per 
second
Indexed 300307 words in 5.045199 seconds, 59523.3210820822 words per 
second
Indexed 351646 words in 6.052536 seconds, 58098.9522408458 words per 
second
Indexed 414601 words in 7.055078 seconds, 58766.3240576504 words per 
second
Indexed 416108 words in 8.056517 seconds, 51648.6218548288 words per 
second
storing into wordcount.dat...
Indexed 416108 words in 8.159865 seconds, 50994.4711095098 words per 
second
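For anyone wondering what the indexing step might look like: I haven't read textanalyze.rb's internals, but the "Indexed N words ... storing into wordcount.dat" output suggests something like counting word occurrences and serializing the table. A minimal sketch (names and details are my guesses, not the actual implementation):

```ruby
# Count word occurrences in the training text, then persist the table
# with Marshal -- a guess at what "storing into wordcount.dat" does.
counts = Hash.new(0)
text = "the wolf met the girl in the wood"
text.downcase.scan(/[a-z']+/) { |word| counts[word] += 1 }

File.open("wordcount.dat", "wb") { |f| Marshal.dump(counts, f) }
loaded = File.open("wordcount.dat", "rb") { |f| Marshal.load(f) }
# loaded now holds the same counts, ready for the analyze step.
```

The real tool streams several files and reports a running words-per-second figure, but the core bookkeeping can be this small.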

I then fed it a text version of my marketing essay, which should have 
very little, if anything, in common with the training texts.

$ cat ../assignment1.txt|ruby textanalyze.rb a
loading wordcount.dat...
reading...
analyzing...
most characteristic words:
marketing, customers, customer, purchase, interaction
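My guess at how "most characteristic" gets decided (martinus doesn't say, so this is speculation, not the tool's actual algorithm): rank words by how much more frequent they are in the analyzed text than in the training corpus. A sketch with hypothetical counts:

```ruby
# Hypothetical corpus counts from the training index.
corpus = Hash.new(0).merge("the" => 1000, "and" => 800, "king" => 50)
corpus_total = corpus.values.inject(0) { |s, n| s + n }

# Word counts for the document being analyzed.
doc = Hash.new(0)
"marketing reaches customers and the customer buys"
  .scan(/[a-z]+/) { |w| doc[w] += 1 }
doc_total = doc.values.inject(0) { |s, n| s + n }

# Score each word by the ratio of its document frequency to its
# corpus frequency; +1 avoids dividing by zero for unseen words.
scores = doc.map do |word, n|
  [word, (n.to_f / doc_total) / ((corpus[word] + 1).to_f / corpus_total)]
end
top = scores.sort_by { |_, s| -s }.first(5).map(&:first)
```

Common words like "the" and "and" score near 1 and drop out, while corpus-rare words like "marketing" float to the top, which matches the behaviour seen above.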

I then added more texts

$ ls *.txt
8ldvc10.txt             openbsd35.readme.txt    tprnc11.txt
dracu13.txt             repub11.txt             warw12.txt
grimm10.txt             sunzu10.txt

and reran the creation and analyze steps above, to get:

most characteristic words:
marketing, customers, customer, 4ps, interaction

So, not bad for such a simple algorithm. I would have picked the 
keywords as relationship, marketing, 4Ps, and customer retention. I'm 
surprised coffee didn't show up, as I kept using it in examples. It 
doesn't do too badly in this simple test, especially considering that 
the training texts were chosen at random and are not related to the 
text analyzed. A dictionary of plurals, or some other means of dealing 
with plurals, would be my only suggestion.
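Even a naive singularizing pass before counting would fold "customers" and "customer" together. A rough sketch (hand-rolled rules, not a real stemmer; a proper algorithm like Porter's would handle far more cases):

```ruby
# Naive plural folding: strip common English plural endings so that
# "customers" and "customer" count as one word. Deliberately crude.
def singularize(word)
  case word
  when /ies\z/  then word.sub(/ies\z/, "y")  # "stories" -> "story"
  when /sses\z/ then word.sub(/es\z/, "")    # "classes" -> "class"
  when /s\z/    then word.sub(/s\z/, "")     # "customers" -> "customer"
  else word
  end
end
```

Applied before incrementing the counts, this would have merged two of the five keywords above into one slot, freeing a spot for another candidate.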


Jeff.

On 22/09/2004, at 7:54 AM, martinus wrote:

> I have created a little text analysis tool that tries to extract
>  words that are important in a given text. I have implemented one of my
>  strange ideas, and to my own surprise, it works. I have no idea if any
>  similar tool exists, so I do not know where to post this. It is
>  written in Ruby, so I just post it here :-)
>
> To use this tool, you first have to index a large amount of text files.
>  It generates an index, which is later used when analyzing text.
>
> For example, I have indexed several fairy tales, and used this index to
>  extract important words. Here are some results:
>
> Little Red Riding Hood.txt: hood, grandma, riding, hunter, red
>  Little Mermaid.txt: sirenetta, mermaid, sea, waves, sisters
>  Alladin.txt: aladdin, lamp, genie, sultan, wizard
>
> The algorithm works with HTML files and probably any other format that
>  contains text. Here is an example of analysis results when HTML
>  files are indexed:
>
> SSL-RedHat-HOWTO.htm: certificate, ssl, private, key, openssl
>  META-FAQ.html: newsgroup, comp, sunsite, questions, announce
>  TeTeX-HOWTO.html: tetex, tex, ctan, latex, archive
>
> And now my question: Does anyone know where to find such tools or
>  algorithms?
>
> You can get it from here, it's public domain:
> http://martinus.geekisp.com/rublog.cgi/Projects/TextAnalyzer
>
> martinus
>
>