I decided to do a somewhat more ambitious test. After training on a
thousand arbitrary .doc files and a thousand arbitrary .html files (and
tweaking it to return the top 15 words instead of just the top 5) I fed
it Why the lucky stiff's latest opus:

loading wordcount.dat...
reading...
analyzing...
most characteristic words:
he, his, cham, dr, said, ruby, goat, method, irb, ree, paij, sentence,
him, had, end

Not bad at all. Although I haven't read it yet myself, this looks like a
quite reasonable summary. I'm a little surprised at the absence of
flugel and trisomatic, but perhaps WTLS has gotten less predictable in
his vocabulary since the last time I read him.

-- Markus

On Tue, 2004-09-21 at 21:18, jm wrote:
> Thought I'd give this simple program a go and review it for those
> curious as to how well it works. Two tests were carried out: in the
> first I used only the following texts; the second repeated the first
> with additional training texts. The texts are from Project Gutenberg
> (except openbsd35.readme.txt, which for some reason was in the same
> directory).
>
> $ ls *.txt
> 8ldvc10.txt           openbsd35.readme.txt
> grimm10.txt           sunzu10.txt
>
> $ cat *.txt |ruby textanalyze.rb c
> reading...
> Indexed 49916 words in 1.001535 seconds, 49839.4963730673 words per second
> Indexed 117545 words in 2.005191 seconds, 58620.3508792928 words per second
> Indexed 184142 words in 3.013597 seconds, 61103.7242205909 words per second
> Indexed 245471 words in 4.035581 seconds, 60826.6814617276 words per second
> Indexed 300307 words in 5.045199 seconds, 59523.3210820822 words per second
> Indexed 351646 words in 6.052536 seconds, 58098.9522408458 words per second
> Indexed 414601 words in 7.055078 seconds, 58766.3240576504 words per second
> Indexed 416108 words in 8.056517 seconds, 51648.6218548288 words per second
> storing into wordcount.dat...
> Indexed 416108 words in 8.159865 seconds, 50994.4711095098 words per second
>
> I then fed it a text version of my marketing essay, which should have
> very little if anything in common with the training texts.
>
> $ cat ../assignment1.txt|ruby textanalyze.rb a
> loading wordcount.dat...
> reading...
> analyzing...
> most characteristic words:
> marketing, customers, customer, purchase, interaction
>
> I then added more texts
>
> $ ls *.txt
> 8ldvc10.txt           openbsd35.readme.txt  tprnc11.txt
> dracu13.txt           repub11.txt           warw12.txt
> grimm10.txt           sunzu10.txt
>
> and reran the above creation and analysis steps to get
>
> most characteristic words:
> marketing, customers, customer, 4ps, interaction
>
> So, not bad for such a simple algorithm, as I would have picked the
> keywords as relationship, marketing, 4Ps, and customer retention. I'm
> surprised coffee didn't show up, as I kept using it in examples. It
> doesn't do too badly in this simple test, especially considering that
> the training texts were chosen at random and not related to the text
> analyzed. A dictionary of plurals or some other means of dealing with
> plurals would be my only suggestion.
>
> NB:
>
> Jeff.
>
> On 22/09/2004, at 7:54 AM, martinus wrote:
>
> > I have created a little text analysis tool that tries to extract
> > words that are important in a given text. I have implemented one of
> > my strange ideas, and to my own surprise, it works. I have no idea if
> > any similar tool exists, so I do not know where to post this. It is
> > written in Ruby, so I just post it here :-)
> >
> > To use this tool, you first have to index a large number of text
> > files. It generates an index, which is later used when analyzing
> > text.
> >
> > For example, I have indexed several fairy tales, and used this index
> > to extract important words. Here are some results:
> >
> > Little Red Riding Hood.txt: hood, grandma, riding, hunter, red
> > Little Mermaid.txt: sirenetta, mermaid, sea, waves, sisters
> > Alladin.txt: aladdin, lamp, genie, sultan, wizard
> >
> > The algorithm works with HTML files and probably any other format
> > that contains text. Here is an example of analysis results when HTML
> > files are indexed:
> >
> > SSL-RedHat-HOWTO.htm: certificate, ssl, private, key, openssl
> > META-FAQ.html: newsgroup, comp, sunsite, questions, announce
> > TeTeX-HOWTO.html: tetex, tex, ctan, latex, archive
> >
> > And now my question: Does anyone know where to find such tools or
> > algorithms?
> >
> > You can get it from here, it's public domain:
> > http://martinus.geekisp.com/rublog.cgi/Projects/TextAnalyzer
> >
> > martinus
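The thread does not include textanalyze.rb itself, so for readers curious
about the general technique, here is a minimal Ruby sketch of one way such
a tool could work: build a word-frequency index from a background corpus,
then rank the words of a new document by how over-represented they are
relative to that corpus. The wordcount.dat file name and the c/a modes
mirror the transcripts above, but the scoring rule (a count-weighted log
frequency ratio with add-one smoothing) and all other details are
assumptions, not martinus's actual implementation.

  #!/usr/bin/env ruby
  # keyword_sketch.rb -- illustrative only; not the real textanalyze.rb.
  # "c" mode builds a background word-frequency index from stdin,
  # "a" mode ranks the words of stdin against that index.

  WORD_RE = /[a-z']+/

  def word_counts(text)
    counts = Hash.new(0)
    text.downcase.scan(WORD_RE) { |w| counts[w] += 1 }
    counts
  end

  case ARGV.shift
  when "c"
    counts = word_counts($stdin.read)
    File.open("wordcount.dat", "wb") { |f| Marshal.dump(counts, f) }
    puts "indexed #{counts.values.inject(0) { |s, n| s + n }} words"
  when "a"
    corpus = File.open("wordcount.dat", "rb") { |f| Marshal.load(f) }
    corpus_total = corpus.values.inject(0) { |s, n| s + n }.to_f
    doc = word_counts($stdin.read)
    doc_total = doc.values.inject(0) { |s, n| s + n }.to_f

    # Score each word by how much more frequent it is in the document
    # than in the corpus; weighting by the document count keeps one-off
    # rare words from dominating, and add-one smoothing handles words
    # the corpus has never seen.
    scored = doc.map do |word, n|
      doc_freq    = n / doc_total
      corpus_freq = (corpus.fetch(word, 0) + 1) / (corpus_total + corpus.size)
      [word, n * Math.log(doc_freq / corpus_freq)]
    end

    top = scored.sort_by { |_, score| -score }.first(5).map { |word, _| word }
    puts "most characteristic words:"
    puts top.join(", ")
  else
    abort "usage: keyword_sketch.rb c|a < textfile"
  end

Driven the same way as in jm's transcript (cat *.txt | ruby
keyword_sketch.rb c, then cat essay.txt | ruby keyword_sketch.rb a), this
prints the same style of "most characteristic words" line, though the
actual words chosen will depend on whatever scoring rule the real tool
uses.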
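On jm's suggestion about plurals: a full plural dictionary is one option,
but even a crude folding step would help a sketch like the one above. The
fold_plural helper below is hypothetical and not part of the original
tool; it collapses a regular English plural onto its singular only when
the singular has already been seen, so "customers" and "customer" count
as one word while a word like "glass" is left alone.

  # fold_plural (hypothetical): map a regular plural to its singular,
  # but only if the singular is already a known word, so "glass" is not
  # mangled into "glas".  `known` can be any collection responding to
  # include?, e.g. the document's word-count Hash.
  def fold_plural(word, known)
    if word =~ /ies\z/ && known.include?(word.sub(/ies\z/, "y"))
      word.sub(/ies\z/, "y")
    elsif word =~ /s\z/ && known.include?(word.sub(/s\z/, ""))
      word.sub(/s\z/, "")
    else
      word
    end
  end

  # Example: with the `doc` hash from the sketch above,
  #   doc.keys.map { |w| fold_plural(w, doc) }
  # maps "customers" onto "customer" whenever both appear in the text.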