On Thu, 1 Mar 2001, Arno Erpenbeck wrote:

> Greetings everybody,
> 
> maybe somebody can help me with this: How can I collect n-grams (i.e.
> tuples of characters/words/whatever) from plain text? I tried something
> like this:
> 
> while line = gets
>   line.gsub(/[a-zA-Z\s]{3,3}/) {|p| print "#{p},"}
> end
> 
> However, this makes "too big" steps because the regexp matches one
> triple and then the next one behind it, but no overlaps. There must be a
> simple solution I guess.
> 
> Example:
> Input "The man sees the boy with the telescope."
> Output "The, ma,n s,ees, th,e b,oy ,wit,h t,he ,tel,esc,ope,"
> Desired output "The,he ,e m, ma,man,..."

Quick first try (I know nothing about any ngram theory that may
exist, so I don't know whether certain things are right, such
as end-boundary behavior):

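  class String
    # Return every overlapping substring of length len, in order.
    # If a block is given, each n-gram is also yielded as it is found.
    def ngrams(len=1)
      ngrams = []
      (0..size - len).each do |n|
        ng = self[n...n+len]
        ngrams.push(ng)
        yield ng if block_given?
      end
      ngrams
    end
  end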

  str = "I am a string."
  p str.ngrams(5)
  str.ngrams(3) { |s| print "(%s)" % s }

=>

["I am ", " am a", "am a ", "m a s", " a st",
"a str", " stri", "strin", "tring", "ring."]

(I a)( am)(am )(m a)( a )(a s)( st)(str)(tri)(rin)(ing)(ng.)
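
If you would rather stay with the regexp approach from the question, a
zero-width lookahead is one way to get overlapping matches, since the
lookahead itself consumes no characters. This is only a sketch: the
method name regex_ngrams is made up for illustration, and it assumes a
Ruby whose regexp engine supports capture groups inside a lookahead.

```ruby
# Hypothetical helper, not part of the String#ngrams code above.
# The lookahead (?=...) matches at a position without consuming text,
# so scan moves forward one character at a time; the capture group
# inside the lookahead carries the n-gram itself.
def regex_ngrams(text, len = 3)
  text.scan(/(?=([a-zA-Z\s]{#{len}}))/).flatten
end

puts regex_ngrams("The man sees").join(",")
# prints: The,he ,e m, ma,man,an ,n s, se,see,ees
```

Note that the character class still excludes punctuation, so any window
containing the final "." of the example sentence is skipped.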

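Since the question also mentions tuples of words, here is a rough
sketch of the same sliding window done with Enumerable#each_cons,
which avoids the index arithmetic. It assumes a reasonably recent Ruby
(1.9 or later) where each_cons and String#each_char are built in; the
helper names char_ngrams and word_ngrams are hypothetical.

```ruby
# Hypothetical helpers for illustration only.

# Character n-grams: slide a window of len characters over the string.
def char_ngrams(text, len)
  text.each_char.each_cons(len).map(&:join)
end

# Word n-grams: split on whitespace, then slide a window of len words.
def word_ngrams(text, len)
  text.split.each_cons(len).map { |words| words.join(" ") }
end

p char_ngrams("I am a string.", 5).first(3)
# prints: ["I am ", " am a", "am a "]
p word_ngrams("The man sees the boy", 2)
# prints: ["The man", "man sees", "sees the", "the boy"]
```

When the input is shorter than len, each_cons yields nothing, so both
helpers return an empty array.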

> BTW: If this list is not intended for questions of this kind, please let
> me know, and I will go and look somewhere else.

We mainly use it to discuss the weather, but occasional interesting
questions related to the Ruby programming language are tolerated :-)


David

-- 
David Alan Black
home: dblack / candle.superlink.net
work: blackdav / shu.edu
Web:  http://pirate.shu.edu/~blackdav