2007/9/18, William James <w_a_x_man / yahoo.com>:
> On Sep 17, 4:13 pm, Robert Klemme <shortcut... / googlemail.com> wrote:
> > On 17.09.2007 21:49, William James wrote:
> >
> > > On Sep 17, 1:00 pm, Alex Shulgin <alex.shul... / gmail.com> wrote:
> > >> On Sep 17, 6:19 pm, William James <w_a_x_... / yahoo.com> wrote:
> >
> > >>> Awk is a very popular tool for text processing, but there is no
> > >>> way to make it treat a sequence of whitespace characters as a
> > >>> record-separator. So in awk, as in Ruby, text is almost always
> > >>> read a line at a time.
> > >> I thought Ruby is not just a text processing tool, but a general
> > >> purpose programming language.
> >
> > > You thought correctly.  But when you talk about reading a word at
> > > a time from a text file, you're talking about text processing.
> > > The point is that languages (including Ruby) that were designed
> > > to be very good at processing text usually read a line at a time,
> > > not a word at a time.  (A language that is very good at processing
> > > text can still be a general purpose language.)  Reading a word at
> > > a time seems to me to be odd and unnecessary, and I do a lot of
> > > text processing.  However, here's one way to do it.  (It would be
> > > a lot more efficient to read by lines.)
> >
> > > class IO
> > >   def get_word
> > >     word = nil
> > >     while c = self.read(1)
> > >       if c =~ /\s/
> > >         break if word
> > >       else
> > >         word||=""
> > >         word << c
> > >       end
> > >     end
> > >     word
> > >   end
> > > end
> >
> > > File.open('data'){|file|
> > >   while w = file.get_word
> > >     p w
> > >   end
> > > }
> >
> > I'd probably encapsulate the word reading in a module so the
> > implementation can be reused and exchanged if necessary:
> >
> > module WordIO
> >    def each_word(&b)
> >      each do |line|
> >        line.scan(/\w+/, &b)
> >      end
> >    end
> > end
> >
> > class IO
> >    include WordIO
> >
> >    def self.readwords(file)
> >      words = []
> >      File.open(file) {|io| io.each_word {|wd| words << wd}}
> >      words
> >    end
> > end
> >
> > ARGF.extend WordIO
> >
> > # additional goody
> > class String
> >    include WordIO
> > end
> >
> > :-)
> >
> > Kind regards
> >
> >         robert
>
> Very sophisticated.
>
> Since the o.p. wants whitespace as the word-separator,
> the reg.exp. should be changed to /\S+/.

See also Bertram's remark.  Btw, that's probably also the reason why
this is not in the standard library: there is probably no
one-size-fits-all definition of "word".  We have seen at least two so
far and I reckon there are more. :-)
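
To make the two definitions seen so far concrete, here is a small
sketch.  The sample string is made up for illustration; only the two
patterns come from this thread.

```ruby
# Hypothetical sample line, not taken from the original poster's data.
line = "don't stop, e-mail me"

# \w+ matches runs of letters, digits and underscores only,
# so apostrophes, commas and hyphens split words.
word_chars = line.scan(/\w+/)

# \S+ matches runs of anything that is not whitespace,
# so punctuation stays attached to the word.
non_space = line.scan(/\S+/)

p word_chars  # => ["don", "t", "stop", "e", "mail", "me"]
p non_space   # => ["don't", "stop,", "e-mail", "me"]
```

Neither result is obviously "the" right one, which is the point.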

> But, dang it all, I'm gonna say you're cheating because
> you're still reading lines behind the scenes!

;-)  But I said the implementation can be exchanged.

> Reading lines and breaking them into words is a lot
> easier than reading characters and constructing words.

Correct.  But just a bit:

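module WordIO
  # true if c (a Fixnum in Ruby 1.8, a String in 1.9+) is a word character
  def wchar?(c)
    /\A\w\z/ =~ c.chr
  end

  # yield each run of word characters, reading one character at a time
  def each_word
    word = nil
    while ( c = getc )
      if wchar? c
        (word ||= "") << c
      else
        yield word if word
        word = nil
      end
    end
    # yield a word that is still pending at end of input
    yield word if word
    self
  end
end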
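
A quick way to try character-wise reading without a file is StringIO.
The sketch below is a variant, not the module above: the name
CharWordIO and the pattern argument are made up for illustration, and
it assumes Ruby 1.9 or later, where getc returns a one-character
String.  It also yields a word that is still pending at end of input.

```ruby
require 'stringio'

# Hypothetical variant of WordIO: the word-character pattern is a parameter,
# so the definition of "word" can be exchanged per call.
module CharWordIO
  def each_word(pattern = /\w/)
    word = nil
    while (c = getc)               # assumes getc returns a String (Ruby >= 1.9)
      if pattern =~ c
        word = (word || "") + c    # extend the current word
      elsif word
        yield word                 # a separator ends the current word
        word = nil
      end
    end
    yield word if word             # flush a word pending at end of input
    self
  end
end

io = StringIO.new("foo bar-baz\nqux")
io.extend CharWordIO

words = []
io.each_word { |w| words << w }
p words      # => ["foo", "bar", "baz", "qux"]

io.rewind
non_space = []
io.each_word(/\S/) { |w| non_space << w }
p non_space  # => ["foo", "bar-baz", "qux"]
```

The same extend call should work on a File or on ARGF, since all the
module needs from its host is getc.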

Kind regards

robert