On Sep 17, 4:13 pm, Robert Klemme <shortcut... / googlemail.com> wrote:
> On 17.09.2007 21:49, William James wrote:
>
> > On Sep 17, 1:00 pm, Alex Shulgin <alex.shul... / gmail.com> wrote:
> >> On Sep 17, 6:19 pm, William James <w_a_x_... / yahoo.com> wrote:
>
> >>> Awk is a very popular tool for text processing, but there is no
> >>> way to make it treat a sequence of whitespace characters as a
> >>> record-separator. So in awk, as in Ruby, text is almost always
> >>> read a line at a time.
> >> I thought Ruby is not just a text processing tool, but a general
> >> purpose programming language.
>
> > You thought correctly.  But when you talk about reading a word at
> > a time from a text file, you're talking about text processing.
> > The point is that languages (including Ruby) that were designed
> > to be very good at processing text usually read a line at a time,
> > not a word at a time.  (A language that is very good at processing
> > text can still be a general purpose language.)  Reading a word at
> > a time seems to me to be odd and unnecessary, and I do a lot of
> > text processing.  However, here's one way to do it.  (It would be
> > a lot more efficient to read by lines.)
>
> > class IO
> >   def get_word
> >     word = nil
> >     while c = self.read(1)
> >       if c =~ /\s/
> >         break if word
> >       else
> >         word||=""
> >         word << c
> >       end
> >     end
> >     word
> >   end
> > end
>
> > File.open('data'){|file|
> >   while w = file.get_word
> >     p w
> >   end
> > }
>
> I'd probably encapsulate the word reading in a module so the
> implementation can be reused and exchanged if necessary:
>
> module WordIO
>    def each_word(&b)
>      each_line do |line|   # String#each is gone as of Ruby 1.9
>        line.scan(/\w+/, &b)
>      end
>    end
> end
>
> class IO
>    include WordIO
>
>    def self.readwords(file)
>      words = []
>      File.open(file) {|io| io.each_word {|wd| words << wd}}
>      words
>    end
> end
>
> ARGF.extend WordIO
>
> # additional goody
> class String
>    include WordIO
> end
>
> :-)
>
> Kind regards
>
>         robert

Very sophisticated.

Since the o.p. wants whitespace as the word-separator,
the reg.exp. should be changed to /\S+/.
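
A quick illustration of the difference, using a made-up sample line
(the string is only an assumption for demonstration):

```ruby
# Compare the two patterns on a hypothetical sample line.
line = "don't stop  C++ e-mail"

word_chars = line.scan(/\w+/)   # runs of letters, digits and underscore
non_space  = line.scan(/\S+/)   # runs of anything that is not whitespace

p word_chars
p non_space
```

With /\w+/ the apostrophe, the plus signs and the hyphen all split or
drop tokens; with /\S+/ only whitespace separates words.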

But, dang it all, I'm gonna say you're cheating because
you're still reading lines behind the scenes!
Reading lines and breaking them into words is a lot
easier than reading characters and constructing words.
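
For what it's worth, here is a rough sketch of a middle road: read
fixed-size chunks instead of lines or single characters, and carry any
partial word over to the next chunk.  The module name, method name and
buffer size are all made up for this example, and since read(n) counts
bytes, the words come back as binary strings.

```ruby
require 'stringio'

# Hypothetical module; reads fixed-size chunks rather than lines.
module ChunkedWordIO
  CHUNK_SIZE = 4096  # arbitrary buffer size, an assumption

  def each_word_chunked(chunk_size = CHUNK_SIZE)
    return enum_for(:each_word_chunked, chunk_size) unless block_given?
    carry = String.new               # partial word left over from the last chunk
    while (chunk = read(chunk_size)) # read(n) returns nil at end of file
      carry << chunk
      # Everything up to the last whitespace holds only complete words.
      cut = carry.rindex(/\s/)
      next unless cut
      carry[0..cut].scan(/\S+/) { |w| yield w }
      carry = carry[(cut + 1)..-1]
    end
    yield carry unless carry.empty?  # final word with no trailing whitespace
    self
  end
end

io = StringIO.new("alpha  beta\n\tgamma delta")
io.extend ChunkedWordIO
io.each_word_chunked(3) { |w| p w }
```

The tiny chunk size in the usage line is only there to exercise the
carry-over logic; a real file would use the default.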