David A. Black wrote:

> On Fri, 4 Nov 2005, Nikolai Weibull wrote:

> > David A. Black wrote:

> > > I'm thinking of cases like this:
> > >
> > >   re = /abc.*def/
> > >
> > > The first chunk out of the file might match this -- but then you'd
> > > have to keep going, really until EOF, to get the greedy match if
> > > it's there.  Then you'd have to go back.

> > Well, think of it like this instead.  The Regexp simply reads from
> > the input source when it needs more data.  The Regexp will
> > concatenate the new data with the old and continue on its matching
> > routine.  We build the input as we go along, i.e., we're in a sense
> > dealing with implementing lazy Strings.  This won't cause any issues
> > with backtracking, as the data will still be there.

> In the /abc.*def/ case, though, you'd always have to take all the
> input (at least up to the third-to-last character in the file), even
> if you had an intermediate match.  So "needs more data" would not be
> something the regex could tell you.  It would say, "Yes, there's a
> match", but you would have to know that the "yes" didn't mean you
> could stop.

.* needs more data until there is no more data (#read returns nil), then
it fails as it hasn't been able to match 'def' and backtracks until that
part of the regex does.  Then it has a match.  (This ignores newline
conventions, but let's ignore them for now.)  You have the same problem
when doing this on a regular string.
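
To make that concrete, here is a rough sketch.  Ruby's Regexp can't
ask its input for more data, so this only simulates the lazy String by
draining a StringIO into a buffer first; the sample text and the chunk
size of 4 are made up for illustration.

```ruby
require 'stringio'

# Stand-in for the input source.  The text and chunk size are arbitrary.
source = StringIO.new("abc 1 def 2 def 3")

# Simulate the lazy String: keep reading until #read returns nil (EOF).
buffer = +""
chunks_read = 0
while (chunk = source.read(4))
  buffer << chunk
  chunks_read += 1
end

# Greedy: .* runs to the end of the data, then backtracks to the last "def".
greedy = /abc.*def/.match(buffer)

# Non-greedy, for contrast: stops at the first "def".
lazy = /abc.*?def/.match(buffer)

puts greedy[0]   # abc 1 def 2 def
puts lazy[0]     # abc 1 def
```

The greedy match is the same one you'd get from a String that was
fully in memory from the start; the only difference is when the data
arrives.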

> But if the regex were /abc.*?def/, then as soon as there was a "yes",
> you could stop.

> There's also a question of: if the first 4096 bytes started with "abc"
> and ended with "de", then you'd add the next 4096 -- but you'd have to
> perform the match again.  Or else you'd have to know to rewind by
> exactly two characters.  But if you're changing where you start the
> match, that could affect how anchors worked.

Why?  If the first character in the next 4096 bytes is an "f" we'd have a
match.  If we're using .*? we are done.  If we're using .* we
wouldn't have begun matching the "de" against the /de/.  What would be
an issue would be how to treat MatchData#post_match.  It'd have to be
the remaining data that wasn't matched at the time of a match, not all
the possible data that may come from the source.
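
A rough sketch of what I mean for post_match.  The helper first_match
is hypothetical, not an existing API, and it cheats: it re-runs the
match on a growing buffer instead of having the Regexp pull data
itself.  That shortcut is only safe for patterns like /abc.*?def/,
where a match found in a prefix of the input is also the final answer.
The sample text and the chunk size of 5 are made up.

```ruby
require 'stringio'

# Hypothetical helper: read chunk by chunk and stop at the first match.
# Returns the MatchData, or nil if the source runs dry without a match.
def first_match(io, re, chunk_size)
  buffer = +""
  while (chunk = io.read(chunk_size))   # nil at EOF ends the loop
    buffer << chunk
    m = re.match(buffer)
    return m if m
  end
  nil
end

source = StringIO.new("abc--def and more abcdef")
m = first_match(source, /abc.*?def/, 5)

puts m[0]                 # abc--def
puts m.post_match.inspect # " a"  (only what had been read so far)

# The rest is still sitting unread in the source.
unread = source.read
puts unread.inspect       # "nd more abcdef"
```

So post_match here is just the tail of the buffer at the time of the
match; everything else stays in the source for whoever reads next.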

        nikolai

-- 
Nikolai Weibull: now available free of charge at http://bitwi.se/!
Born in Chicago, IL USA; currently residing in Gothenburg, Sweden.
main(){printf(&linux["\021%six\012\0"],(linux)["have"]+"fun"-97);}