--8323328-1498695871-11311057621229
Content-Type: MULTIPART/MIXED; BOUNDARY="8323328-1498695871-1131105762=:21229"

  This message is in MIME format.  The first part should be readable text,
  while the remaining parts are likely unreadable without MIME-aware tools.

--8323328-1498695871-11311057621229
Content-Type: TEXT/PLAIN; charset=X-UNKNOWN; format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE

Hi --

On Fri, 4 Nov 2005, Nikolai Weibull wrote:

> David A. Black wrote:
>
>> On Fri, 4 Nov 2005, Nikolai Weibull wrote:
>
>>> David A. Black wrote:
>
>>>> I'm thinking of cases like this:
>>>>
>>>>   re = /abc.*def/
>>>>
>>>> The first chunk out of the file might match this -- but then you'd
>>>> have to keep going, really until EOF, to get the greedy match if
>>>> it's there.  Then you'd have to go back.
>
>>> Well, think of it like this instead.  The Regexp simply reads from
>>> the input source when it needs more data.  The Regexp will
>>> concatenate the new data with the old and continue on its matching
>>> routine.  We build the input as we go along, i.e., wee in a sense
>>> dealing with implementing lazy Strings.  This won cause any issues
>>> with backtracking, as the data will still be there.
>
>> In the /abc.*def/ case, though, you'd always have to take all the
>> input (at least up to the third-to-last character in the file), even
>> if you had an intermediate match.  So "needs more data" would not be
>> something the regex could tell you.  It would say, "Yes, there's a
>> match", but you would have to know that the "yes" didn't mean you
>> could stop.
>
> .* needs more data until there is no more data (#read returns nil), then
> it fails as it hasn been able to match 'def' and backtracks until that
> part of the regex does.  Then it has a match.  (This ignores newline
> conventions, but let ignore them for now.)  You have the same problem
> when doing this on a regular string.

I understand what .* means :-)  My point is that you would have to
have some mechanism for telling the file stream that the regex was
doing a greedy .* and therefore needed more.  I'm not saying it's
impossible -- just that it changes the whole concept of the state of a
regex.

>> But if the regex were /abc.*?def/, then as soon as there was a "yes",
>> you could stop.
>
>> There's also a question of: if the first 4096 bytes started with "abc"
>> and ended with "de", then you'd add the next 4096 -- but you'd have to
>> perform the match again.  Or else you'd have to know to rewind by
>> exactly two characters.  But if you're changing where you start the
>> match, that could affect how anchors worked.
>
> Why?  If the first character in the next 4096 bytes is a "f" weave a
> match.

Again, I understand that def matches /def/ :-)  But I'm
troubleshooting the process by which you would determine that you have
a match.  Illustrating with chunks of eight bytes:

   str = "abc123de"
   /abc.*?def/.match(str)  # false
   str = "abc123def1234567"
   /abc.*?def/.match(str)  # true

My point is that you've done the match twice from the beginning.  This
could get very inefficient, if you have to re-match constantly.

You'd probably have to arrange to store where the match failed (in the
regex, not in the string) and resume from there.  I'm not sure whether
that's possible in general, but I'm sure the answer is known.

> If wee using .*? wee are done.  If wee using .* we
> wouldn have begun matching the "de" against the /de/.  What would be
> an issue would be how to treat MatchData#post_match.  It have to be
> the remaining data that wasn matched at the time of a match, not all
> the possible data that may come from the source.

That makes it a lot less transparent: post_match would be, in the def
example, 4095 bytes from a file.  That seems very "implementation
dependent".


David

-- 
David A. Black
dblack / wobblini.net
--8323328-1498695871-11311057621229--
--8323328-1498695871-11311057621229--