On 7/24/07, seijin / gmail.com <seijin / gmail.com> wrote:
>
> First, why does ... show_regexp("banana", /(an)*/) ... not match
> "anan" ?  I thought it was a greedy algorithm that tried to match as
> much as possible?

It is greedy, but the engine also returns the *leftmost* available
match, and at position 0 of 'banana' the only thing (an)* can match is
the empty string, because * accepts 0 or more iterations. + requires at
least one iteration, so it will work as expected:

>> show_regexp("banana", /(an)+/)
=> "b<<anan>>a"
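
In case it helps to experiment, here is a minimal sketch of a
show_regexp helper. Its definition is not shown in this thread, so the
body below is an assumption: it simply wraps the matched text in << >>
and returns "no match" otherwise.

```ruby
# Assumed definition of show_regexp (not taken from this thread):
# returns the string with the matched portion wrapped in << >>.
def show_regexp(string, pattern)
  match = pattern.match(string)
  return "no match" if match.nil?

  "#{match.pre_match}<<#{match[0]}>>#{match.post_match}"
end

# * is satisfied by zero iterations, so the leftmost match is the
# empty string in front of the "b".
puts show_regexp("banana", /(an)*/)   # <<>>banana

# + needs at least one "an", so the engine has to move on to index 1.
puts show_regexp("banana", /(an)+/)   # b<<anan>>a
```

The two calls differ only in the quantifier, which makes it easy to see
that the leftmost rule, not a lack of greediness, produces the empty
match.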

> Second, how does ... show_regexp("Mississippi", /(\w+)\1/) ... work?
> Why in the world does it match "ississ" rather than returning no
> match?  I think most of my problem with this one is not understanding
> the underlying logic used when doing pattern matching with back
> references.  Does it go through and check "Mississippi", "Mississipp"
> down to "M" and then "i", "is", "iss", etc?  Like trying all possible
> combinations in a lock?  I would be extremely grateful if someone
> would do a short step-by-step of how it matches "ississ".  I
> understand what the "\1" does, I just don't understand how the first
> part even gets to the first "iss".

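(\w+)\1 matches one or more word characters, followed immediately by
the same text again. The empty string doesn't satisfy the "one or more"
bit. This is just the opposite case to your last question - (\w*)\1
matches the empty string. What the engine does is start at M, where the
greedy (\w+) first grabs all of Mississippi, leaving nothing for \1, so
it backs off to Mississipp, Mississip, and so on down to M, and \1
fails every time. It then moves on to the next starting position and
tries ississippi, ississipp, ... down to iss, where \1 finds a second
iss right after it and the match suddenly succeeds. Illustrative
examples: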

>> show_regexp("Mississippi", /(\w*)\1/)
=> "<<>>Mississippi"
>> show_regexp("MississippiMississippi", /(\w*)\1/)
=> "<<MississippiMississippi>>"
>> show_regexp("bbMississippiMississippi", /(\w*)\1/)
=> "<<bb>>MississippiMississippi"
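
To make the step-by-step order for (\w+)\1 concrete, here is a rough
brute-force sketch. trace_doubled_word is a hypothetical helper written
for this reply, not how the regex engine is actually implemented (a
real engine may optimize); it only mimics the order in which candidates
for the group are tried: leftmost start first, longest capture first.

```ruby
# Hypothetical helper: imitates the candidate order a backtracking
# engine follows for /(\w+)\1/. Returns [attempts, start, match].
def trace_doubled_word(string)
  attempts = []
  (0...string.length).each do |start|
    # Longest run of word characters beginning at this position.
    run = string[start..-1][/\A\w+/]
    next if run.nil?

    # Greedy: try the longest capture first, then give back one
    # character at a time.
    run.length.downto(1) do |len|
      capture = string[start, len]
      attempts << capture
      rest = string[(start + len)..-1]
      # \1 succeeds only if the captured text repeats immediately.
      return [attempts, start, capture * 2] if rest.start_with?(capture)
    end
  end
  [attempts, nil, nil]
end

attempts, start, match = trace_doubled_word("Mississippi")
puts attempts.join(", ")
puts "matched #{match.inspect} at index #{start}"
```

For "Mississippi" the list of attempts runs from Mississippi down to M,
then from ississippi down to iss, which is where the search stops.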

martin