On 7/24/07, seijin / gmail.com <seijin / gmail.com> wrote:
>
> First, why does ... show_regexp("banana", /(an)*/) ... not match
> "anan" ? I thought it was a greedy algorithm that tried to match as
> much as possible?

It does, but it also returns the *first* available match, and the first place /(an)*/ can match in 'banana' is the very start of the string, where it matches the empty string. Note that * matches 0 or more iterations. + will work as expected:

>> show_regexp("banana", /(an)+/)
=> "b<<anan>>a"

> Second, how does ... show_regexp("Mississippi", /(\w+)\1/) ... work?
> Why in the world does it match "ississ" rather than returning no
> match? I think most of my problem with this one is not understanding
> the underlying logic used when doing pattern matching with back
> references. Does it go through and check "Mississippi", "Mississipp"
> down to "M" and then "i", "is", "iss", etc? Like trying all possible
> combinations in a lock? I would be extremely grateful if someone
> would do a short step-by-step of how it matches "ississ". I
> understand what the "\1" does, I just don't understand how the first
> part even gets to the first "iss".

(\w+)\1 matches one or more word characters, twice in a row. The empty string doesn't satisfy the "one or more" bit. This is just the opposite case to your last question - (\w*)\1 matches the empty string.

What the engine does is start at M and let (\w+) grab as much as it can, then try \1 against whatever follows. That fails, so it backs off one character at a time (Mississippi, Mississipp, ... down to M), retrying \1 after each step. None of those is followed by a copy of itself, so it moves the starting point on to i and does the same thing again: ississippi, ississipp, ... until it gets down to iss, which is followed by another iss, and it suddenly succeeds.

Illustrative examples:

>> show_regexp("Mississippi", /(\w*)\1/)
=> "<<>>Mississippi"

>> show_regexp("MississippiMississippi", /(\w*)\1/)
=> "<<MississippiMississippi>>"

>> show_regexp("bbMississippiMississippi", /(\w*)\1/)
=> "<<bb>>MississippiMississippi"

martin
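For anyone who wants to reproduce the results in this thread, here is a rough sketch. It assumes show_regexp is a small helper that wraps the matched text in << >>, which is only inferred from the outputs quoted above; the actual definition isn't shown in the thread. The second method, first_doubled, is a hypothetical hand-rolled stand-in for what a backtracking engine does with /(\w+)\1/ on a string made up of word characters: leftmost starting position first, longest candidate for the group first. It is not how the regexp engine is actually implemented.

```ruby
# Assumed helper, reconstructed from the outputs quoted in the thread:
# wraps the matched text in << >>, or reports "no match".
def show_regexp(string, pattern)
  m = pattern.match(string)
  return "no match" unless m
  "#{m.pre_match}<<#{m[0]}>>#{m.post_match}"
end

# Hypothetical simulation of /(\w+)\1/ on a string of word characters.
# Tries each start position from the left; at each one, tries the longest
# group first and shrinks it by one character after every failure.
# Returns [start, group] for the first success, or nil if there is none.
def first_doubled(string)
  (0...string.length).each do |start|
    (string.length - start).downto(1) do |len|
      group = string[start, len]
      # \1 has to match the same text immediately after the group
      return [start, group] if string[start + len, len] == group
    end
  end
  nil
end

puts show_regexp("banana", /(an)*/)         # <<>>banana
puts show_regexp("banana", /(an)+/)         # b<<anan>>a
puts show_regexp("Mississippi", /(\w+)\1/)  # M<<ississ>>ippi
p first_doubled("Mississippi")              # [1, "iss"]
```

The simulation gives up on every candidate starting at M before it ever looks at i, which is why the match is "ississ" at position 1 rather than, say, "ss" or "pp" further along.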