On Jan 20, 7:04 pm, Dave Thomas <d... / pragprog.com> wrote:
> My hack at the substring problem (nice one, James) is based on suffix
> trees.
>
> Suffix trees and suffix arrays and a great tool for substring
> operations. The idea is to create a list of all the possible suffixes
> in the original string. For "banana" this would be
>
> banana
> anana
> nana
> ana
> na
> a
>
> Because the list contains an entry starting at every position in the
> string, we know that all possible substrings of the original string
> must occur at the start of one of these list elements. The substring
> "nan", for example, occurs at the start of the 3rd entry.
>
> Now sort them
>
> a
> ana
> anana
> banana
> na
> nana
>
> Now we know that all substrings that start with the same sequence of
> characters must occur at the start of items in the list that are
> adjacent. Searching through to list to find these is a linear operation.

Hi Dave,

Thanks for the brilliant idea--this time I really feel like I've
learned something new! :)

I even took my time to implement it in C (80 lines) and I must admit
that ruby performs very well in this case: my C version is only about
6 times faster than your ruby...

However, with the algorithm presented, I don't see how one can obtain
the correct result for the "aaaa" case.  Lets see; the suffix list
would be:

aaaa
aaa
aa
a

now, sorted:

a
aa
aaa
aaaa

And the catch is--in order to obtain the correct result, that is "aa",
one needs to observe "aa" and "aaaa", but these are not adjacent.  In
this case two pairs of adjacent suffixes ("aa" and "aaa", "aaa" and
"aaaa") might give "aa", but they do overlap.

So do we need a special-case treatment here?

--
Cheers,
Alex