On 3/20/07, Clifford Heath <no.spam / please.net> wrote:
> Trans wrote:
> > Really? I find that surprising actually. I would not think a sorting
> > algorithm could lead to any significant compression. It just doesn't
> > make sense under conservation of information. I assume it's due to
> > deficiencies in deflate then.
>
> deflate works by substituting short codes for repeated text, then
> Huffman packing the codes and the first instance of each text.
> BWT creates more repetitions by grouping similar text, so the
> compression has more to chew on.

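For what it's worth, here's a rough illustration using the Zlib that
ships with Ruby. The shuffled string has exactly the same characters as
the repetitive one, but almost no repeated substrings left for deflate
to replace, so it compresses far worse:

  require 'zlib'

  # deflate lives off repeated substrings
  repetitive = "the cat sat on the mat " * 200
  shuffled   = repetitive.chars.shuffle.join  # same characters, repeats destroyed

  puts Zlib::Deflate.deflate(repetitive).size  # small: long matches everywhere
  puts Zlib::Deflate.deflate(shuffled).size    # much larger: little left beyond Huffman coding
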
And the "gigantomania" example given isn't a particularly good
demonstration of how the transform tends to cluster repetitions.

Quoting from the Wikipedia article:

"To understand why this creates more-easily-compressible data, let's
consider transforming a long English text frequently containing the
word "the". Sorting the rotations of this text will often group
rotations starting with "he " together, and the last character of that
rotation (which is also the character before the "he ") will usually
be "t", so the result of the transform would contain a number of "t"
characters along with the perhaps less-common exceptions (such as if
it contains "Brahe ") mixed in. So it can be seen that the success of
this transform depends upon one value having a high probability of
occurring before a sequence, so that in general it needs fairly long
samples (a few kilobytes at least) of appropriate data (such as
text)."

The article uses the word banana, which is a better example.
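Here's a naive (O(n^2 log n)) sketch of the transform for that case,
using "$" as an end-of-string sentinel assumed not to occur in the
input:

  # build every rotation, sort them, keep the last column
  def bwt(str)
    s = str + "$"
    rotations = (0...s.length).map { |i| s[i..-1] + s[0...i] }
    rotations.sort.map { |r| r[-1, 1] }.join
  end

  puts bwt("banana")   # => "annb$aa"

The n's (each the character before an "a") and the trailing a's (each
the character before an "n") come out grouped. And unlike simply
sorting the characters outright, the transform is reversible, so there
is no conflict with the conservation-of-information point above -- it
only reorders the data into a shape deflate handles better.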
-- 
Rick DeNatale

My blog on Ruby
http://talklikeaduck.denhaven2.com/