On Mar 21, 8:47 am, "Rick DeNatale" <rick.denat... / gmail.com> wrote:
> On 3/20/07, Clifford Heath <no.s... / please.net> wrote:
>
> > Trans wrote:
> > > Really? I find that surprising actually. I would not think a sorting
> > > algorithm could lead to any significant compression. It just doesn't
> > > make sense under conservation of information. I assume it's due to
> > > deficiencies in deflate then.
>
> > deflate works by substituting short codes for repeated text, then
> > Huffman packing the codes and the first instance of each text.
> > BWT creates more repetitions by grouping similar text, so the
> > compression has more to chew on.
>
> And the "gigantomania" example given isn't a particularly good
> demonstration of how the transform tends to cluster repetitions.
>
> Quoting from the wikipedia article:
>
> "To understand why this creates more-easily-compressible data, let's
> consider transforming a long English text frequently containing the
> word "the". Sorting the rotations of this text will often group
> rotations starting with "he " together, and the last character of that
> rotation (which is also the character before the "he ") will usually
> be "t", so the result of the transform would contain a number of "t"
> characters along with the perhaps less-common exceptions (such as if
> it contains "Brahe ") mixed in. So it can be seen that the success of
> this transform depends upon one value having a high probability of
> occurring before a sequence, so that in general it needs fairly long
> samples (a few kilobytes at least) of appropriate data (such as
> text)."
>
> The article uses the word bananna which is a better example.

I think one needs to be careful in understanding this though. The
sorting only provides a benefit in contrast to less efficient
algorithms. No amount of reordering will ever make a difference to the
lower bound of compressibility. That's the crux of my point. And
hence, for a better algorithm, this kind of sorting should be a moot
point. (Of course, one could argue that this sorting Is the better
algorithm -- fair enough.)

T.


> --
> Rick DeNatale
>
> My blog on Rubyhttp://talklikeaduck.denhaven2.com/