On Mar 21, 8:47 am, "Rick DeNatale" <rick.denat... / gmail.com> wrote: > On 3/20/07, Clifford Heath <no.s... / please.net> wrote: > > > Trans wrote: > > > Really? I find that surprising actually. I would not think a sorting > > > algorithm could lead to any significant compression. It just doesn't > > > make sense under conservation of information. I assume it's due to > > > deficiencies in deflate then. > > > deflate works by substituting short codes for repeated text, then > > Huffman packing the codes and the first instance of each text. > > BWT creates more repetitions by grouping similar text, so the > > compression has more to chew on. > > And the "gigantomania" example given isn't a particularly good > demonstration of how the transform tends to cluster repetitions. > > Quoting from the wikipedia article: > > "To understand why this creates more-easily-compressible data, let's > consider transforming a long English text frequently containing the > word "the". Sorting the rotations of this text will often group > rotations starting with "he " together, and the last character of that > rotation (which is also the character before the "he ") will usually > be "t", so the result of the transform would contain a number of "t" > characters along with the perhaps less-common exceptions (such as if > it contains "Brahe ") mixed in. So it can be seen that the success of > this transform depends upon one value having a high probability of > occurring before a sequence, so that in general it needs fairly long > samples (a few kilobytes at least) of appropriate data (such as > text)." > > The article uses the word bananna which is a better example. I think one needs to be careful in understanding this though. The sorting only provides a benefit in contrast to less efficient algorithms. No amount of reordering will ever make a difference to the lower bound of compressibility. That's the crux of my point. And hence, for a better algorithm, this kind of sorting should be a moot point. (Of course, one could argue that this sorting Is the better algorithm -- fair enough.) T. > -- > Rick DeNatale > > My blog on Rubyhttp://talklikeaduck.denhaven2.com/