On 3/20/07, Clifford Heath <no.spam / please.net> wrote: > Trans wrote: > > Really? I find that surprising actually. I would not think a sorting > > algorithm could lead to any significant compression. It just doesn't > > make sense under conservation of information. I assume it's due to > > deficiencies in deflate then. > > deflate works by substituting short codes for repeated text, then > Huffman packing the codes and the first instance of each text. > BWT creates more repetitions by grouping similar text, so the > compression has more to chew on. And the "gigantomania" example given isn't a particularly good demonstration of how the transform tends to cluster repetitions. Quoting from the wikipedia article: "To understand why this creates more-easily-compressible data, let's consider transforming a long English text frequently containing the word "the". Sorting the rotations of this text will often group rotations starting with "he " together, and the last character of that rotation (which is also the character before the "he ") will usually be "t", so the result of the transform would contain a number of "t" characters along with the perhaps less-common exceptions (such as if it contains "Brahe ") mixed in. So it can be seen that the success of this transform depends upon one value having a high probability of occurring before a sequence, so that in general it needs fairly long samples (a few kilobytes at least) of appropriate data (such as text)." The article uses the word bananna which is a better example. -- Rick DeNatale My blog on Ruby http://talklikeaduck.denhaven2.com/