Trans wrote:
> On Mar 22, 6:00 am, Clifford Heath <n... / spam.please.net> wrote:
>> Your analysis is well-meaning, but flawed.
> General compression is all about removing repetition.

This is a very concise statement of what's flawed in your analysis.
It's not just about repetition; it's about removing *whatever* is
predictable. Anything is fair game when it comes to improving prediction.
If I cared enough to write a compressor that could recognise
Somerset Maugham's writing style and vocabulary, it could adapt
its general compression algorithm to that style and compress it
better. Sure, the decompressor has to know his style as well, but
that's allowed - it can be a general compressor with specific modes
(like video, i386 opcodes, etc., as I mentioned).
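To make "removing whatever is predictable" concrete, here's a rough Python
sketch (illustrative only; the sample sentence is made up). It compares the
bits needed to code a text under a model that predicts each character
independently against a model that predicts each character from the one
before it. The better predictor needs fewer bits, with no repetition-matching
involved at all:

```python
import math
from collections import Counter, defaultdict

def order0_bits(text):
    # Bits needed when each character is coded independently
    # (its probability estimated from overall frequency).
    counts = Counter(text)
    n = len(text)
    return -sum(c * math.log2(c / n) for c in counts.values())

def order1_bits(text):
    # Bits needed when each character is predicted from the
    # preceding one - a strictly better-informed predictor.
    contexts = defaultdict(Counter)
    for prev, cur in zip(text, text[1:]):
        contexts[prev][cur] += 1
    bits = 0.0
    for counts in contexts.values():
        total = sum(counts.values())
        bits += -sum(c * math.log2(c / total) for c in counts.values())
    return bits

sample = "the rain in spain stays mainly in the plain " * 20
```

On English-like text, `order1_bits(sample)` comes out well below
`order0_bits(sample)`: the structure was always there, the second model
just predicts it better.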

> Try assign a
> number to every word in the dictionary and "compress" a piece of text
> with it. It won't help.

Funny you should say that, because I did that once: I built a dictionary
of the King James Bible (10.5K words) and Huffman-coded the words. It
worked very well, thanks very much. It was better than deflate for the
purpose, because you can start decompressing anywhere (though that's not
relevant to overall compression goodness).
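The scheme is easy to sketch in Python. This is not my original code, just
a toy reconstruction, and the sample text is a stand-in for the real
corpus. It builds Huffman code lengths over a word dictionary and compares
the coded size against a fixed-length code of ceil(log2(vocabulary)) bits
per word:

```python
import heapq
import math
from collections import Counter

def huffman_code_lengths(freqs):
    # Standard Huffman construction: repeatedly merge the two
    # lightest subtrees; each merge adds one bit to every code
    # inside the merged subtrees. We only track code lengths.
    heap = [(f, i, {sym: 0}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)  # tie-breaker so dicts are never compared
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in a.items()}
        merged.update({s: d + 1 for s, d in b.items()})
        heapq.heappush(heap, (fa + fb, tie, merged))
        tie += 1
    return heap[0][2]

# Toy stand-in for the real 10.5K-word dictionary.
text = ("in the beginning god created the heaven and the earth "
        "and the earth was without form and void")
words = text.split()
freqs = Counter(words)
lengths = huffman_code_lengths(freqs)
huffman_bits = sum(freqs[w] * lengths[w] for w in freqs)
# Baseline: fixed-length code, ceil(log2(vocab size)) bits per word.
fixed_bits = len(words) * math.ceil(math.log2(len(freqs)))
```

Because every codeword boundary falls on a whole word, you can start
decoding at any word index given the dictionary - which is exactly why it
beat deflate for that use.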

> "How it does it" is the interesting part. How does BWT improve
> deflate? Ie. What's that tell us about deflate?

It tells us that there's more to compressing typical data sets
than simply identifying repetition - there is data that encodes
higher-order patterns which are not directly visible as repetition,
but are visible to a better predictor.
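You can watch BWT do exactly this with a naive Python sketch (the input
string is just an arbitrary example). Sorting all rotations groups each
character by the context that follows it, so higher-order structure
becomes literal runs of identical characters - the kind of repetition a
deflate-style coder can then exploit:

```python
def bwt(s):
    # Naive Burrows-Wheeler transform: sort every rotation of the
    # string and read off the last column. The '\0' sentinel marks
    # the original row, which is what makes the transform invertible.
    s = s + "\0"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def runs(s):
    # Count maximal runs of identical adjacent characters.
    return sum(1 for i in range(len(s)) if i == 0 or s[i] != s[i - 1])

text = "banana bandana banana bandana"
transformed = bwt(text)
```

The transform is just a permutation of the input (plus the sentinel), yet
`runs(transformed)` is much smaller than `runs(text)`: no information was
removed, it was merely rearranged so that a repetition-based predictor can
finally see it.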