I used a bit of freedom:
* It reads from STDIN instead of ARGV.
* Output is binary instead of 0-and-1-as-text.
* Both encoding and decoding, for the extra points...
* In order to avoid juggling with padding characters, the least
frequent character is 1110 instead of 111 (in the example in
the quiz). Every "character" in the encoded bit stream ends
with a 0. Yes, it costs one bit, but since it's not used
frequently, it doesn't really hurt.
* 0 and 1 become 1 and 0, respectively, because we get the
padding zeroes for free when we do "001".pack("B*"). So,
rewriting the previous section: In order to avoid juggling
with padding characters, the least frequent character is 0001
instead of 000. Every "character" in the encoded bit stream
ends with a 1.
Here's the code: http://tinyurl.com/3x5nsy
Example:
$ echo -n Scooby-Dooby-Doo | ruby huffman.rb | ruby huffman.rb -d
Compression: 14/16 (87%)
Decompression: 14/16 (87%)
Scooby-Dooby-Doo
How does it work?
First of all, we count each unique byte in the original byte
stream (e.g. "Scooby-Dooby-Doo"):
{"-"=>2, "b"=>2, "y"=>2, "c"=>1, "D"=>2, "o"=>6, "S"=>1}
... which is sorted:
[["o", 6], ["-", 2], ["D", 2], ["b", 2], ["y", 2], ["S", 1], ["c",
1]]
... and cleaned:
["o", "-", "D", "b", "y", "S", "c"]
The leftmost character is used most often and the rightmost
character is used least often. We call this ordered list the
"alphabet".
Secondly, we build the translation table, where each character
in the "alphabet" is mapped to a bit sequence:
{"b"=>"0001", "-"=>"01", "c"=>"0000001", "y"=>"00001", "D"=>"001",
"o"=>"1", "S"=>"000001"}
We can translated each byte of the original byte stream:
["S", "c", "o", "o", "b", "y", "-", "D", "o", "o", "b", "y", "-",
"D", "o", "o"]
... to the corresponding bit stream:
["000001", "0000001", "1", "1", "0001", "00001", "01", "001", "1",
"1", "0001", "00001", "01", "001", "1", "1"]
... which is joined into a string:
"00000100000011100010000101001110001000010100111"
... which is packed (with pack("B*")) into another string:
"\004\016!N!N"
... which we'll call the "encoded bit stream".
In order to be able to decode the message, we have to serialize
the "alphabet" and concatenate it with the "encoded bit stream":
alphabet + bit stream -> message
"o-DbySc" + "\004\016!N!N" -> "o-DbySc\004\016!N!N"
But where's the boundary between "alphabet" and "encoded bit
stream"? Well, let's add the length of the "alphabet"
([7]..pack("C")):
len + alphabet + bit stream -> message
"\a" + "o-DbySc" + "\004\016!N!N" -> "\ao-DbySc\004\016!N!N"
Conclusion? "Scooby-Dooby-Doo" becomes "\ao-DbySc\004\016!N!N".
Well, we've saved 2 bytes, but it doesn't sound that well...
gegroet,
Erik V. - http://www.erikveen.dds.nl/