Command line application to try Huffman code for lossless file compression.
Important
The cli still supports txt
files only
Huffman code is a type of optimal prefix code. The process of finding or using such a code is Huffman coding, an algorithm developed by David A. Huffman. Huffman algorithm is a greedy algorithm that uses a frequency-sorted binary tree and generates "prefix-free codes", in other words the bit string representing some particular symbol is never a prefix of the bit string representing any other symbol.
Suppose that we have to compress a file that contains the following sentence: "Nel mezzo del cammin", the characters are:
- z: 2 (11.76%)
- o: 1 (5.88%)
- l: 2 (11.76%)
- d: 1 (5.88%)
- e: 3 (17.65%)
- c: 1 (5.88%)
- a: 1 (5.88%)
- i: 1 (5.88%)
- n: 2 (11.76%)
- m: 3 (17.65%)
Characters are lower cased and the space character is not included, this is only for example purposes.
Note that
- ASCII encoding takes 8 bit for each char (8n bit)
- Alphabet encoding takes 4 bit for each char (4n bit)
- 0000: z
- 0001: o
- and so on
What about Huffman? let's see
The example leads to the following dictionary:
- i: 000
- l: 001
- z: 010
- n: 011
- d: 1000
- o: 1001
- c: 1010
- a: 1011
- e: 110
- m: 111
The number of bits required is
Note that
git clone https://github.com/micheledinelli/huffman-coding.git
go build -o huffman
Once built list all the available commands with
./huffman --help
Along with the source code there is also a file named divine-comedy-it.txt
which is the divine comedy by Dante Alighieri in Italian in .txt
format. It can be used to test the Huffman code.
./huffman c divine-comedy-it.txt
divine-comedy-it.txt
is compressed in divine-comedy-it.txt.bin
which is 280KB almost half of the size of the original one (543KB).
The cli also outputs metadata (divine-comedy-it.txt.metadata
), the unique encoding for a specific file in order to be able to decompress it later.