huffman-coding

Command line application to try Huffman code for lossless file compression.

Important

The cli still supports txt files only

Huffman code is a type of optimal prefix code. The process of finding or using such a code is Huffman coding, an algorithm developed by David A. Huffman. Huffman algorithm is a greedy algorithm that uses a frequency-sorted binary tree and generates "prefix-free codes", in other words the bit string representing some particular symbol is never a prefix of the bit string representing any other symbol.

How it works

Suppose that we have to compress a file that contains the following sentence: "Nel mezzo del cammin", the characters are:

z: 2 (11.76%)
o: 1 (5.88%)
l: 2 (11.76%)
d: 1 (5.88%)
e: 3 (17.65%)
c: 1 (5.88%)
a: 1 (5.88%)
i: 1 (5.88%)
n: 2 (11.76%)
m: 3 (17.65%)

Characters are lower cased and the space character is not included, this is only for example purposes.

Note that

ASCII encoding takes 8 bit for each char (8n bit)
Alphabet encoding takes 4 bit for each char (4n bit)
- 0000: z
- 0001: o
- and so on

What about Huffman? let's see

The example leads to the following dictionary:

i: 000
l: 001
z: 010
n: 011
d: 1000
o: 1001
c: 1010
a: 1011
e: 110
m: 111

The number of bits required is

$$\begin{aligned} ((11.76\% * 3) + (5.88\% * 4) + (11.76\% * 3) + (5.88\% * 4) &+ \\\ (17.65\% * 3) + (5.88\% * 4) + (5.88\% * 4) + (5.88\% * 3) &+ \\\ (11.76\% * 3) + (17.65\% * 3))*n &= \\\ &= 3.2346n \\\ \end{aligned}$$

Note that

$$3.2346n < 4n < 8n$$

How to try

git clone https://github.com/micheledinelli/huffman-coding.git

go build -o huffman

Once built list all the available commands with

./huffman --help

Along with the source code there is also a file named divine-comedy-it.txt which is the divine comedy by Dante Alighieri in Italian in .txt format. It can be used to test the Huffman code.

./huffman c divine-comedy-it.txt

divine-comedy-it.txt is compressed in divine-comedy-it.txt.bin which is 280KB almost half of the size of the original one (543KB). The cli also outputs metadata (divine-comedy-it.txt.metadata), the unique encoding for a specific file in order to be able to decompress it later.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
cmd		cmd
encoder		encoder
img		img
queue		queue
workers		workers
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
divine-comedy-it.txt		divine-comedy-it.txt
go.mod		go.mod
go.sum		go.sum
huffman.go		huffman.go

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

huffman-coding

How it works

How to try

About

Languages

License

micheledinelli/huffman-coding

Folders and files

Latest commit

History

Repository files navigation

huffman-coding

How it works

How to try

About

Topics

Resources

License

Stars

Watchers

Forks

Languages