How Does Data Compression Work?
A 7-minute read
Zip files do not actually shrink data. They find patterns and write shorthand for them, saving space without losing a single bit.
Every file on your computer is stored as a long string of ones and zeros. Data compression is the art of rewriting that string into a shorter one without losing the ability to get the original back.
The short answer
Data compression finds patterns in data and replaces repeated patterns with shorter references. Lossless compression lets you reconstruct the original exactly. Lossy compression discards detail that human senses are unlikely to notice, achieving much smaller files at the cost of perfect reconstruction.
The full picture
Finding patterns
Imagine you receive a note that says:
AAAAABBBBBCCCCCDAAAAABBBBBCCCCCDAAAAABBBBBCCCCC
You could rewrite it as:
5A5B5C1D5A5B5C1D5A5B5C
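You communicated the same information using fewer characters: 22 instead of 47. This trick is called run-length encoding, and it is one of the simplest forms of compression. Real compressors work on the same principle, except that they look for patterns in binary data rather than letters.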
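The counting trick above can be sketched in a few lines of Python. This is a minimal illustration, not how any real archive format works; the function names rle_encode and rle_decode are made up for this example, and the sketch assumes the input text contains no digits.

```python
from itertools import groupby

def rle_encode(text):
    """Collapse each run of identical characters into '<count><char>'."""
    return "".join(f"{len(list(run))}{char}" for char, run in groupby(text))

def rle_decode(encoded):
    """Expand '<count><char>' pairs back into the original runs."""
    out = []
    count = ""
    for ch in encoded:
        if ch.isdigit():
            count += ch                  # a count may span several digits
        else:
            out.append(ch * int(count))  # repeat the character 'count' times
            count = ""
    return "".join(out)

note = "AAAAABBBBBCCCCCDAAAAABBBBBCCCCCDAAAAABBBBBCCCCC"
packed = rle_encode(note)
print(packed, len(note), "->", len(packed))
print(rle_decode(packed) == note)  # the round trip loses nothing
```

Run-length encoding only pays off when the data contains long runs. On text with few repeats, every character gains a count in front of it and the "compressed" version comes out longer.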
Lossless compression
Lossless compression lets you reconstruct the original file exactly. The original ones and zeros come back byte-for-byte. This is what you use for software, documents, and spreadsheets.
The most common technique is called dictionary-based compression, made famous by the Lempel-Ziv algorithm family (LZ77, LZ78, LZW). The idea is simple: whenever you see a repeated pattern, you replace it with a shorter reference to where it appeared before.
Consider the sentence:
the quick brown fox jumps over the lazy dog
The word “the” appears twice. A compressor might replace the second “the” with a reference that says “go back 31 characters and copy 3 characters”. On a sentence this short the saving is tiny, but across a long document such references add up.
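A toy version of this back-reference idea might look like the following. It is a sketch in the spirit of LZ77, not the real algorithm: it searches the entire earlier text rather than a limited window, it stores tokens as Python objects instead of bits, and the names find_back_reference, lz_compress, and lz_decompress are invented for illustration.

```python
def find_back_reference(text, pos, min_len=3):
    """Return (distance, length) of the longest earlier match for text[pos:], or None."""
    best = None
    for start in range(pos):
        length = 0
        while (pos + length < len(text)
               and text[start + length] == text[pos + length]):
            length += 1
        if length >= min_len and (best is None or length > best[1]):
            best = (pos - start, length)
    return best

def lz_compress(text, min_len=3):
    """Turn text into a list of literal characters and (distance, length) references."""
    tokens = []
    pos = 0
    while pos < len(text):
        ref = find_back_reference(text, pos, min_len)
        if ref:
            tokens.append(ref)
            pos += ref[1]            # skip over the characters the reference covers
        else:
            tokens.append(text[pos])
            pos += 1
    return tokens

def lz_decompress(tokens):
    """Rebuild the text by copying from what has already been written."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            distance, length = tok
            for _ in range(length):
                out.append(out[-distance])  # one character at a time, so overlaps work
        else:
            out.append(tok)
    return "".join(out)

sentence = "the quick brown fox jumps over the lazy dog"
tokens = lz_compress(sentence)
print(tokens)
print(lz_decompress(tokens) == sentence)
```

Because the search is greedy, the reference it finds for the second “the” also swallows the space that follows the word, so the copied length is one more than the word itself.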
Huffman coding is another technique. It assigns shorter codes to common symbols. In English text, the letter “E” appears far more often than “Z”. A Huffman compressor might use just 3 bits for “E” while using 10 bits for “Z”, saving space across millions of characters.
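Here is a small sketch of how such codes can be built. It follows the standard Huffman construction (repeatedly merge the two least frequent groups), but the function names and the sample sentence are assumptions made for this example, and a real compressor would pack the bits into bytes rather than keep them as a string of "0" and "1" characters.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a prefix-free bit string per symbol; frequent symbols get shorter codes."""
    freq = Counter(text)
    if len(freq) == 1:                        # edge case: only one distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)     # the two lightest groups...
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))  # ...become one heavier group
        tie += 1
    return heap[0][2]

def huffman_decode(bits, codes):
    """Read bits left to right, emitting a symbol whenever a full code is seen."""
    reverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:
            out.append(reverse[current])
            current = ""
    return "".join(out)

text = "beekeepers see the eerie tree"
codes = huffman_codes(text)
bits = "".join(codes[ch] for ch in text)
print(codes)
print(len(bits), "bits instead of", 8 * len(text))
print(huffman_decode(bits, codes) == text)
```

In this sample the letter "e" dominates, so it ends up with one of the shortest codes. The decoder needs the code table as well as the bits, which is why real formats store or transmit the table alongside the data.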
The DEFLATE algorithm used in ZIP files combines both approaches: LZ77 for patterns plus Huffman for the resulting symbols.
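You can try DEFLATE directly with Python's standard zlib module. Note that zlib wraps the DEFLATE stream in its own small header and checksum rather than in a ZIP archive, so this sketch shows the algorithm at work, not the ZIP file format; the sample text is an arbitrary choice.

```python
import zlib

# Repetitive sample text: 200 copies of the same line.
text = ("the quick brown fox jumps over the lazy dog\n" * 200).encode("utf-8")

compressed = zlib.compress(text, 9)    # 9 = strongest compression level
restored = zlib.decompress(compressed)

print(len(text), "bytes before,", len(compressed), "bytes after")
print(restored == text)                # lossless: identical byte for byte
```

The input here is deliberately repetitive, so it shrinks far more than ordinary prose would.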
Lossy compression
Lossy compression discards information that people are unlikely to notice. It is used mainly for images, audio, and video, where some quality loss is acceptable.
JPEG images use a clever trick. The encoder divides the image into 8x8 pixel blocks and applies a mathematical transform called the DCT (Discrete Cosine Transform) to each one. This converts the pixels into frequency components. The encoder then stores the high-frequency components, which human eyes barely notice, much more coarsely, and many of them round to zero and are dropped entirely.
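The transform-then-discard idea can be sketched on a single row of 8 pixels. This is a heavy simplification under stated assumptions: real JPEG applies a two-dimensional DCT to the whole 8x8 block and divides each coefficient by an entry in a quantization table, whereas this example uses a one-dimensional DCT and simply zeroes the upper half of the coefficients. The pixel values are made up.

```python
import math

N = 8

def _scale(k):
    # Orthonormal scaling so that idct exactly undoes dct.
    return math.sqrt((1 if k == 0 else 2) / N)

def dct(samples):
    """1-D DCT-II: express 8 samples as 8 frequency components."""
    return [_scale(k) * sum(x * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                            for n, x in enumerate(samples))
            for k in range(N)]

def idct(coeffs):
    """Inverse transform: rebuild samples from frequency components."""
    return [sum(_scale(k) * c * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for k, c in enumerate(coeffs))
            for n in range(N)]

row = [52, 55, 61, 66, 70, 61, 64, 73]  # hypothetical brightness values
coeffs = dct(row)
kept = coeffs[:4] + [0.0] * 4           # crude stand-in for quantization
approx = [round(v) for v in idct(kept)]

print([round(c, 1) for c in coeffs])
print(row)
print(approx)                           # close to the original, but not identical
```

Only four numbers survive instead of eight, and the reconstructed row still follows the general shape of the original. That trade, fewer numbers for a slightly wrong picture, is the essence of lossy compression.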
MP3 audio works similarly. It uses a model of human hearing to discard sounds that listeners are unlikely to notice, particularly quiet sounds that are masked by louder ones occurring at the same time.
Video compression goes further. Video codecs such as H.264 and H.265 exploit the fact that consecutive frames are often very similar. For most frames they store only the differences from neighboring frames, plus occasional “key frames” that serve as reference points.
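The "store only the differences" idea can be shown with two tiny one-dimensional frames. This is a toy sketch: real codecs predict the motion of whole blocks of pixels and compress the leftover differences lossily, and the names frame_delta and apply_delta and the pixel values are invented for this example.

```python
def frame_delta(prev, curr):
    """Record (index, new_value) only for the pixels that changed."""
    return [(i, c) for i, (p, c) in enumerate(zip(prev, curr)) if p != c]

def apply_delta(prev, delta):
    """Rebuild the next frame from the previous frame plus the recorded changes."""
    frame = list(prev)
    for i, value in delta:
        frame[i] = value
    return frame

# A bright two-pixel object moves one step to the right.
key_frame  = [10, 10, 10, 200, 200, 10, 10, 10]
next_frame = [10, 10, 10, 10, 200, 200, 10, 10]

delta = frame_delta(key_frame, next_frame)
print(delta)                                        # only two pixels changed
print(apply_delta(key_frame, delta) == next_frame)
```

If a delta is lost or corrupted, every later frame built on it is wrong too, which is one reason codecs insert fresh key frames at intervals.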
Common misconceptions
Many people imagine that zipping a file physically squeezes it, as if the data itself were being crushed. In reality the data is rewritten using shorthand. The file is smaller because the shorthand takes fewer bytes to store. When you unzip, the decompression algorithm expands the shorthand back into the original.
Another misconception is that compression always works. Compressing already-compressed data like MP3s, JPEGs, or ZIPs yields almost no reduction, and can even make the file slightly larger. These formats have already squeezed out the redundancy, so there are few patterns left to find.
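A quick experiment makes the point. The sketch below compresses some text with Python's zlib module and then compresses the result a second time; the word list and the fixed random seed are arbitrary choices made so that the example is repeatable.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so every run produces the same text
words = ["zip", "file", "pattern", "shorthand", "data", "byte", "lossless", "copy",
         "frame", "pixel", "sound", "code", "symbol", "repeat", "store", "size"]
original = " ".join(rng.choice(words) for _ in range(5000)).encode("utf-8")

once = zlib.compress(original, 9)   # first pass: plenty of patterns to exploit
twice = zlib.compress(once, 9)      # second pass: input already looks like noise

print("original:        ", len(original), "bytes")
print("compressed once: ", len(once), "bytes")
print("compressed twice:", len(twice), "bytes")
```

The first pass removes most of the redundancy. The second pass has almost nothing to work with, so its output stays about the same size as its input.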
Why it matters
When you zip a folder, you are using lossless compression. Every file inside will be reconstructed exactly. When you save an image as JPEG, you are using lossy compression.
The compression ratio you achieve depends heavily on the data. Plain text often compresses to half its original size or less. Already-compressed formats like MP3s barely shrink at all. This is why zipping already-compressed files saves almost no space, although it remains a convenient way to bundle them into one file.
Key terms
- Lossless compression: Keeps all original data, allows exact reconstruction
- Lossy compression: Discards imperceptible detail for smaller sizes
- DEFLATE: The algorithm used in ZIP files, combining LZ77 and Huffman coding
- Dictionary-based compression: Replaces repeated patterns with references
- Huffman coding: Assigns shorter codes to more common symbols