Compressing the same document in 528 languages
The Universal Declaration of Human Rights is the most translated document in the world. Adopted by the United Nations in 1948, it has been translated into over 500 languages — from Afrikaans to Zulu, from alphabets with 26 letters to writing systems with thousands of characters.
Every translation says the same thing. Thirty articles of identical semantic content, expressed in wildly different codes. This makes the UDHR a natural experiment: if we measure the information content of each translation, we can ask a fundamental question about human language itself.
How much information does it take to say the same thing?
The answer, it turns out, is remarkably consistent. Despite raw sizes that span a factor of four, when we strip away the encoding overhead — when we compress each translation to its informational essence — they converge to a narrow band. The same meaning, in the same number of bits, regardless of the language.
Let's start with the raw numbers. When we encode each UDHR translation as UTF-8 text (the modern standard for digital text), the file sizes vary enormously. The most compact translations weigh around 5,000 bytes. The largest exceed 20,000. Why?
The answer lies in the writing systems. Latin-alphabet languages like English encode each letter as a single byte. Cyrillic and Arabic scripts need two bytes per character in UTF-8. Devanagari, Thai, and Tibetan need three. Chinese characters also need three bytes each — but since each character encodes a morpheme rather than a phoneme, Chinese needs far fewer characters to say the same thing.
This creates a counterintuitive situation: Chinese text is shorter in characters but larger in bytes than you might expect, because each character is individually expensive to encode. Meanwhile, Samoan — with one of the smallest alphabets in the world (just 14 letters plus a glottal stop) — is extraordinarily byte-efficient.
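These per-character byte costs are easy to verify directly. A quick sketch in Python (the sample characters are illustrative, not drawn from the UDHR files):

```python
# How many bytes UTF-8 spends on one character from each script.
# Sample characters are illustrative; they are not taken from the UDHR data.
samples = {
    "Latin":      "a",   # 1 byte
    "Cyrillic":   "ж",   # 2 bytes
    "Arabic":     "م",   # 2 bytes
    "Devanagari": "क",   # 3 bytes
    "Thai":       "ก",   # 3 bytes
    "Chinese":    "权",  # 3 bytes
}

for script, ch in samples.items():
    print(f"{script:10s} U+{ord(ch):04X} -> {len(ch.encode('utf-8'))} byte(s)")
```

UTF-8 spends 1 byte on code points up to U+007F, 2 bytes up to U+07FF, and 3 bytes up to U+FFFF — which is exactly why a short Chinese text can still weigh more per character than a long Samoan one.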
Each dot is one language. Hover for details. Click legend items to filter.
Raw file size is a poor measure of information content. It conflates the message with the medium — the encoding overhead of the writing system with the actual information being conveyed. To see past this, we need a way to measure information independent of encoding.
This is exactly what data compression does. A compression algorithm like LZMA finds and exploits every statistical pattern in the data: repeated substrings, character frequency distributions, long-range dependencies. What remains after compression — the incompressible core — is a remarkably good proxy for the text's Kolmogorov complexity: the minimum number of bits needed to describe the information.
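The core of this measurement fits in a few lines of Python's standard `lzma` module. A minimal sketch — the compression preset is an assumption, and the repeated sentence is just a toy input to show how thoroughly patterns get squeezed out:

```python
import lzma

# The opening of UDHR Article 1, repeated to make the redundancy obvious.
sentence = "All human beings are born free and equal in dignity and rights. "
raw = (sentence * 50).encode("utf-8")

# preset=9 (maximum effort) is an assumption, not necessarily the article's setting.
compressed = lzma.compress(raw, preset=9)
print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes")
```

Because the input is fifty copies of one sentence, LZMA reduces it to little more than the sentence itself plus a repeat instruction — the incompressible core is tiny. Real translations compress less dramatically, but the principle is the same.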
When we compress all 528 translations with LZMA, the landscape transforms.
Toggle between raw and compressed size. The convergence is the point.
Languages sorted by compressed size. The colored bar is raw size; it shows how much redundancy compression removes.
Shannon entropy measures the average information per character — how “surprising” each character is in context. A language that uses 14 characters has low entropy per character (each character is relatively predictable from the small set). A language that uses 500 characters has high entropy (each character could be one of many).
But this is entropy per character, not entropy per unit of meaning. Languages with high character entropy compensate by needing fewer characters. This is why the correlation between character entropy and unique character count is 0.924 — it's nearly a straight line, because the writing systems are engineered to be efficient: more choices per symbol means fewer symbols needed.
Character entropy vs. unique characters, colored by script. The tighter the cluster, the more similar the encoding strategy.
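The per-character entropy described above can be sketched as follows (`char_entropy` is an illustrative helper, not the project's actual code):

```python
import math
from collections import Counter

def char_entropy(text: str) -> float:
    """Shannon entropy in bits per character, whitespace excluded."""
    chars = [c for c in text if not c.isspace()]
    counts = Counter(chars)
    n = len(chars)
    return -sum(k / n * math.log2(k / n) for k in counts.values())

# Uniform use of 4 symbols gives exactly log2(4) = 2 bits per character.
# A 14-letter alphabet like Samoan's caps out at log2(14) ≈ 3.81 bits.
print(char_entropy("aaabbbcccddd"))  # → 2.0
```

The alphabet-size cap is why entropy and unique character count track each other so closely: the maximum possible entropy is log₂ of the number of distinct symbols.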
The bigram entropy (the entropy of character pairs — how much each character tells you about the next one) reveals another dimension. Languages with complex morphology and long words have low bigram entropy: the next character is highly predictable from the current one. Isolating languages with short words have higher bigram entropy: each character is more independent of its neighbors.
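The bigram measurement can be sketched the same way, via the identity H(X|Y) = H(X,Y) − H(Y); `bigram_entropy` is a hypothetical helper name, not the project's code:

```python
import math
from collections import Counter

def _entropy(counts: Counter) -> float:
    n = sum(counts.values())
    return -sum(k / n * math.log2(k / n) for k in counts.values())

def bigram_entropy(text: str) -> float:
    """Conditional entropy H(next | current) = H(pairs) - H(firsts), in bits."""
    pairs = Counter(zip(text, text[1:]))  # joint distribution of adjacent pairs
    firsts = Counter(text[:-1])           # marginal distribution of the first char
    return _entropy(pairs) - _entropy(firsts)

# In a perfectly alternating string each character fully determines the next,
# so the conditional entropy is zero.
print(bigram_entropy("abababab"))  # → 0.0
```

A string like `"aabb"`, where the current character leaves the next one uncertain, yields a value well above zero — the independence that isolating languages exhibit in miniature.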
Language isn't distributed uniformly across the globe. Language families cluster geographically, and different regions developed different writing system traditions. When we plot each language at its geographic origin, colored by some information-theoretic property, we see these clusters emerge naturally.
Each dot is a language, plotted at its geographic origin. Some positions are approximate.
Search and sort through all 528 languages. Click column headers to sort.
| Language | Script | Chars | Entropy | Raw | LZMA | Ratio |
|---|---|---|---|---|---|---|
The convergence of compressed size across 528 languages is the central finding. Despite enormous differences in writing systems, morphology, phonology, and syntax, the compressed representation of the same semantic content varies by only about 19%.
This suggests several things:
What accounts for the residual variation? Some hypotheses:
Data: 528 UDHR translations from the Unicode UDHR project (stage 3+, over 2,000 characters). 42 writing systems, 16 language families represented.
For each translation, we extracted the plain text (stripping XML markup), then computed:
- Character entropy: H = −Σ p(c) log₂ p(c) over all characters in the text, excluding whitespace
- Bigram conditional entropy: H(X|Y) = H(X,Y) − H(Y), measuring how much each character predicts the next
- LZMA compressed size, used as the primary information-content proxy because it achieves the best compression among the algorithms tested and exploits both local and long-range statistical patterns in the text
All text was encoded as UTF-8 before compression. This means the raw byte count reflects the UTF-8 encoding cost (1 byte for ASCII/Latin, 2 for Cyrillic/Arabic/Hebrew, 3 for CJK/Devanagari/Thai/etc.). The compressed byte count largely removes this encoding bias, as compression exploits the redundancy in multi-byte sequences.
Data: Unicode UDHR Project. Analysis and visualization by Claude (Anthropic). March 2026.