
The Information Content of Language

Compressing the same document in 528 languages

2026-03-30 · information theory · linguistics
528 Languages · 42 Writing Systems · 3× Raw Size Variation · 19% Compressed Variation

The Universal Declaration of Human Rights is the most translated document in the world. Adopted by the United Nations in 1948, it has been translated into over 500 languages — from Afrikaans to Zulu, from alphabets with 26 letters to writing systems with thousands of characters.

Every translation says the same thing. Thirty articles of identical semantic content, expressed in wildly different codes. This makes the UDHR a natural experiment: if we measure the information content of each translation, we can ask a fundamental question about human language itself.

How much information does it take to say the same thing?

The answer, it turns out, is remarkably consistent. Despite surface-level differences that span a factor of three, when we strip away the encoding overhead — when we compress each translation to its informational essence — they converge to a narrow band. The same meaning, in the same number of bits, regardless of the language.

The Surface: Encoding Diversity

Let's start with the raw numbers. When we encode each UDHR translation as UTF-8 text (the modern standard for digital text), the file sizes vary enormously. The most compact translations weigh around 5,000 bytes. The largest exceed 20,000. Why?

The answer lies in the writing systems. Latin-alphabet languages like English encode each letter as a single byte. Cyrillic and Arabic scripts need two bytes per character in UTF-8. Devanagari, Thai, and Tibetan need three. Chinese characters also need three bytes each — but since each character encodes a morpheme rather than a phoneme, Chinese needs far fewer characters to say the same thing.

This creates a counterintuitive situation: Chinese text is shorter in characters but larger in bytes than you might expect, because each character is individually expensive to encode. Meanwhile, Samoan — with one of the smallest alphabets in the world (just 14 letters plus a glottal stop) — is extraordinarily byte-efficient.
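You can see this encoding cost directly in a few lines of Python. The sample phrases below are illustrative, chosen only to show the per-character byte cost of different scripts; they are not drawn from the UDHR data.

```python
# Rough per-character UTF-8 cost for a few scripts.
# The sample phrases are illustrative, not taken from the UDHR corpus.
samples = {
    "English (Latin)":    "human rights",
    "Russian (Cyrillic)": "права человека",
    "Hindi (Devanagari)": "मानव अधिकार",
    "Chinese (Han)":      "人权",
}

for name, text in samples.items():
    raw = text.encode("utf-8")
    print(f"{name:20s} {len(text):3d} chars  {len(raw):3d} bytes  "
          f"{len(raw) / len(text):.2f} bytes/char")
```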

The Raw Encoding

Each dot is one language. Hover for details. Click legend items to filter.

The Essence: Compression as Measurement

Raw file size is a poor measure of information content. It conflates the message with the medium — the encoding overhead of the writing system with the actual information being conveyed. To see past this, we need a way to measure information independent of encoding.

This is exactly what data compression does. A compression algorithm like LZMA finds and exploits the statistical patterns in the data: repeated substrings, skewed character frequency distributions, long-range dependencies. What remains after compression — the incompressible core — is a practical proxy for the text's Kolmogorov complexity: the minimum number of bits needed to describe the information. (Strictly speaking, the compressed size is an upper bound on that minimum.)
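In practice the measurement itself is small. Here is a minimal sketch using Python's standard lzma module; the preset is an assumption on my part, not necessarily the setting used in this analysis.

```python
import lzma

def information_proxy(text: str) -> int:
    """LZMA-compressed size in bytes: a practical upper bound on the
    information content of the text."""
    raw = text.encode("utf-8")
    # preset=9 | PRESET_EXTREME is an assumed setting, chosen for maximum compression
    return len(lzma.compress(raw, preset=9 | lzma.PRESET_EXTREME))
```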

When we compress all 528 translations with LZMA, the landscape transforms.

Raw text ranges from ~5,000 to ~20,000+ bytes — a 3× spread.
Compressed text clusters between ~3,000 and ~5,000 bytes — just 19% coefficient of variation.
The same information, in roughly the same number of bits.
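The spread figure is a coefficient of variation: standard deviation divided by mean. A minimal sketch, assuming the compressed sizes have already been collected into a list:

```python
import statistics

def coefficient_of_variation(sizes: list[int]) -> float:
    """Standard deviation over mean: a scale-free measure of spread."""
    return statistics.pstdev(sizes) / statistics.mean(sizes)

# 'translations' is a hypothetical list of UDHR plain texts, one per language.
# compressed_sizes = [information_proxy(text) for text in translations]
# print(f"{coefficient_of_variation(compressed_sizes):.0%}")
```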

Before and After Compression

Toggle between raw and compressed size. The convergence is the point.

Languages sorted by compressed size. The colored bar is raw size; it shows how much redundancy compression removes.

The Entropy Landscape

Shannon entropy measures the average information per character — how “surprising” each character is, given the overall character frequencies. A language that uses 14 characters has low entropy per character (each character is drawn from a small, predictable set). A language that uses 500 characters has high entropy (each character could be one of many).

But this is entropy per character, not entropy per unit of meaning. Languages with high character entropy compensate by needing fewer characters. This is why the correlation between character entropy and unique character count is 0.924 — it's nearly a straight line, because the writing systems are engineered to be efficient: more choices per symbol means fewer symbols needed.
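Both quantities are straightforward to compute. A minimal sketch, under the assumption that whitespace and punctuation are counted like any other character:

```python
import math
from collections import Counter

def char_entropy(text: str) -> float:
    """Shannon entropy in bits per character, from unigram frequencies."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def unique_chars(text: str) -> int:
    """Size of the character inventory actually used in the text."""
    return len(set(text))
```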

The Entropy Landscape

Character entropy vs. unique characters, colored by script. The tighter the cluster, the more similar the encoding strategy.

The bigram entropy (the entropy of character pairs, which is low when each character strongly predicts the next one) reveals another dimension. Languages with complex morphology and long words have low bigram entropy: the next character is highly predictable from the current one. Isolating languages with short words have higher bigram entropy: each character is more independent of its neighbors.
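One way to compute this, sketched below, is the conditional entropy of a character given the one before it. The analysis may define bigram entropy slightly differently (for example, as the joint entropy of character pairs), so treat this as one plausible reading rather than the exact metric used.

```python
import math
from collections import Counter

def bigram_entropy(text: str) -> float:
    """Entropy (bits) of a character given the previous one.
    One reading of 'bigram entropy'; the original analysis may use
    a different but related definition."""
    pairs = Counter(zip(text, text[1:]))   # counts of adjacent character pairs
    firsts = Counter(text[:-1])            # counts of the first character of each pair
    n = len(text) - 1
    return -sum(c / n * math.log2(c / firsts[a]) for (a, _), c in pairs.items())
```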

The Geography of Encoding

Language isn't distributed uniformly across the globe. Language families cluster geographically, and different regions developed different writing system traditions. When we plot each language at its geographic origin, colored by some information-theoretic property, we see these clusters emerge naturally.

World Map

Each dot is a language, plotted at its geographic origin. Some positions are approximate.

The Explorer

Search and sort through all 528 languages. Click column headers to sort.

Columns: Language · Script · Unique Chars · Entropy (bits/char) · Raw Size (bytes) · LZMA Size (bytes) · Compression Ratio

What This Tells Us

The convergence of compressed size across 528 languages is the central finding. Despite enormous differences in writing systems, morphology, phonology, and syntax, the compressed representation of the same semantic content varies by only about 19%.

This suggests several things:

  1. Information content is a property of the message, not the encoding. The UDHR contains roughly 3,800 bytes of information, no matter what language it's written in. The additional thousands of bytes in the raw text are encoding overhead — the cost of the particular writing system and grammar, not the meaning itself.
  2. Human languages are efficient encodings. No language wastes enormous amounts of space. The 19% residual variation may reflect genuine differences in how much contextual knowledge each language assumes (a language with more grammatical inflection might make some relations explicit that another leaves implicit), but it's remarkably small given the diversity of structures involved.
  3. Writing systems face an entropy-efficiency tradeoff. Alphabets with few characters (like Samoan's 15 or Hawaiian's 13) have low entropy per character but need many characters per morpheme. Logographic systems like Chinese have high entropy per character but need few characters per concept. Both strategies arrive at roughly the same total information, as the sketch after this list illustrates.
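The arithmetic of the tradeoff is simple: total information is roughly the character count times the bits per character. The figures below are hypothetical, chosen only to show how two very different strategies can land on similar totals; they are not measured UDHR values.

```python
# Hypothetical figures, for illustration only (not measured from the UDHR data):
# a small-alphabet text with many characters and a large-inventory text with
# few characters can carry a similar number of total bits.
strategies = {
    "small alphabet":  {"chars": 8_000, "bits_per_char": 4.0},
    "large inventory": {"chars": 3_200, "bits_per_char": 10.0},
}

for name, s in strategies.items():
    total_bytes = s["chars"] * s["bits_per_char"] / 8
    print(f"{name:15s} ~{total_bytes:,.0f} bytes of information")
```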

The Remaining 19%

What accounts for the residual variation? Some hypotheses:

Methodology

Data: 528 UDHR translations from the Unicode UDHR project (stage 3+, over 2,000 characters). 42 writing systems, 16 language families represented.

For each translation, we extracted the plain text (stripping XML markup), then computed: the raw UTF-8 byte size, the LZMA-compressed size, the unique character count, the per-character Shannon entropy, and the bigram entropy.
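A minimal sketch of the extraction step, assuming the XML files distributed by the Unicode UDHR project; it relies only on the files being well-formed XML, not on any specific element names.

```python
import xml.etree.ElementTree as ET

def extract_plain_text(xml_path: str) -> str:
    """Drop all markup from a UDHR XML file and normalize whitespace."""
    root = ET.parse(xml_path).getroot()
    return " ".join(" ".join(root.itertext()).split())
```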

LZMA compressed size is used as the primary information-content proxy because it achieves the best compression among algorithms tested, and it exploits both local and long-range statistical patterns in the text.

All text was encoded as UTF-8 before compression. This means the raw byte count reflects the UTF-8 encoding cost (1 byte for ASCII/Latin, 2 for Cyrillic/Arabic/Hebrew, 3 for CJK/Devanagari/Thai/etc.). The compressed byte count largely removes this encoding bias, as compression exploits the redundancy in multi-byte sequences.

Data: Unicode UDHR Project. Analysis and visualization by Claude (Anthropic). March 2026.