What is the Indus script?

The Indus Script

4,500 years. 1,916 inscriptions. Nobody can read them.
We ran the coupling tools. It’s not a language. It’s a barcode.

JIM’S OVERSIMPLIFICATION

The Indus Valley was the biggest civilization in the ancient world. 5 million people. Grid streets. Indoor plumbing. Standardized bricks. They traded with everyone from Mesopotamia to Central Asia. They left 1,916 inscriptions on stone seals — average length: 4 signs. Everyone’s been trying to “read” them like a language for 100 years. They can’t, and now we know why. You don’t read a QR code. You scan it. These are trade stamps. Business cards carved in stone. [TITLE] [NAME] [CITY]. Who made this, where it’s from, what’s in the box. That’s it. Four signs. Done. The reason nobody can crack the “language” is there’s no language to crack. You can’t determine someone’s native language from their LinkedIn profile.

K IN THIS DOMAIN

K = 0.30. Sequential coupling — how well the previous sign predicts the next. Higher than English text (0.05–0.15). Higher than the Voynich (0.17). Because labels are MORE formulaic than prose. [TITLE] [NAME] [CITY] uses the same structure every time. The coupling is in the FORMAT, not the content.

Why Nobody Can Read It

Average inscription: 4.4 signs. The longest known is 26; in the deduplicated corpus analyzed here the range is 2–17. Most are 3–5.

For comparison: this sentence is 7 words. You couldn’t determine what language it was written in if you only saw 4 characters of it. That’s the Indus problem. There’s not enough grammar in 4 signs to reconstruct a language.

The Voynich Manuscript has 37,919 words — enough to crack the grammar, the morphology, the recipe structure. The Indus script has 4-sign stamps. That’s a name tag, not a novel.

What the Numbers Say

It’s not random

Conditional entropy: 3.232 bits (shuffled baseline: 4.613). K = 0.30. Knowing the previous sign removes about 30% of the uncertainty about the next one. A random sequence would score 0%. This has real structure.
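A minimal sketch of how a K like this could be computed, assuming each inscription is a list of sign IDs in reading order. The function names, the toy corpus, and the shuffle scheme (pool every sign, then redeal into inscriptions of the original lengths) are assumptions for illustration, not the procedure behind the published numbers.

```python
import math
import random
from collections import Counter


def conditional_entropy(inscriptions):
    """H(next sign | previous sign) in bits, from adjacent sign pairs."""
    pair_counts = Counter()
    prev_counts = Counter()
    for signs in inscriptions:
        for prev, nxt in zip(signs, signs[1:]):
            pair_counts[(prev, nxt)] += 1
            prev_counts[prev] += 1
    total = sum(pair_counts.values())
    h = 0.0
    for (prev, _), n in pair_counts.items():
        p_pair = n / total                    # P(prev, next)
        p_next_given_prev = n / prev_counts[prev]
        h -= p_pair * math.log2(p_next_given_prev)
    return h


def shuffled_baseline(inscriptions, n_shuffles=200, seed=0):
    """Mean conditional entropy after redealing all signs at random,
    keeping every inscription's length (assumed shuffle scheme)."""
    rng = random.Random(seed)
    pool = [s for signs in inscriptions for s in signs]
    lengths = [len(signs) for signs in inscriptions]
    values = []
    for _ in range(n_shuffles):
        rng.shuffle(pool)
        it = iter(pool)
        shuffled = [[next(it) for _ in range(n)] for n in lengths]
        values.append(conditional_entropy(shuffled))
    return sum(values) / len(values)


def coupling_k(inscriptions, **kwargs):
    """K = 1 - H_observed / H_shuffled."""
    h_obs = conditional_entropy(inscriptions)
    h_null = shuffled_baseline(inscriptions, **kwargs)
    return 1 - h_obs / h_null


# Made-up formulaic corpus: [TITLE] [NAME] [CITY], repeated.
toy = [
    ["T1", "n1", "C1"], ["T1", "n2", "C1"], ["T2", "n3", "C2"],
    ["T2", "n1", "C2"], ["T1", "n3", "C1"], ["T2", "n2", "C2"],
] * 5
print(f"toy K = {coupling_k(toy, n_shuffles=50):.2f}")
```

On a small corpus the plug-in entropy estimate is biased low, so the shuffled baseline matters more than the raw number.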

Zipf slope: −1.492 (R² = 0.956). A clean power-law fit of the kind natural language shows, though steeper than the roughly −1 slope typical of word frequencies.
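For readers who want to see what a Zipf slope is, here is a sketch: an ordinary least-squares line through log10(frequency) against log10(rank). The fitting method used for the figure above is not stated, so this is an assumed approach, and the synthetic tokens are invented to follow a −1.5 power law.

```python
import math
from collections import Counter


def zipf_fit(tokens):
    """Least-squares fit of log10(frequency) on log10(rank).
    Returns (slope, r_squared)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log10(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log10(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, r_squared


# Synthetic sign stream: the sign at rank r occurs about 1000 / r**1.5 times.
synthetic = []
for rank in range(1, 51):
    synthetic.extend([f"S{rank}"] * round(1000 / rank ** 1.5))

slope, r2 = zipf_fit(synthetic)
print(f"slope = {slope:.2f}, R^2 = {r2:.3f}")
```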

It’s not a full language

Mean length 4.4 signs. Way too short for sentences. 67 signs account for 80% of all usage — that’s a tiny working vocabulary. 33% of signs appear only once (lower than English’s 40–50%) — MORE repetitive than natural text.

565 repeated 3-sign phrases across 1,916 inscriptions = 29% phrase coverage. Massively formulaic. Consistent with titles, trade marks, ritual phrases.
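A sketch of how repeated 3-sign phrases could be counted, assuming inscriptions are lists of sign IDs. The 29% above matches 565 / 1,916 (phrases per inscription); whether the original analysis defined coverage that way is an assumption, so the function also reports the share of inscriptions containing at least one repeated phrase. The toy corpus is made up.

```python
from collections import Counter


def trigrams(signs):
    """Set of distinct 3-sign sequences in one inscription."""
    return {tuple(signs[i:i + 3]) for i in range(len(signs) - 2)}


def repeated_trigrams(inscriptions):
    """3-sign sequences that occur in at least two inscriptions."""
    counts = Counter()
    for signs in inscriptions:
        counts.update(trigrams(signs))
    return {tri for tri, n in counts.items() if n >= 2}


def phrase_stats(inscriptions):
    repeated = repeated_trigrams(inscriptions)
    containing = sum(1 for signs in inscriptions if trigrams(signs) & repeated)
    return {
        "repeated_phrases": len(repeated),
        # Ratio that reproduces the 565 / 1,916 arithmetic.
        "phrases_per_inscription": len(repeated) / len(inscriptions),
        # Alternative reading of "coverage".
        "inscriptions_with_repeat": containing / len(inscriptions),
    }


toy_corpus = [
    ["A", "B", "C", "D"],
    ["X", "A", "B", "C"],
    ["P", "Q"],
    ["B", "C", "D"],
    ["M", "N", "O"],
]
print(phrase_stats(toy_corpus))
```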

It has position rules

7 terminal signs that strongly favor the end of inscriptions. 3 initial signs that strongly favor the beginning. 215 content signs in between. Fixed format: [OPENER] [CONTENT] [CLOSER].

Same structure as a business card, a barcode, a URL. Format markers at the edges, data in the middle.
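One way to sort signs into openers, closers, and content is to ask how often each sign sits in the first or last slot. The sketch below does that. The 0.9 threshold, the minimum count, and the class names are hypothetical choices, not the criteria behind the 3 / 215 / 7 split, and signs are assumed to be listed in reading order.

```python
from collections import Counter


def positional_classes(inscriptions, threshold=0.9, min_count=5):
    """Label each sign 'initial', 'terminal', 'content', or 'rare'
    from the share of its occurrences in the first or last position."""
    first = Counter()
    last = Counter()
    total = Counter()
    for signs in inscriptions:
        if len(signs) < 2:
            continue  # a single sign is both first and last; skip it
        total.update(signs)
        first[signs[0]] += 1
        last[signs[-1]] += 1
    classes = {}
    for sign, n in total.items():
        if n < min_count:
            classes[sign] = "rare"  # too few occurrences to judge
        elif first[sign] / n >= threshold:
            classes[sign] = "initial"
        elif last[sign] / n >= threshold:
            classes[sign] = "terminal"
        else:
            classes[sign] = "content"
    return classes


# Made-up seals following [OPENER] [CONTENT] [CLOSER].
toy_seals = [
    ["O", "a", "Z"],
    ["O", "b", "Z"],
    ["O", "a", "b", "Z"],
    ["O", "b", "a", "Z"],
    ["O", "a", "Z"],
]
print(positional_classes(toy_seals, min_count=3))
```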

The Barcode Hypothesis

A trading civilization of 5 million people across 1,500+ sites needs standardized identification. Who sent this shipment? What’s in it? Who guarantees quality? Where does it go if there’s a problem?

Each seal says:

[INITIAL SIGN] + [CONTENT SIGNS] + [TERMINAL SIGN]
= [TITLE/CLASS] + [NAME/GOODS] + [CLAN/CITY]

The 7 terminal signs might be city codes — Mohenjo-daro, Harappa, Lothal, Dholavira, etc. The 3 initial signs might be title prefixes — merchant, priest, official. The content signs in between = the specific identity or goods.

584 unique signs at 4.4 per inscription gives 584^4.4 ≈ 10¹² possible combinations. Enough to uniquely label every person in a city of millions. Which is exactly what you’d need.
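The arithmetic is easy to check. The sketch below treats 584^4.4 as an upper bound in which any sign can fill any slot, then counts sequences under the [OPENER] [CONTENT] [CLOSER] reading. That second count assumes exactly one opener, one closer, and content signs in every other slot, which is an illustration rather than a claim from the corpus.

```python
import math

SIGN_TYPES = 584     # unique sign types in the corpus
MEAN_LENGTH = 4.4    # mean signs per inscription
INITIAL, CONTENT, TERMINAL = 3, 215, 7   # class sizes from the position rules


def free_sequences(n_signs, length):
    """Ordered sequences if every sign could fill every slot."""
    return n_signs ** length


def fixed_format_sequences(length):
    """Sequences with one opener, one closer, content signs in between
    (assumed format; requires length >= 2)."""
    return INITIAL * CONTENT ** (length - 2) * TERMINAL


upper = free_sequences(SIGN_TYPES, MEAN_LENGTH)
print(f"584^4.4 is about 10^{math.log10(upper):.2f}")
for length in (3, 4, 5):
    print(length, "signs, fixed format:", fixed_format_sequences(length))
```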

Comparison

System	Length	K	Type
Indus script	4.4 signs	0.30	Labels/stamps
Voynich MS	5.1 chars/word	0.17	Recipe notation
English text	4.7 chars/word	0.05–0.15	Natural language
QR code	variable	~0.40	Structured data

Indus K is between natural language and structured data encoding. It’s more formulaic than prose but less rigid than a pure code. A label system with some flexibility — exactly what a trade stamp would be.

What Was Killed

Killed

× Simple substitution for a known language — 100 years of trying, no solution. Not enough grammar.

× Random/decorative symbols — K=0.30, Zipf-compliant. Real structure.

× Full literary language — 4.4 signs per inscription is too short for sentences.

Survived

✓ Structured labeling system (K=0.30, positional constraints, formulaic repetition)

✓ Fixed format: [INITIAL] + [CONTENT] + [TERMINAL] (3+215+7 sign classes)

✓ Trade/identity function (found on seals, consistent with Bronze Age commerce)

✓ Cross-site consistency (same signs at Mohenjo-daro, Harappa, Lothal, Dholavira)

Open

• What language did the Indus people SPEAK? Labels don’t contain enough grammar to determine this.

• What do specific signs mean? Without a bilingual text (a Rosetta Stone), individual sign values remain unknown.

• Are the 7 terminal signs city codes? Testable: do they correlate with excavation site?
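The city-code question can be framed as a contingency test: tabulate site against terminal sign and measure the association with Cramér's V, the same statistic used for positional rigidity in the technical notes. A sketch, with invented observations and hypothetical sign labels (t1, t2):

```python
import math
from collections import Counter


def cramers_v(pairs):
    """Cramér's V for a list of (site, terminal_sign) observations.
    0 means no association; 1 means each site has its own sign."""
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)
    rows = Counter(site for site, _ in pairs)
    cols = Counter(sign for _, sign in pairs)
    chi2 = 0.0
    for site, row_total in rows.items():
        for sign, col_total in cols.items():
            expected = row_total * col_total / n
            observed = joint.get((site, sign), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(rows), len(cols))
    if k < 2:
        return 0.0  # one site or one sign: association is undefined
    return math.sqrt(chi2 / (n * (k - 1)))


# Invented data: terminal sign tracks the site perfectly.
city_coded = [("Mohenjo-daro", "t1")] * 10 + [("Harappa", "t2")] * 10
# Invented data: terminal sign is unrelated to the site.
unrelated = [
    ("Mohenjo-daro", "t1"), ("Mohenjo-daro", "t2"),
    ("Harappa", "t1"), ("Harappa", "t2"),
] * 5

print(cramers_v(city_coded), cramers_v(unrelated))
```

A V near 1 on the real corpus would support city codes; a V near 0 would kill the idea. Small sites need care, since sparse cells inflate chi-square.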

You don’t read a QR code. You scan it.
The seal stamps the clay. The buyer checks the seal.
It says who made this and where.
Four signs. Done.

The biggest ancient civilization left the shortest messages.
Not because they had nothing to say.
Because they said it in person and stamped the receipt.

Good will applied forward.

K IN THIS DOMAIN

K = 1 - (H_observed / H_shuffled) = 1 - (3.232 / 4.613) = 0.2994. Conditional entropy from Rao et al. 2009 (Science) and 2026 synthetic baseline analysis. Measures sequential coupling between adjacent signs. Higher than natural language text, lower than rigid codes. Consistent with semi-structured labeling.
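The calculation can be reproduced from the two entropy figures. The sketch below also shifts the shuffled baseline by its reported ±0.015 bits to show how little K moves; reading that ± as one standard deviation is an assumption.

```python
H_OBSERVED = 3.232      # bits, conditional entropy of the corpus
H_SHUFFLED = 4.613      # bits, mean of the shuffled null
H_SHUFFLED_PM = 0.015   # bits, reported spread of the shuffled null


def coupling(h_observed, h_shuffled):
    """K = 1 - H_observed / H_shuffled."""
    return 1 - h_observed / h_shuffled


k = coupling(H_OBSERVED, H_SHUFFLED)
k_low = coupling(H_OBSERVED, H_SHUFFLED - H_SHUFFLED_PM)
k_high = coupling(H_OBSERVED, H_SHUFFLED + H_SHUFFLED_PM)
print(f"K = {k:.4f} (from {k_low:.4f} to {k_high:.4f})")
```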

1. Corpus Statistics

• 1,916 deduplicated inscriptions (2,511 raw, 595 exact duplicates removed)

• 584 unique sign types (ICIT G### coding)

• 11,110 total sign occurrences

• 52 archaeological sites

• Mean inscription length: 4.4 signs (σ = 2.0, range 2–17)

• 67 signs account for 80% of all usage (11.5% of sign types)

• Hapax fraction: 33.2% (194 of 584 signs appear only once)
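The coverage and hapax figures in this list can be computed from a flat list of sign occurrences. A sketch with assumed function names and a made-up token list:

```python
from collections import Counter


def signs_for_coverage(tokens, target=0.80):
    """Smallest number of most-frequent sign types whose combined
    occurrences reach `target` of all usage."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    total = sum(counts)
    running = 0
    for rank, n in enumerate(counts, start=1):
        running += n
        if running / total >= target:
            return rank
    return len(counts)


def hapax_fraction(tokens):
    """Share of sign types that occur exactly once."""
    counts = Counter(tokens)
    return sum(1 for n in counts.values() if n == 1) / len(counts)


toy_tokens = ["a"] * 6 + ["b"] * 2 + ["c", "d"]
print(signs_for_coverage(toy_tokens), hapax_fraction(toy_tokens))

# The reported hapax fraction follows from the counts above it.
print(f"{194 / 584:.1%}")
```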

2. Entropy Analysis

• Conditional entropy: 3.232 bits

• Shuffled null: 4.613 ± 0.015 bits

• Percentile vs shuffled: 0.000 (more constrained than all 1,000 permutations)

• Zipf slope: −1.492 (R² = 0.956)

• Block entropy scaling matches natural languages, not random or rigid sequences (Rao et al. 2009)

• Positional rigidity (Cramér’s V): 0.149 for top 10 frequent signs

3. Structural Classes

• 3 initial signs (beginning-position bias)

• 7 terminal signs (end-position bias)

• 215 content signs

• 12 bigram communities (Louvain clustering)

• 219 template families (minimum cluster size 2)

• 2,032 unique segmentation units

4. Cross-Site Consistency

Site	Inscriptions	Unique Signs	Mean Length
Mohenjo-daro	1,188	464	4.9
Harappa	957	333	3.9
Dholavira	74	116	4.2
Lothal	78	100	4.7

Sign usage is consistent across sites separated by 1,000+ km. A standardized system, not local invention.
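One simple way to quantify shared sign usage is the overlap coefficient between site inventories: the share of the smaller inventory that also appears in the larger one. This is an assumed measure, not necessarily the one behind the table, and the sign IDs below are invented in the G### style.

```python
def overlap_coefficient(a, b):
    """|A ∩ B| / min(|A|, |B|): tolerant of very unequal inventory sizes."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def pairwise_overlap(site_signs):
    """Overlap coefficient for every pair of sites."""
    sites = sorted(site_signs)
    return {
        (s, t): overlap_coefficient(site_signs[s], site_signs[t])
        for i, s in enumerate(sites)
        for t in sites[i + 1:]
    }


# Invented inventories for illustration only.
toy_sites = {
    "Lothal": {"G001", "G002", "G003"},
    "Harappa": {"G002", "G003", "G004", "G005"},
}
print(pairwise_overlap(toy_sites))
```

The overlap coefficient is used here instead of Jaccard because a 100-sign site can never score high on Jaccard against a 464-sign site, even if every one of its signs is shared.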

5. References

• Rao, Yadav, Vahia, Joglekar, Adhikari & Mahadevan (2009), “Entropic Evidence for Linguistic Structure in the Indus Script,” Science 324:1165

• 2026 synthetic baseline analysis, arXiv:2604.17828

• Mahadevan (1977), The Indus Script: Texts, Concordance and Tables

• ICIT: Interactive Corpus of Indus Texts (4,537 objects, 19,616 sign occurrences)

• GUMP tools: K/R/E/T framework applied to published entropy data
