Austronesian Vocabulary Analysis

What it is

The Austronesian Basic Vocabulary Database is a publicly available dataset recording how a fixed list of basic meanings (water, sun, person, etc.) is expressed in over two thousand related languages. Linguists use the overlap in cognates (words descended from a shared ancestral root) to reconstruct family trees. I'm not a linguist; I downloaded it because it's a clean, well-structured comparative dataset and I wanted to see what falls out of it visually.

There are two interactive views: a tree showing how the 35 selected languages cluster, and a network where you can scrub through individual words and watch the groupings shift.

How it works

The full dataset ships in CLDF (Cross-Linguistic Data Formats): every lexeme is tied to a Glottolog language code, and every gloss to a Concepticon concept set. I picked a 35-language, 18-concept slice for legibility and computed a cognate-overlap distance between every language pair. UPGMA clustering of that distance matrix gave the tree; cosine similarity over the same matrix gave the network.
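
The distance and clustering steps above can be sketched roughly as follows. This is a minimal, dependency-free illustration, not the project's actual pandas code. It assumes "cognate-overlap distance" means one minus the share of jointly attested concepts that fall in the same cognate set, which the write-up does not spell out. The language names, cognate labels, and function names are all hypothetical.

```python
from itertools import combinations

# Hypothetical toy slice: language -> concept -> cognate-set label.
# Names and labels are invented for illustration, not taken from the database.
COGNATES = {
    "lang_a": {"water": 1, "sun": 1, "person": 1, "eye": 1},
    "lang_b": {"water": 1, "sun": 1, "person": 1, "eye": 2},
    "lang_c": {"water": 2, "sun": 2, "person": 1, "eye": 3},
    "lang_d": {"water": 2, "sun": 2, "person": 2, "eye": 3},
}


def overlap_distance(a, b):
    """Assumed definition: 1 minus the share of jointly attested concepts
    for which both languages fall in the same cognate set."""
    shared = [concept for concept in a if concept in b]
    if not shared:
        return 1.0  # nothing to compare, so treat as maximally distant
    same = sum(1 for concept in shared if a[concept] == b[concept])
    return 1.0 - same / len(shared)


def distance_matrix(cognates):
    """Return sorted language names and distances keyed by index pair (i < j)."""
    names = sorted(cognates)
    dist = {
        (i, j): overlap_distance(cognates[names[i]], cognates[names[j]])
        for i, j in combinations(range(len(names)), 2)
    }
    return names, dist


def upgma(names, dist):
    """Average-linkage (UPGMA) clustering; returns the tree as nested tuples."""
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (tree, size)
    d = dict(dist)  # working copy; keys are always (smaller id, larger id)
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(d, key=d.get)  # closest pair of current clusters
        del d[(i, j)]
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        for k in clusters:
            # Size-weighted mean keeps every original language equally weighted.
            d_ik = d.pop((min(i, k), max(i, k)))
            d_jk = d.pop((min(j, k), max(j, k)))
            d[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    tree, _size = next(iter(clusters.values()))
    return tree


names, dist = distance_matrix(COGNATES)
tree = upgma(names, dist)
print(tree)
```

On the toy data this pairs lang_a with lang_b and lang_c with lang_d before joining the two pairs. Ties between equally close pairs are broken by insertion order here, so a real run over 35 languages may want an explicit tie-breaking rule.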

The cognate matrix view colours by ancestral root — same colour in a column means same root word across languages — and lets you toggle a geographic projection so the clusters land on a map of the Pacific. Most of the work was data plumbing in pandas; the visualisations are vanilla D3 in a static HTML wrapper.
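
The colouring rule for the cognate matrix can be sketched like this. It is an illustration only: the actual views do their colouring in D3, and the rows, palette, and function name below are hypothetical. The idea is that colours are assigned per column (per concept), so two cells share a colour only when they share a cognate set for that concept.

```python
# Hypothetical rows: (language, concept, cognate-set label), invented for illustration.
ROWS = [
    ("lang_a", "water", "w1"),
    ("lang_b", "water", "w1"),
    ("lang_c", "water", "w2"),
    ("lang_a", "sun", "s1"),
    ("lang_b", "sun", "s2"),
    ("lang_c", "sun", "s2"),
]

# Any list of colour strings works; these hex values are arbitrary.
PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"]


def colour_matrix(rows, palette):
    """Map each (language, concept) cell to a colour.

    Within one concept column, cells in the same cognate set get the same
    colour; palette indices restart for every column.
    """
    colours = {}
    seen = {}  # concept -> {cognate set -> palette index}
    for language, concept, cognate_set in rows:
        column = seen.setdefault(concept, {})
        # First time a cognate set appears in this column, give it the next index.
        index = column.setdefault(cognate_set, len(column))
        # Wrap around if a column has more cognate sets than palette entries.
        colours[(language, concept)] = palette[index % len(palette)]
    return colours


cells = colour_matrix(ROWS, PALETTE)
for (language, concept), colour in sorted(cells.items()):
    print(language, concept, colour)
```

Because indices restart in each column, a colour carries meaning only within its own column, which matches the "same colour in a column means same root" reading.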

Where it's at

Done as a side project. The science is the original authors'; my contribution is the slice and the two views. The 35-language subset is hand-picked for legibility — the full 2,084 varieties would be unreadable in one frame. I'm not actively expanding it.