Data · Linguistics · 2025
Austronesian Vocabulary Analysis
Side project. I downloaded a public dataset of basic vocabulary across 2,000+ Austronesian languages, cleaned a slice of it, and built two interactive views: a family tree and a word-by-word similarity network.
What it is
The Austronesian Basic Vocabulary Database is a public dataset recording how a fixed list of basic meanings (water, sun, person, etc.) is expressed in over two thousand related languages. Linguists use the overlap in cognates, words descended from a shared ancestral root, to reconstruct family trees. I'm not a linguist. I downloaded it because it's a clean, well-structured comparative dataset and I wanted to see what falls out of it visually.
Two interactive views: a tree showing how a 35-language subset clusters, and a network where you can scrub through individual words and watch the groupings shift.
How it works
The full dataset ships in CLDF (Cross-Linguistic Data Formats): every lexeme is tied to a Glottolog language code and every gloss to a Concepticon concept set. I picked a 35-language, 18-concept slice for legibility and computed a cognate-overlap distance between every pair of languages. UPGMA clustering on those distances gave the tree; cosine similarity over the same matrix gave the network.
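For a concrete picture of that pipeline, here is a minimal standard-library sketch, not the project's actual code. The language names, cognate-set IDs and function names are all made up for illustration. It assumes the distance is one minus the share of matching cognate sets over concepts attested in both languages, and that cosine similarity is taken over one-hot (concept, cognate set) vectors; the write-up doesn't pin down either detail. In practice a library routine such as SciPy's average-linkage clustering would do the UPGMA step.

```python
import math
from itertools import combinations

# Hypothetical toy data: cognate-set ID per concept for each language
# (None = no entry). Names and IDs are invented, not taken from the dataset.
COGNATES = {
    "lang_a": {"water": 1, "sun": 1, "person": 1, "eye": 1},
    "lang_b": {"water": 1, "sun": 1, "person": 2, "eye": 1},
    "lang_c": {"water": 2, "sun": 2, "person": 3, "eye": 1},
    "lang_d": {"water": 2, "sun": 2, "person": 3, "eye": None},
}


def overlap_distance(a, b):
    """1 minus the share of concepts (attested in both) in the same cognate set."""
    shared = [c for c in a if a[c] is not None and b.get(c) is not None]
    if not shared:
        return 1.0  # nothing to compare: treat as maximally distant
    same = sum(a[c] == b[c] for c in shared)
    return 1.0 - same / len(shared)


def cosine_similarity(a, b):
    """Cosine between one-hot (concept, cognate set) vectors, computed via sets."""
    fa = {(c, s) for c, s in a.items() if s is not None}
    fb = {(c, s) for c, s in b.items() if s is not None}
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / math.sqrt(len(fa) * len(fb))


def upgma(names, dist):
    """Average-linkage clustering; returns the tree as nested tuples."""
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (subtree, size)
    d = {frozenset((i, j)): dist[names[i], names[j]]
         for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        pair = min(d, key=d.get)  # closest pair of clusters
        i, j = sorted(pair)
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        for k in clusters:
            dik = d.pop(frozenset((i, k)))
            djk = d.pop(frozenset((j, k)))
            # Size-weighted mean equals the average over all leaf pairs
            d[frozenset((next_id, k))] = (ni * dik + nj * djk) / (ni + nj)
        del d[pair]
        clusters[next_id] = ((ti, tj), ni + nj)
        next_id += 1
    tree, _size = next(iter(clusters.values()))
    return tree


names = list(COGNATES)
dist = {(p, q): overlap_distance(COGNATES[p], COGNATES[q])
        for p, q in combinations(names, 2)}
tree = upgma(names, dist)
print(tree)
```

On this toy input the closest pair is lang_c and lang_d (distance 0.0), then lang_a and lang_b (0.25), so the printed tree is (('lang_c', 'lang_d'), ('lang_a', 'lang_b')).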
The cognate matrix view colours each cell by ancestral root, so the same colour within a column means the same root word across languages, and it lets you toggle a geographic projection so the clusters land on a map of the Pacific. Most of the work was data plumbing in pandas; the visualisations are vanilla D3 in a static HTML wrapper.
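The column-wise colouring could be prepared roughly like this before the data reaches D3. This is a hedged sketch in plain Python rather than the pandas code the project used. The rows, the palette and the colour_grid helper are hypothetical, and it assumes the input is already in long form as (language, concept, cognate set) triples.

```python
from collections import defaultdict

# Hypothetical long-form rows: (language, concept, cognate-set ID).
ROWS = [
    ("lang_a", "water", "w1"), ("lang_b", "water", "w1"), ("lang_c", "water", "w2"),
    ("lang_a", "sun", "s1"), ("lang_b", "sun", "s2"), ("lang_c", "sun", "s2"),
]
PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c"]  # arbitrary example colours


def colour_grid(rows, palette):
    """Return {language: {concept: colour}}, with colours assigned per concept column."""
    colour_of = defaultdict(dict)  # concept -> {cognate set -> colour}
    grid = defaultdict(dict)       # language -> {concept -> colour}
    for language, concept, cogset in rows:
        seen = colour_of[concept]
        if cogset not in seen:
            # Next palette entry for this column; wraps if the palette runs out
            seen[cogset] = palette[len(seen) % len(palette)]
        grid[language][concept] = seen[cogset]
    return {language: dict(cols) for language, cols in grid.items()}


grid = colour_grid(ROWS, PALETTE)
for language, cols in grid.items():
    print(language, cols)
```

Because colours are assigned independently per column, a colour only carries meaning within its own concept: the first palette entry under "water" and under "sun" marks two unrelated roots.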
Where it's at
Done as a side project. The science is the original authors'; my contribution is the slice and the two views. The 35-language subset is hand-picked for legibility — the full 2,084 varieties would be unreadable in one frame. I'm not actively expanding it.