Data · Linguistics · 2025

Austronesian Vocabulary Analysis

Side project. I downloaded a public dataset of basic vocabulary across 2,000+ Austronesian languages, cleaned a slice of it, and built two interactive views: a family tree and a word-by-word similarity network.

Python · CLDF · pandas
Lexemes: 346K
Language varieties: 2,084
Subset analysed: 35 languages
Concepts: 18

What it is

The Austronesian Basic Vocabulary Database is a public dataset recording how a fixed list of basic concepts (water, sun, person, etc.) is expressed in over two thousand related languages. Linguists use the overlap in cognates, words that descend from a shared ancestral root, to reconstruct family trees. I'm not a linguist; I downloaded it because it's a clean, well-structured comparative dataset and I wanted to see what falls out of it visually.

The result is two interactive views: a tree showing how the 35 languages cluster, and a network where you can scrub through individual words and watch the groupings shift.

How it works

The full dataset ships in CLDF (Cross-Linguistic Data Formats): every lexeme is tied to a Glottolog language code and every gloss to a Concepticon concept set. I picked a 35-language, 18-concept slice for legibility and computed a cognate-overlap distance between every pair of languages. UPGMA clustering on those distances gave the tree; cosine similarity over the same matrix gave the network.
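To make the distance-then-cluster step concrete, here is a minimal sketch in plain Python. It is not the project's code. The language names and cognate-set labels are made up, and the distance is assumed to be one minus the share of jointly attested concepts that fall in the same cognate set, which is one common way to define cognate overlap. UPGMA is written out by hand so the block has no dependencies; SciPy's `scipy.cluster.hierarchy.linkage` with `method="average"` applies the same average-linkage rule.

```python
from itertools import combinations

# Hypothetical toy data: language -> {concept: cognate-set label}.
# The real slice has 35 languages and 18 concepts; these labels are invented.
COGNATES = {
    "lang_a": {"water": "w1", "sun": "s1", "person": "p1"},
    "lang_b": {"water": "w1", "sun": "s1", "person": "p2"},
    "lang_c": {"water": "w2", "sun": "s2", "person": "p2"},
    "lang_d": {"water": "w2", "sun": "s2", "person": "p3"},
}


def overlap_distance(a, b):
    """1 minus the share of jointly attested concepts with the same cognate set."""
    shared = [concept for concept in a if concept in b]
    if not shared:
        return 1.0  # assumption: nothing to compare means maximally distant
    same = sum(a[concept] == b[concept] for concept in shared)
    return 1.0 - same / len(shared)


def upgma(names, dist):
    """Average-linkage (UPGMA) clustering.

    Returns the tree as nested tuples plus a log of (members, height) merges.
    """
    clusters = [(name, [name]) for name in names]  # (subtree, leaf members)
    merges = []

    def avg(m1, m2):
        # UPGMA distance between clusters: mean of all leaf-to-leaf distances.
        total = sum(dist[frozenset((x, y))] for x in m1 for y in m2)
        return total / (len(m1) * len(m2))

    while len(clusters) > 1:
        # Pick the closest pair of clusters (first one wins on ties).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: avg(clusters[p[0]][1], clusters[p[1]][1]),
        )
        (t1, m1), (t2, m2) = clusters[i], clusters[j]
        merges.append((sorted(m1 + m2), avg(m1, m2)))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(((t1, t2), m1 + m2))
    return clusters[0][0], merges


# Pairwise distances keyed by unordered language pair.
dist = {
    frozenset((a, b)): overlap_distance(COGNATES[a], COGNATES[b])
    for a, b in combinations(COGNATES, 2)
}
tree, merges = upgma(list(COGNATES), dist)
print(tree)
```

On this toy input the two closest pairs merge first, so the tree comes out as `(('lang_a', 'lang_b'), ('lang_c', 'lang_d'))`. The merge heights in `merges` are what a proportional-branch rendering would draw from.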

The cognate matrix view colours by ancestral root — same colour in a column means same root word across languages — and lets you toggle a geographic projection so the clusters land on a map of the Pacific. Most of the work was data plumbing in pandas; the visualisations are vanilla D3 in a static HTML wrapper.
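The colouring rule for the cognate matrix can be sketched in a few lines. This is an illustration under assumptions, not the code behind the view: the rows, labels, and palette are hypothetical, and the reshaping is done with plain dictionaries (in pandas the same step would be a `DataFrame.pivot` from long to wide format). Colours are assigned per column, so a colour only means "same root" within one concept.

```python
# Hypothetical long-format rows: (language, concept, cognate-set label).
ROWS = [
    ("lang_a", "water", "w1"),
    ("lang_a", "sun", "s1"),
    ("lang_b", "water", "w1"),
    ("lang_b", "sun", "s2"),
    ("lang_c", "water", "w2"),  # "sun" is missing for lang_c
]

# Arbitrary palette, chosen for the example only.
PALETTE = ["#1b9e77", "#d95f02", "#7570b3", "#e7298a"]


def cognate_matrix(rows):
    """Reshape long rows into a language x concept grid of cognate labels."""
    languages = sorted({lang for lang, _, _ in rows})
    concepts = sorted({concept for _, concept, _ in rows})
    lookup = {(lang, concept): label for lang, concept, label in rows}
    # Missing (language, concept) cells become None.
    grid = [[lookup.get((lang, c)) for c in concepts] for lang in languages]
    return languages, concepts, grid


def column_colours(grid, palette):
    """Within each column, give every distinct cognate set its own colour."""
    coloured = [[None] * len(row) for row in grid]
    for j in range(len(grid[0])):
        seen = {}  # cognate label -> colour, reset for every concept
        for i, row in enumerate(grid):
            label = row[j]
            if label is None:
                continue  # leave gaps uncoloured
            if label not in seen:
                # Palette wraps around if a column has more sets than colours.
                seen[label] = palette[len(seen) % len(palette)]
            coloured[i][j] = seen[label]
    return coloured


languages, concepts, grid = cognate_matrix(ROWS)
colours = column_colours(grid, PALETTE)
for lang, row in zip(languages, colours):
    print(lang, row)
```

Resetting the label-to-colour map for every column keeps the palette small, at the cost that the same colour in two different columns means nothing.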

Fig. 1. UPGMA clustering of 35 Austronesian languages from cognate overlap across 18 basic vocabulary items. Toggle proportional vs equal branches. Drag to pan, scroll to zoom.
Fig. 2. Similarity network with adjustable edge threshold, plus the cognate matrix. Same colour in a column = same ancestral root. The Geographic View toggle places the clusters on a map of the Pacific.
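The adjustable edge threshold in Fig. 2 amounts to filtering a list of pairwise similarities. The sketch below shows one way that could work, assuming each language is represented as a binary vector over cognate sets (here written as a Python set). The profiles are invented and the function names are hypothetical. For binary vectors, cosine similarity reduces to the size of the intersection divided by the square root of the product of the set sizes.

```python
import math
from itertools import combinations

# Hypothetical binary profiles: language -> cognate sets it attests.
PROFILES = {
    "lang_a": {"w1", "s1", "p1"},
    "lang_b": {"w1", "s1", "p2"},
    "lang_c": {"w2", "s2", "p2"},
}


def cosine(a, b):
    """Cosine similarity of two binary vectors, expressed with sets."""
    if not a or not b:
        return 0.0  # an empty profile has no direction to compare
    return len(a & b) / math.sqrt(len(a) * len(b))


def edges(profiles, threshold):
    """Keep only language pairs whose similarity reaches the threshold."""
    kept = []
    for x, y in combinations(sorted(profiles), 2):
        score = cosine(profiles[x], profiles[y])
        if score >= threshold:
            kept.append((x, y, round(score, 3)))
    return kept


print(edges(PROFILES, 0.5))  # strict threshold: fewer edges
print(edges(PROFILES, 0.3))  # looser threshold: more edges
```

With these toy profiles a threshold of 0.5 keeps only the `lang_a`/`lang_b` edge, while 0.3 also admits `lang_b`/`lang_c`. Scrubbing word by word would correspond to rebuilding the profiles from a single concept and re-running the filter.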

Where it's at

Done as a side project. The science is the original authors'; my contribution is the slice and the two views. The 35-language subset is hand-picked for legibility — the full 2,084 varieties would be unreadable in one frame. I'm not actively expanding it.