UNICA IRIS Institutional Research Information System

Curating global datasets of structural linguistic features for independence

Graff, Anna; Chousou-Polydouri, Natalia; Inman, David; Skirgård, Hedvig; Lischka, Marc; Zakharko, Taras; Barbieri, Chiara (Conceptualization); Bickel, Balthasar

2025-01-01

Abstract

The increasing availability of cross-linguistic databases documenting morphosyntactic, lexical and phonological features has led to a proliferation of studies that use such data to investigate language evolution and human history. However, most of these databases were not designed to ensure independence of features, so it is not valid to jointly use all their features in large-scale statistical analyses that assume independent inputs. Here, we curate published data from five large linguistic databases to generate two global-scale cross-linguistic datasets: GBI (from the Grambank dataset) and TLI (using inputs from the World Atlas of Language Structures, AUTOTYP, PHOIBLE and Lexibank). The datasets minimize logical dependencies between features, as well as forms of strong statistical dependency that go beyond phylogenetic and geographical signal. They are also made available in densified form, which reduces the proportion of missing data. We document our curation principles and workflows so that the framework can be reused with other inputs or thresholds of independence. Our curation steps on both datasets reveal robust and comparable global patterns of structural linguistic diversity.

Year of publication

2025

Type:

1.1 Journal article

Files in this item:

File	Size	Format
GraffScientificData2025.pdf (open access; description: main article; type: publisher's version, VoR)	4.46 MB	Adobe PDF

The metadata in IRIS UNICA are released under the Creative Commons CC0 1.0 Universal license, while the publication files are protected by copyright unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11584/434625
