Introducing Universal Dependencies for Sardinian: the UD ContSar Treebank

Nicoletta Puddu;Manuela Sanguinetti;Luigi Talamo
2026-01-01

Abstract

This paper presents the first steps towards a new resource for contemporary Sardinian within the Universal Dependencies (UD) framework. Sardinian is a Romance language spoken in Sardinia, an island of the Italian Republic located in the center of the western Mediterranean. It is an endangered minority language, traditionally transmitted mainly orally, and characterized by a multiplicity of varieties (usually grouped into two macro-varieties, Logudorese and Campidanese), all recognized as part of the Sardinian linguistic continuum. These varieties share basic morphosyntactic features but differ at the lexical level and in the realization of specific constructions. This internal variation can make the normalization of lemmas and the linguistic characterization of certain phenomena particularly challenging. The treebank therefore aims to provide an annotated resource for contemporary Sardinian that takes into account the specificities of the different varieties, using Universal Dependencies to represent them within a unified theoretical framework and thereby facilitating both linguistic analysis and automatic processing. The paper describes some linguistic characteristics of Sardinian and our attempts to encode them within the UD framework. Finally, we present the results of our evaluation of an NLP pipeline for Sardinian, trained on our corpus with the Stanford Stanza parser.
Year: 2026
ISBN: 978-2-493814-62-3
Keywords: Universal Dependencies; Syntax; Sardinian; Treebanks
Use this identifier to cite or link to this document: https://hdl.handle.net/11584/484965