UNICA IRIS Institutional Research Information System

LLIMONIIE: Large Language Instructed Model for Open Named Italian Information Extraction

Piano L.;Pisu A.;Tiddia S. G.;Carta S.;Giuliani A.;Pompianu L.

2025-01-01

Abstract

The exponential growth of unstructured documents generated daily underscores the urgent need for technologies that structure information effectively. Traditional Information Extraction (IE) models transform textual data into structured formats (e.g., semantic triplets), facilitating efficient search and uncovering hidden insights in the data. However, they require predefined ontologies and, often, extensive human effort. Open IE tools, on the other hand, extract information without any input knowledge, but they are limited in their ability to capture complete, in-depth context. Furthermore, the state of the art shows a substantial gap between the effort devoted to English-centric methods and that devoted to low-resource languages, such as Italian. Our study aims to address these key challenges. To this end, we first define Open Named Information Extraction (ONIE), an approach that generalizes IE across diverse domains without requiring input ontologies and captures complex relationships. Then, we develop LLIMONIIE (Large Language Instructed Model for Open Named Italian Information Extraction), a novel end-to-end generative information extraction framework that leverages the capabilities of Large Language Models (LLMs) to perform ONIE on Italian documents, extracting Named Entities and Open Relations in a uniform way. Furthermore, we devise an innovative dataset generation methodology to support our research. Finally, we release the code and dataset, contributing to the scientific community and supporting the development of low-resource languages. Experiments demonstrate the potential of our proposal, which achieves competitive results compared with the current state of the art in Italian IE.
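
To make the idea of semantic triplets with Named Entities and Open Relations concrete, the following is a minimal sketch of how such output could be represented. It is an illustration only, not the LLIMONIIE output format: the class names `Entity` and `Triplet`, the helper `to_tuples`, the labels `PERSON` and `LOCATION`, and the example sentence are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str   # surface form as it appears in the document
    label: str  # named-entity type (hypothetical label set)


@dataclass(frozen=True)
class Triplet:
    subject: Entity
    relation: str  # free-text (open) relation, not drawn from a fixed ontology
    obj: Entity


def to_tuples(triplets):
    """Flatten triplets into plain tuples, e.g. for indexing or search."""
    return [
        (t.subject.text, t.subject.label, t.relation, t.obj.text, t.obj.label)
        for t in triplets
    ]


# Hand-written illustration for the Italian sentence
# "Dante Alighieri nacque a Firenze." (not model output).
example = [
    Triplet(
        subject=Entity("Dante Alighieri", "PERSON"),
        relation="nacque a",
        obj=Entity("Firenze", "LOCATION"),
    )
]

rows = to_tuples(example)
```

In this sketch the relation is kept as free text while both arguments carry an entity type, which is one plausible way to read "extracting Named Entities and Open Relations in a uniform way"; the actual schema should be taken from the released code and dataset.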

Year of publication: 2025

Keywords: Information extraction; Low resource language; OIE; Large language models; Instruction tuning

Type: 1.1 Journal article

Files in this item:

File	Size	Format
LLIMONIIE Large Language Instructed Model for Open Named Italian Information Extraction.pdf (open access; type: publisher's version, VoR)	1.92 MB	Adobe PDF

The metadata in IRIS UNICA are released under the Creative Commons CC0 1.0 Universal license, while the publication files are protected by copyright unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11584/457367
