LLIMONIIE: Large Language Instructed Model for Open Named Italian Information Extraction
Piano L.; Pisu A.; Tiddia S. G.; Carta S.; Giuliani A.; Pompianu L.
2025-01-01
Abstract
The exponential growth of unstructured documents generated daily underscores the urgent need for technologies that structure information effectively. Traditional Information Extraction (IE) models transform textual data into structured formats (e.g., semantic triplets), facilitating efficient search and uncovering hidden insights in the data. However, they require predefined ontologies and, often, extensive human effort. Open IE tools, on the other hand, extract information without any input knowledge, but they struggle to capture complete, in-depth context. Furthermore, the state of the art shows a substantial gap between the effort devoted to English-centric methods and that devoted to low-resource languages, such as Italian. Our study aims to address these challenges. To this end, we first define Open Named Information Extraction (ONIE), an approach that generalizes IE across diverse domains without requiring input ontologies and that captures complex relationships. Then, we develop LLIMONIIE (Large Language Instructed Model for Open Named Italian Information Extraction), a novel end-to-end generative information extraction framework that leverages the capabilities of Large Language Models (LLMs) to perform ONIE on Italian documents, extracting Named Entities and Open Relations in a uniform way. Furthermore, we devise an innovative dataset generation methodology to support our research. Finally, we release the code and dataset, contributing to the scientific community and to the development of resources for low-resource languages. Experiments demonstrate the potential of our proposal, which achieves competitive results compared with the current state of the art in Italian IE.

| File | Size | Format |
|---|---|---|
| LLIMONIIE Large Language Instructed Model for Open Named Italian Information Extraction.pdf (open access; type: publisher's version, VoR) | 1.92 MB | Adobe PDF |


