UNICA IRIS Institutional Research Information System

Lightweight Semantic Similarity Search of Data Products

D'Ambrosio A.; Platter P.; Salis M.; Simbola F.; Reforgiato Recupero D.; Riboni D.

2025-01-01

Abstract

The increasing popularity of digital services provided by large enterprises has led to a proliferation of data products, which raises several data management issues. Among them, recognizing overlapping data products is a major challenge, since data product duplication can lead to inefficiencies, unnecessary overhead, and customer confusion. Previous work has proposed various methods to search for joinability or semantic similarity in relational database systems. Other recent work has explored the use of Large Language Models (LLMs) for semantic similarity search in Big Data repositories. In this work, we propose a novel, lightweight technique for finding possible duplicates in data product catalogs, which exploits pre-trained transformer models and the Hungarian algorithm. Compared with previous work, our method does not incur the high computational overhead of LLMs and is flexible enough to also detect partial duplication. Moreover, because it is based on metadata analysis, our system is also applicable to prospective or new data products that have not yet been instantiated with data. We conducted an experimental evaluation on a set of data products, measuring both the accuracy of similarity detection against a gold standard and the computational cost. The experimental results support the effectiveness and efficiency of our solution.
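The abstract names its two building blocks (embeddings from pre-trained transformer models and the Hungarian algorithm) but does not spell out how they are combined. The following is only a minimal sketch of one plausible combination, not the authors' implementation: field-level metadata of two data products is compared by cosine similarity, and the Hungarian algorithm selects the best one-to-one pairing of fields. The toy 3-dimensional vectors stand in for transformer embeddings, and the function names, the field names, the 0.8 threshold, and the overlap score are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def hungarian(cost):
    """Minimum-cost assignment for an n x m cost matrix with n <= m.
    Returns a list whose entry i is the column assigned to row i."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)       # row potentials (1-based)
    v = [0.0] * (m + 1)       # column potentials (1-based)
    p = [0] * (m + 1)         # p[j] = row matched to column j, 0 if free
    way = [0] * (m + 1)       # predecessor column on the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:             # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j]:
            assignment[p[j] - 1] = j - 1
    return assignment

def product_similarity(fields_a, fields_b, threshold=0.8):
    """Pair fields one-to-one; return the accepted pairs and an overlap score.
    The threshold and the score definition are illustrative assumptions."""
    if len(fields_a) > len(fields_b):       # hungarian() expects rows <= columns
        fields_a, fields_b = fields_b, fields_a
    names_a, names_b = list(fields_a), list(fields_b)
    sim = [[cosine(fields_a[a], fields_b[b]) for b in names_b] for a in names_a]
    cost = [[1.0 - s for s in row] for row in sim]   # maximise similarity
    cols = hungarian(cost)
    matches = [(names_a[i], names_b[j], sim[i][j])
               for i, j in enumerate(cols) if sim[i][j] >= threshold]
    return matches, len(matches) / len(names_b)

# Toy stand-ins for embeddings of field metadata (hypothetical values).
customers = {"customer_id": [1.0, 0.0, 0.0],
             "email": [0.0, 1.0, 0.0],
             "signup_date": [0.0, 0.0, 1.0]}
clients = {"client_identifier": [0.9, 0.1, 0.0],
           "mail_address": [0.1, 0.9, 0.0],
           "region": [0.3, 0.3, 0.3]}

matches, score = product_similarity(customers, clients)
```

In this toy case two of the three fields are paired above the threshold, so the score is 2/3, which illustrates how a partial overlap could be reported rather than an all-or-nothing verdict. In practice the vectors would come from a pre-trained embedding model, and a library routine such as `scipy.optimize.linear_sum_assignment` solves the same assignment problem.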

Year of publication: 2025

Keywords: data mesh; Data product duplication; metadata analysis; semantic similarity; transformer models

Type: 1.1 Journal article

Files in this product:

File	Size	Format
Lightweight_Semantic_Similarity_Search_of_Data_Products.pdf (open access; type: publisher's version, VoR)	1.82 MB	Adobe PDF

The metadata in IRIS UNICA are released under the Creative Commons CC0 1.0 Universal license, while the publication files are protected by copyright unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11584/480185
