Lightweight Semantic Similarity Search of Data Products

D'Ambrosio A.; Reforgiato Recupero D.; Riboni D.
2025-01-01

Abstract

The growing popularity of digital services offered by large enterprises has led to a proliferation of data products, which raises several data management issues. Among them, recognizing overlapping data products is a major challenge, since data product duplication can lead to inefficiencies, unnecessary overhead, and customer confusion. Previous work has proposed various methods to search for joinability or semantic similarity in relational database systems. Other recent work has explored the use of Large Language Models (LLMs) for semantic similarity search in Big Data repositories. In this work, we propose a novel, lightweight technique to find possible duplications in data product catalogs, exploiting pre-trained transformer models and the Hungarian algorithm. Compared with previous work, our method does not incur the high computational overhead of LLMs and is flexible enough to also detect partial duplication. Moreover, because it is based on metadata analysis, our system is also applicable to prospective or new data products that have not yet been instantiated with data. We conducted an experimental evaluation on a set of data products, measuring the accuracy of similarity detection against a gold standard as well as the computational cost. Experimental results support the effectiveness and efficiency of our solution.
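
The abstract names the two ingredients (embeddings of metadata and the Hungarian algorithm) without giving details, so the following is only a minimal sketch of how such a combination could work, not the authors' implementation. It assumes each data product is described by a list of attribute names, and it replaces the pre-trained transformer with a stand-in embedding based on character trigrams so that the example runs without downloading a model. The function names (`embed`, `cosine`, `hungarian`, `match_attributes`), the threshold, the scoring rule, and the sample attribute names are all hypothetical.

```python
from collections import Counter
from math import sqrt


def embed(text):
    """Stand-in embedding: character-trigram counts (NOT a transformer model)."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(x * x for x in a.values()))
    norm_b = sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hungarian(cost):
    """Minimum-cost assignment for an n x m cost matrix with n <= m.

    Returns a sorted list of (row, col) pairs, one per row.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u, v = [0.0] * (n + 1), [0.0] * (m + 1)  # row and column potentials
    p = [0] * (m + 1)    # p[j]: 1-based row assigned to column j, 0 if free
    way = [0] * (m + 1)  # previous column on the augmenting path
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # flip the assignments along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return sorted((p[j] - 1, j - 1) for j in range(1, m + 1) if p[j])


def match_attributes(attrs_a, attrs_b, threshold=0.5):
    """Match attributes one-to-one and score the overlap (hypothetical rule).

    The score is the summed similarity of accepted pairs divided by the size
    of the smaller attribute list, so a product fully contained in a larger
    one can still score high (partial duplication).
    """
    if not attrs_a or not attrs_b:
        return [], 0.0
    swapped = len(attrs_a) > len(attrs_b)
    rows, cols = (attrs_b, attrs_a) if swapped else (attrs_a, attrs_b)
    sim = [[cosine(embed(r), embed(c)) for c in cols] for r in rows]
    cost = [[1.0 - s for s in row] for row in sim]  # maximize similarity
    pairs = []
    for i, j in hungarian(cost):
        if sim[i][j] >= threshold:
            a, b = (cols[j], rows[i]) if swapped else (rows[i], cols[j])
            pairs.append((a, b, sim[i][j]))
    return pairs, sum(s for _, _, s in pairs) / len(rows)


# Hypothetical metadata for two data products.
product_a = ["customer_id", "customer_name", "order_date"]
product_b = ["order date", "client name", "customer id", "shipping_cost"]
pairs, score = match_attributes(product_a, product_b)
for a, b, s in pairs:
    print(f"{a!r} <-> {b!r}  similarity={s:.2f}")
print(f"overlap score = {score:.2f}")
```

With this lexical stand-in, `customer_name` and `client name` are not matched, which is the kind of case a semantic embedding from a pre-trained transformer is meant to handle. In practice, `scipy.optimize.linear_sum_assignment` solves the same assignment problem and could replace the hand-written `hungarian` function.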
2025
data mesh
Data product duplication
metadata analysis
semantic similarity
transformer models

Use this identifier to cite or link to this document: https://hdl.handle.net/11584/480185