PROCSIMA: probability distribution clustering using similarity matrix analysis

Ortu, Marco
2024-01-01

Abstract

This study presents PROCSIMA, a methodological approach to document clustering that defines a similarity metric, derived from the Jensen-Shannon divergence, for measuring the similarity between topic probability distributions obtained from topic modeling techniques such as Latent Dirichlet Allocation (LDA). Unlike conventional approaches that assign each document to its single most pertinent topic, PROCSIMA clusters documents by considering their full topic distribution. It transforms the similarity matrix into an adjacency matrix and then applies community detection algorithms to define the document clusters. PROCSIMA is validated empirically on both synthetic and real-world datasets, bootstrapping the optimal number of network communities, and is reported to outperform traditional clustering methods.
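
The pipeline described in the abstract (topic distributions, a Jensen-Shannon-based similarity matrix, an adjacency matrix, then graph-based grouping) can be illustrated with a minimal sketch. This is not the author's implementation. The abstract does not give the exact similarity formula, the rule for building the adjacency matrix, or the community detection algorithm, so the sketch assumes a similarity of 1 minus the square root of the base-2 Jensen-Shannon divergence, a hypothetical threshold of 0.5, and connected components as a simple stand-in for a real community detection algorithm. All function names and the example data are invented for illustration.

```python
import math
from collections import deque


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL divergence.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def similarity_matrix(dists):
    """ASSUMPTION: similarity = 1 - sqrt(JSD); the paper's metric may differ."""
    n = len(dists)
    return [
        [1.0 - math.sqrt(max(js_divergence(dists[i], dists[j]), 0.0))
         for j in range(n)]
        for i in range(n)
    ]


def adjacency_matrix(sim, threshold=0.5):
    """ASSUMPTION: connect two documents when similarity >= a fixed threshold."""
    n = len(sim)
    return [
        [1 if i != j and sim[i][j] >= threshold else 0 for j in range(n)]
        for i in range(n)
    ]


def connected_components(adj):
    """Stand-in for community detection: label each connected component."""
    n = len(adj)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if adj[u][v] and labels[v] == -1:
                    labels[v] = current
                    queue.append(v)
        current += 1
    return labels


def cluster_documents(dists, threshold=0.5):
    """Run the three steps end to end and return one label per document."""
    sim = similarity_matrix(dists)
    return connected_components(adjacency_matrix(sim, threshold))


# Hypothetical topic distributions for four documents over three topics.
example_docs = [
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.10, 0.10, 0.80],
    [0.05, 0.15, 0.80],
]
example_labels = cluster_documents(example_docs)
```

In this toy example the first two documents end up in one group and the last two in another, even though a most-probable-topic assignment would ignore the rest of each distribution. In practice the connected-components step would be replaced by a proper community detection algorithm, and the threshold would be tuned rather than fixed.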
2024
ISBN: 9788855096454
Keywords: Document Clustering; Topic Modeling; Textual Similarity Metric
Use this identifier to cite or link to this document: https://hdl.handle.net/11584/426903