Topic‐Sentiment Hybrid Networks for Explainable Document Clustering: A Probabilistic Multi‐Dimensional Similarity Analysis
Ortu, Marco
First author
2025-01-01
Abstract
This study introduces a statistical methodology for document clustering that integrates multiple dimensions of textual similarity through network topology analysis. The proposed methodology, Multi-dimensional Similarity Network Analysis (MSNA), extends traditional document-clustering approaches by combining semantic embeddings, topic probability distributions, and emotional probability distributions into a unified similarity measure. We formalize this as a weighted combination of Jensen-Shannon divergences across the different probability spaces, which yields a single comprehensive similarity network. Clustering is performed by a community detection algorithm that optimizes a multi-objective modularity function accounting for the different similarity dimensions. We prove the statistical consistency of the approach and derive bounds on clustering performance under mild regularity conditions. The methodology is validated on a large-scale data set of Airbnb reviews from Sardinia, Italy, containing text content, topic distributions, and emotional features. Results show significant improvements in both clustering quality (a higher average silhouette score) and interpretability compared to traditional single-dimension approaches. Empirically, validation on synthetic data demonstrates robust performance over the tested ranges of topic strength and emotion strength, with a mean Adjusted Rand Index of 0.44. The application to real-world data identifies five distinct clusters through PROCSIMA (PRObabilistic Clustering SIMilarity Analysis), and a subsequent SMARTS (SeMantic Analysis of Review Topics and Sentiment) analysis reveals interpretable community structures within each cluster. The framework's ability to capture the semantic, thematic, and emotional aspects of text simultaneously makes it particularly valuable for customer experience analysis and service quality monitoring.

| File | Access | Type | Size | Format |
|---|---|---|---|---|
| Appl+Stoch+Models+Bus+++Ind+-+2025+-+Ortu+-+Topic‐Sentiment+Hybrid+Networks+for+Explainable+Document+Clustering++A_compressed.pdf | open access | publisher's version (VoR) | 1.89 MB | Adobe PDF |
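The abstract describes the unified similarity as a weighted combination of Jensen-Shannon divergences over several probability spaces. The record gives no formula, so the following is only a minimal sketch of that idea. It assumes that each document carries one discrete distribution per dimension, that similarity in a dimension is taken as one minus the base-2 Jensen-Shannon divergence, and that the weights are normalized by their sum. The function names, the dictionary layout, and the weight values are hypothetical and are not taken from the paper. The semantic-embedding dimension is left out because embeddings are not probability distributions.

```python
import math


def js_divergence(p, q):
    """Base-2 Jensen-Shannon divergence between two discrete distributions.

    With base-2 logarithms the value lies in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Mixture distribution M = (P + Q) / 2.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x_i == 0 contribute 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def combined_similarity(doc_a, doc_b, weights):
    """Weighted average of (1 - JSD) over the dimensions named in `weights`.

    Assumption: this mapping from divergence to similarity is illustrative,
    not the paper's definition.
    """
    total = sum(weights.values())
    return sum(
        w * (1.0 - js_divergence(doc_a[name], doc_b[name]))
        for name, w in weights.items()
    ) / total


# Hypothetical documents: one topic and one emotion distribution each.
doc_a = {"topic": [0.7, 0.2, 0.1], "emotion": [0.5, 0.5]}
doc_b = {"topic": [0.1, 0.2, 0.7], "emotion": [0.5, 0.5]}
weights = {"topic": 0.6, "emotion": 0.4}  # illustrative values only

similarity = combined_similarity(doc_a, doc_b, weights)
```

Pairwise values computed this way could serve as edge weights of a similarity network, on which a community detection algorithm would then be run. The multi-objective modularity function itself is not specified in this record and is not sketched here.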


