Generalized Structured Component Analysis for Topic Modeling

Ortu, Marco
2025-01-01

Abstract

This paper presents a novel methodological framework for discovering and analyzing topic relationships in document collections using Generalized Structured Component Analysis (GSCA). While traditional document clustering approaches often rely on dominant topic assignment or simple similarity measures, our method leverages the full probabilistic nature of topic distributions and uncovers complex structural relationships between topics. We propose an unsupervised approach that starts with a fully connected path matrix and systematically identifies significant relationships through a rigorous statistical procedure combining bootstrap-based validation, regularization, and out-of-bag prediction error assessment. The method accommodates optional covariates and provides a robust alternative to conventional clustering techniques. Our framework contributes to both the theoretical understanding of topic relationships and practical applications in document organization and knowledge discovery.
Year: 2025
ISBN: 978-3-032-03041-2
Keywords: Structural Topic Modeling, General Component Analysis, Document Clustering

Use this identifier to cite or link to this document: https://hdl.handle.net/11584/454466