Generalized Structured Component Analysis for Topic Modeling
Ortu, Marco
2025-01-01
Abstract
This paper presents a novel methodological framework for discovering and analyzing topic relationships in document collections using Generalized Structured Component Analysis (GSCA). While traditional document clustering approaches often rely on dominant topic assignment or simple similarity measures, our method leverages the full probabilistic nature of topic distributions and uncovers complex structural relationships between topics. We propose an unsupervised approach that starts with a fully connected path matrix and systematically identifies significant relationships through a rigorous statistical procedure combining bootstrap-based validation, regularization, and out-of-bag prediction error assessment. The method accommodates optional covariates and provides a robust alternative to conventional clustering techniques. Our framework contributes to both the theoretical understanding of topic relationships and practical applications in document organization and knowledge discovery.
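
The pruning idea summarized in the abstract (a fully connected path matrix, bootstrap validation, regularization, and out-of-bag error) can be illustrated with a minimal sketch. This is not the paper's GSCA estimator: it swaps in plain ridge regression on observed topic scores as a simplified stand-in, and the function names `ridge_paths` and `bootstrap_paths`, the penalty `lam`, the number of resamples `n_boot`, and the percentile-interval rule are all assumptions made for illustration.

```python
import numpy as np

def ridge_paths(X, lam):
    """Fully connected path matrix via ridge regression (assumed stand-in for GSCA).
    B[j, k] is the coefficient of topic j in the equation predicting topic k."""
    n, K = X.shape
    B = np.zeros((K, K))
    for k in range(K):
        others = [j for j in range(K) if j != k]  # every other topic is a candidate predictor
        Z = X[:, others]
        coef = np.linalg.solve(Z.T @ Z + lam * np.eye(K - 1), Z.T @ X[:, k])
        B[others, k] = coef
    return B

def bootstrap_paths(X, n_boot=200, lam=1.0, alpha=0.05, seed=0):
    """Keep a path when its bootstrap percentile interval excludes zero,
    and report the mean squared prediction error on out-of-bag documents."""
    rng = np.random.default_rng(seed)
    n, K = X.shape
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize each topic score
    draws = np.empty((n_boot, K, K))
    oob_errors = []
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)           # resample documents with replacement
        oob = np.setdiff1d(np.arange(n), idx)      # documents left out of this resample
        B = ridge_paths(Xs[idx], lam)
        draws[b] = B
        if oob.size > 0:
            oob_errors.append(np.mean((Xs[oob] - Xs[oob] @ B) ** 2))
    lower = np.quantile(draws, alpha / 2, axis=0)
    upper = np.quantile(draws, 1 - alpha / 2, axis=0)
    keep = (lower > 0) | (upper < 0)               # interval excludes zero
    return draws.mean(axis=0), keep, float(np.mean(oob_errors))

# Synthetic scores (hypothetical data): topic 1 depends on topic 0, topic 2 is independent.
rng = np.random.default_rng(42)
t0 = rng.normal(size=300)
t1 = 0.8 * t0 + 0.3 * rng.normal(size=300)
t2 = rng.normal(size=300)
scores = np.column_stack([t0, t1, t2])

B_mean, keep, oob_mse = bootstrap_paths(scores)
print(keep.astype(int))
print(round(oob_mse, 3))
```

The sketch regresses each topic on all others, so it treats paths symmetrically and does not estimate component weights as GSCA does. Real topic proportions sum to one per document, which makes the topics collinear, so the synthetic scores here are deliberately not compositional.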


