Semisupervised clustering extends standard clustering methods to the semisupervised setting, in some cases consideringsituations when clusters are associated with a given outcome variable that acts as a “noisy surrogate,” that is a good proxy of the unknown clustering structure. In this article, a novel approach to semisupervised clustering associated with an outcome variable named network-based semisupervised clustering (NeSSC) is introduced. It combines an initialization, a training and an agglomeration phase. In the initialization and training a matrix of pairwise affinity of the instances is estimated by a classifier. In the agglomeration phase the matrix of pairwise affinity is transformed into a complex network, in which a community detection algorithm searches the underlying community structure. Thus, a partition of the instances into clusters highly homogeneous in terms of the outcome is obtained. We consider a particular specification of NeSSC that uses classification or regression trees as classifiers and the Louvain, Label propagation and Walktrap as possible community detection algorithm. NeSSC’s stopping criterion and the choice of the optimal partition of the original data are also discussed. Several applications on both real and simulated data are presented to demonstrate the effectiveness of the proposed semisupervised clustering method and the benefits it provides in terms of improved interpretability of results with respect to three alternative semisupervised clustering methods.

Network-based semisupervised clustering

Frigau, L.
Methodology
;
Contu, G.;Mola, F.
Supervision
;
Conversano, C.
Methodology
2021-01-01

Abstract

Semisupervised clustering extends standard clustering methods to the semisupervised setting, in some cases consideringsituations when clusters are associated with a given outcome variable that acts as a “noisy surrogate,” that is a good proxy of the unknown clustering structure. In this article, a novel approach to semisupervised clustering associated with an outcome variable named network-based semisupervised clustering (NeSSC) is introduced. It combines an initialization, a training and an agglomeration phase. In the initialization and training a matrix of pairwise affinity of the instances is estimated by a classifier. In the agglomeration phase the matrix of pairwise affinity is transformed into a complex network, in which a community detection algorithm searches the underlying community structure. Thus, a partition of the instances into clusters highly homogeneous in terms of the outcome is obtained. We consider a particular specification of NeSSC that uses classification or regression trees as classifiers and the Louvain, Label propagation and Walktrap as possible community detection algorithm. NeSSC’s stopping criterion and the choice of the optimal partition of the original data are also discussed. Several applications on both real and simulated data are presented to demonstrate the effectiveness of the proposed semisupervised clustering method and the benefits it provides in terms of improved interpretability of results with respect to three alternative semisupervised clustering methods.
2021
Community detection; complex networks; label propagation; Louvain; tree-based classifiers; Walktrap
File in questo prodotto:
File Dimensione Formato  
asmb.2618 (1).pdf

Solo gestori archivio

Tipologia: versione post-print
Dimensione 1.18 MB
Formato Adobe PDF
1.18 MB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11584/311843
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 2
  • ???jsp.display-item.citation.isi??? 2
social impact