Community Detection Tree-Based Algorithm for semi-supervised clustering

CONTU, GIULIA
2019-02-27

Abstract

Statistical methods evolve over time, keeping pace with the continuously increasing complexity of the phenomena under study and the volume of available information. Different approaches are combined to improve the ability to analyze data and to identify possible relationships among them. In particular, the advantages of supervised and unsupervised learning approaches have been combined in recent decades. There is also a third way, midway between the two: the semi-supervised learning approach. This approach has been applied to different methodologies, for instance cluster analysis. A variant of the traditional clustering paradigms has been proposed that incorporates background knowledge in order to obtain a better partitioning of the data. Different kinds of semi-supervised clustering have been identified in recent studies. Bair (2013) classified them into three approaches: clustering with partially labeled data, clustering with constraints, and clustering associated with an outcome variable. The last category is the least developed in the literature and is generally used only in the medical domain.
The aim of this thesis is to define a semi-supervised clustering method able to identify clusters that are similar with respect to a specific outcome variable. A new algorithm is proposed, based on the combination of two different methodologies: tree-based methods and community detection in networks. This algorithm is called Community Detection Tree-Based Algorithm for Semi-supervised Clustering (CTSC). It aims to define clusters that differ in the value of the response variable and whose elements are similar with respect to that outcome variable. CTSC consists of three phases. The algorithm introduces an innovative element: the clustering problem is transformed into a community detection problem, so that the cluster analysis is carried out through a community detection algorithm, following Arruda et al. (2012), de Oliveira et al. (2008), and Granell et al. (2011, 2012). Moreover, different combinations of trees and community detection algorithms are studied to offer a useful tool for the study of diverse phenomena and datasets, since CTSC can be applied in different research areas and to different datasets.
This thesis is composed of four chapters. The first chapter describes semi-supervised learning, to better define a framework for the present proposal. The second chapter reviews the literature on tree-based methods, in order to evaluate the different algorithms and splitting criteria. The third chapter studies community detection methods. Specifically, several studies on networks and complex networks are analyzed, with particular attention to the different methodologies for identifying communities within networks. Finally, the possible future developments of CTSC are presented.
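The abstract does not spell out the three phases of CTSC, so the following is only an illustrative sketch of the general idea of turning an outcome-guided clustering problem into a community detection problem. It is not the thesis's algorithm. Every design choice is an assumption made for illustration: one-split regression trees (stumps) stand in for the tree-based method, edge weights count how many stumps place two observations in the same leaf, and a simple deterministic label propagation stands in for the community detection step. All function names (`best_split`, `build_network`, `label_propagation`, `ctsc_sketch`) and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def best_split(xs, ys):
    """Threshold on one feature that minimises the within-node sum of
    squares of the outcome (a one-split regression tree). Returns None
    when the feature is constant."""
    def sse(vals):
        if not vals:
            return 0.0
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals)

    cuts = sorted(set(xs))
    best_t, best_cost = None, float("inf")
    for lo, hi in zip(cuts, cuts[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t


def build_network(X, y):
    """Assumed network construction: the weight of edge (i, j) is the
    number of outcome-guided stumps that put i and j in the same leaf."""
    weights = Counter()
    for f in range(len(X[0])):
        col = [row[f] for row in X]
        t = best_split(col, y)
        if t is None:
            continue
        for i, j in combinations(range(len(X)), 2):
            if (col[i] <= t) == (col[j] <= t):
                weights[(i, j)] += 1
    return weights


def label_propagation(n, weights, max_sweeps=100):
    """Weighted label propagation with a fixed visiting order and
    smallest-label tie-breaking, so the result is deterministic."""
    neighbours = {i: {} for i in range(n)}
    for (i, j), w in weights.items():
        neighbours[i][j] = w
        neighbours[j][i] = w
    labels = list(range(n))
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            votes = Counter()
            for j, w in neighbours[i].items():
                votes[labels[j]] += w
            if not votes:
                continue  # isolated node keeps its own label
            top = max(votes.values())
            new = min(lab for lab, v in votes.items() if v == top)
            if new != labels[i]:
                labels[i], changed = new, True
        if not changed:
            break
    return labels


def ctsc_sketch(X, y):
    """Trees -> network -> communities, under the assumptions above."""
    return label_propagation(len(X), build_network(X, y))


# Toy data (made up): two informative features and one noisy feature.
X = [[1.0, 10, 1], [1.2, 11, 3], [0.9, 9, 5],
     [5.0, 30, 2], [5.2, 31, 4], [4.8, 29, 6]]
y = [1.0, 1.1, 0.9, 3.0, 3.2, 2.9]
labels = ctsc_sketch(X, y)
print(labels)  # observations sharing a label form one cluster
```

On this toy data the noisy third feature connects the two groups with weak edges, yet the first three observations (low outcome) and the last three (high outcome) still end up in separate communities. A more realistic variant would presumably use full trees and a modularity-based community detection method, but those choices are beyond what the abstract specifies.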

File: tesi di dottorato_giulia contu.pdf (doctoral thesis; Adobe PDF, 11.45 MB; open access since 2019-08-30)

Use this identifier to cite or link to this document: https://hdl.handle.net/11584/261608