The evaluation of the intrinsic complexity of a supervised domain plays an important role in devising classification systems. Typically, the metrics used for this purpose produce an overall evaluation of the domain, without localizing the sources of complexity. In this work we propose a method for partitioning the feature space into subsets of different complexity. The most important outcome of the method is the possibility of preliminarily identifying hard and easy regions of the feature space. This possibility opens interesting theoretical and pragmatic scenarios, including the analysis of the classification error and the implementation of robust classification systems. A first group of experiments has been performed on synthetic datasets, devised to separately highlight specific and recurrent problems often found in real-world domains. In particular, the focus has been on class boundaries, noise, and density of samples. A second group of experiments, performed on selected real-world datasets, confirm the validity of the proposed method. The ultimate goal of our research is to devise a method for estimating the classification difficulty of a dataset. The proposed method makes a significant step in this direction, as it is able to partition a given dataset according to the inherent complexity of the samples contained therein.
A Novel Method for Partitioning Feature Spaces According to their Inherent Classification Complexity
ARMANO, GIULIANO;
2013-01-01
Abstract
The evaluation of the intrinsic complexity of a supervised domain plays an important role in devising classification systems. Typically, the metrics used for this purpose produce an overall evaluation of the domain, without localizing the sources of complexity. In this work we propose a method for partitioning the feature space into subsets of different complexity. The most important outcome of the method is the possibility of preliminarily identifying hard and easy regions of the feature space. This possibility opens interesting theoretical and pragmatic scenarios, including the analysis of the classification error and the implementation of robust classification systems. A first group of experiments has been performed on synthetic datasets, devised to separately highlight specific and recurrent problems often found in real-world domains. In particular, the focus has been on class boundaries, noise, and density of samples. A second group of experiments, performed on selected real-world datasets, confirm the validity of the proposed method. The ultimate goal of our research is to devise a method for estimating the classification difficulty of a dataset. The proposed method makes a significant step in this direction, as it is able to partition a given dataset according to the inherent complexity of the samples contained therein.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.