Data preprocessing in semi-supervised SVM classification
GORGONE, ENRICO
2011-01-01
Abstract
The literature on semi-supervised binary classification has demonstrated that useful information can be gathered not only from samples whose class membership is known in advance, but also from unlabelled ones. In semi-supervised support vector machine models, both labelled and unlabelled samples contribute to the definition of an appropriate optimization model for finding a good-quality separating hyperplane. The optimization approaches devised in this context are basically of two types: a mixed integer linear programming problem, and a continuous optimization problem characterized by a nonsmooth and nonconvex objective function. Both problems become hard to solve as the number of unlabelled points increases. In this article, we present a data preprocessing technique that aims to reduce the number of unlabelled points entering the computational model without significantly degrading the classification performance of the overall process. The approach is based on the concept of separating sets and can be implemented with reasonable computational effort. The results of numerical experiments on several benchmark datasets are also reported. © 2011 Taylor & Francis.
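
To make the nonsmooth, nonconvex continuous formulation mentioned in the abstract more concrete, the following is a minimal sketch of the objective commonly used for semi-supervised SVMs in the literature. It is an assumption that the article's model has this general shape; the function name `s3vm_objective`, the weights `C` and `C_unl`, and the toy data are hypothetical and chosen only for illustration. The term `max(0, 1 - |w·x + b|)` penalizes unlabelled points that fall inside the margin, which is what makes the objective nonconvex and why each unlabelled point adds to the difficulty of the problem.

```python
import numpy as np

def s3vm_objective(w, b, X_lab, y_lab, X_unl, C=1.0, C_unl=1.0):
    """Illustrative semi-supervised SVM objective (not necessarily the article's exact model)."""
    # Hinge loss on labelled points: convex, nonsmooth.
    hinge_lab = np.maximum(0.0, 1.0 - y_lab * (X_lab @ w + b))
    # Penalty on unlabelled points lying inside the margin: nonconvex because of |.|.
    hinge_unl = np.maximum(0.0, 1.0 - np.abs(X_unl @ w + b))
    return 0.5 * (w @ w) + C * hinge_lab.sum() + C_unl * hinge_unl.sum()

# Toy data (hypothetical): two labelled points and two unlabelled points in the plane.
w = np.array([1.0, 0.0])
b = 0.0
X_lab = np.array([[2.0, 0.0], [-2.0, 0.0]])
y_lab = np.array([1.0, -1.0])
X_unl = np.array([[0.5, 1.0], [3.0, -1.0]])

full_value = s3vm_objective(w, b, X_lab, y_lab, X_unl)
# Dropping the unlabelled point that lies inside the margin removes its penalty term.
reduced_value = s3vm_objective(w, b, X_lab, y_lab, X_unl[1:])
```

In this toy case only the first unlabelled point lies inside the margin, so it is the only unlabelled point that contributes to the objective. A preprocessing step that discards unlabelled points shrinks the number of such terms, although the separating-set criterion the article uses to select them is not reproduced here.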