Some proposals for combining ensemble classifiers
2012-03-06
Abstract
Most real-world pattern recognition problems are too complex to be handled efficiently with standard classification methods. A large number of classes or feature vectors, dataset sparsity, highly overlapping classes, a small number of available samples, and the need for online training and classification are just some of the complexity issues that should be considered when designing a classification system. Important examples from real-world applications are text categorization (characterized by a large number of categories and words, sparse data, and hierarchical relationships between class concepts) and cancer cell diagnosis based on gene expression microarray data (a high-dimension, low-sample-size, or HDLSS, problem). Combining classifiers (in pattern recognition) or ensemble learning (in machine learning), based on the divide-and-conquer principle, has proved effective in many of these complex situations. Ensemble methods such as bagging, boosting, Error-Correcting Output Codes (ECOC), Mixture of Experts (ME), and random forests use a combination of simple classifiers working cooperatively, instead of a single classifier responsible for the entire task. These approaches can obtain better recognition rates (bias reduction) and also stabilize predictions (variance reduction). This PhD thesis focuses mainly on theoretical and practical aspects of ensemble learning and multiple classifier systems. The novelty comes from developing new ideas by extending some classical approaches and standard algorithms, such as ME, random oracles, and ECOC. Newer versions of ME, hierarchical ME (HME), and the random oracle have been proposed, boosting their accuracy and efficiency. The standard ECOC method has also been extended, giving rise to Multi-Label ECOC (hereinafter ML-ECOC).
The proposed ideas and methods have been assessed not only on publicly available benchmark datasets (from the UCI repository) but also on data from real-world application areas. The thesis is organized as follows. Chapter 1 gives a quick introduction to multiple classifier systems and briefly recalls some important algorithms described in the literature (such as bagging, boosting, random oracle, ME, and ECOC). The main characteristics, pros, and cons of these algorithms are also reported; this information may help identify which methods are expected to work better for a given problem. Chapter 2 describes the proposed Random Prototype-based Oracle (RPO) method, which is an ensemble of mini-ensembles. Inspired by the random linear oracle model, the proposed method divides the problem space (i.e., the sample space) into smaller and hopefully simpler subspaces using randomly selected prototype points, rather than the hyperplanes used by standard linear oracles. RPO has the advantage of decomposing the original problem into two or more subspaces, whereas linear oracles are limited to binary decomposition. Continuing with the idea of randomly decomposing a complex problem, chapter 3 proposes the Mixture of Random Prototype-based Experts (MRPE), together with its hierarchical version. The main idea of this method is to embed a random prototype for each expert in the ME framework. As a result, a simple distance-based rule can be used in both the training and operation phases of an ME ensemble, instead of a trainable gating network that must be trained together with the experts. This simple modification boosts accuracy while reducing the overall time required to train the ensemble. Finally, chapter 4 is about ECOC, applied to a text categorization problem.
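The prototype-based decomposition described for chapters 2 and 3 can be illustrated with a minimal sketch. It is not the thesis implementation: the Euclidean distance, the hard nearest-prototype assignment, the choice of prototypes from the training points, and all function names are assumptions made only for illustration.

```python
import math
import random


def pick_prototypes(samples, k, seed=0):
    """Choose k distinct training points at random to act as prototypes (assumed scheme)."""
    rng = random.Random(seed)
    return rng.sample(samples, k)


def nearest_prototype(x, prototypes):
    """Return the index of the prototype closest to x (Euclidean distance, an assumption)."""
    return min(range(len(prototypes)), key=lambda i: math.dist(x, prototypes[i]))


def partition(samples, prototypes):
    """Split the sample indices into one region per prototype."""
    regions = [[] for _ in prototypes]
    for idx, x in enumerate(samples):
        regions[nearest_prototype(x, prototypes)].append(idx)
    return regions


# Toy 2-D data; any k >= 2 gives a k-way split, unlike a single hyperplane.
samples = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (9.0, 0.5), (8.8, 0.1)]
prototypes = pick_prototypes(samples, k=3)
regions = partition(samples, prototypes)
```

In this sketch each region would be handed to its own classifier (or expert) for training, and the same `nearest_prototype` rule would route a new sample at prediction time, so no gating network has to be trained.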
In the first subsection, we propose a metric extracted from ECOC decoding to better evaluate the label assigned by the classifier. This ECOC-based reliability measure can be used to increase confidence in the classifier’s output on inputs with a high risk of mislabeling. The second part of the chapter extends the ECOC algorithm to multi-label problems. To validate the proposed ML-ECOC and compare its performance with state-of-the-art methods, we apply ML-ECOC to the real-world problem of multi-label text categorization.
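As a rough illustration of how a reliability score can be read off ECOC decoding, the sketch below decodes by Hamming distance and reports the gap between the two closest codewords. The abstract does not define the thesis metric, so this margin, the 7-bit code matrix, and the names `hamming` and `decode` are assumptions chosen for the example only.

```python
def hamming(a, b):
    """Number of positions where two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical 4-class, 7-bit code matrix; every pair of codewords differs in 4 bits.
CODE_MATRIX = {
    "c1": (1, 1, 1, 1, 1, 1, 1),
    "c2": (0, 0, 0, 0, 1, 1, 1),
    "c3": (0, 0, 1, 1, 0, 0, 1),
    "c4": (0, 1, 0, 1, 0, 1, 0),
}


def decode(outputs, code_matrix=CODE_MATRIX):
    """Return (label, margin) for the binary outputs of the dichotomizers.

    margin is the normalized gap between the closest and second-closest
    codewords; a small margin flags a decision at higher risk of mislabeling.
    This margin is an assumed stand-in, not the measure proposed in the thesis.
    """
    dists = sorted((hamming(outputs, cw), label) for label, cw in code_matrix.items())
    (best_d, best_label), (second_d, _) = dists[0], dists[1]
    return best_label, (second_d - best_d) / len(outputs)


clean_label, clean_margin = decode((1, 1, 1, 1, 1, 1, 1))   # no bit errors
noisy_label, noisy_margin = decode((0, 1, 1, 1, 1, 1, 1))   # one flipped bit
```

With one flipped bit the decoded label is unchanged but the margin shrinks, which is the kind of signal a reliability measure could use to flag or reject uncertain predictions.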
File | Size | Format |
---|---|---|
PhD_Hatami_Nima.pdf (open access; type: doctoral thesis) | 1.04 MB | Adobe PDF |