
Feature Selection on Imbalanced Domains: A Stability-Based Analysis

Pes B.
2023-01-01

Abstract

A large body of literature has shown the beneficial impact of feature selection on the efficiency, interpretability, and generalization ability of machine learning models. Most existing studies, however, focus on the effectiveness of feature selection algorithms in identifying small subsets of predictive features, often neglecting the stability of the selection process, i.e., its robustness with respect to sample variation, which can be crucial for exploiting the results in practice. In particular, little research has so far investigated the stability of feature selection methods in class-imbalanced domains, where some classes are underrepresented and any perturbation of the training records can strongly affect the final selection outcome. This work investigates this issue by studying the stability of different selection algorithms across high-dimensional datasets with different levels of class imbalance. To this end, we discuss a methodological pipeline that allows a joint evaluation of the selection outcome in terms of both stability and final predictive performance. Although not exhaustive, our experiments provide useful insight into which methods can be more stable on imbalanced data while still ensuring good generalization results.
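
To illustrate the notion of selection stability referred to in the abstract (robustness with respect to sample variation), the following is a minimal sketch, not the paper's actual pipeline: it estimates stability as the average pairwise Jaccard similarity of the feature subsets selected on bootstrap resamples of the training data. The selector (univariate ANOVA ranking), subset size, and synthetic imbalanced data are illustrative assumptions.

    # Minimal sketch (not the paper's pipeline): feature-selection stability
    # estimated as the average pairwise Jaccard similarity of subsets selected
    # on bootstrap resamples. Selector, subset size, and data are assumptions.
    from itertools import combinations

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif

    def selection_stability(X, y, k=20, n_resamples=10, random_state=0):
        rng = np.random.default_rng(random_state)
        n = X.shape[0]
        subsets = []
        for _ in range(n_resamples):
            idx = rng.choice(n, size=n, replace=True)  # bootstrap resample
            selector = SelectKBest(f_classif, k=k).fit(X[idx], y[idx])
            subsets.append(set(np.flatnonzero(selector.get_support())))
        # Average Jaccard similarity over all pairs of selected subsets
        sims = [len(a & b) / len(a | b) for a, b in combinations(subsets, 2)]
        return float(np.mean(sims))

    if __name__ == "__main__":
        # Synthetic high-dimensional, class-imbalanced data (90% vs 10%)
        X, y = make_classification(n_samples=300, n_features=1000,
                                   n_informative=30, weights=[0.9, 0.1],
                                   random_state=0)
        print(f"Estimated stability: {selection_stability(X, y):.3f}")

A value close to 1 indicates that the same features tend to be selected regardless of the specific training sample, while values near 0 indicate a selection outcome that is highly sensitive to sample perturbations.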
Year: 2023
ISBN: 9783031368189; 9783031368196
Keywords: Machine Learning, Feature Selection, Selection Stability, High-dimensional Data, Genomic Data, Class Imbalance

Use this identifier to cite or link to this document: https://hdl.handle.net/11584/403505