Learning from high-dimensional biomedical datasets: The issue of class imbalance

Pes B. (first author)
2020-01-01

Abstract

As witnessed by a vast body of literature, dimensionality reduction is a fundamental step in biomedical data analysis. Indeed, in this domain there is often the need to cope with a huge number of data attributes (or features). By removing irrelevant or redundant attributes, feature selection techniques can significantly reduce the complexity of the original problem, with important benefits in terms of domain understanding and knowledge discovery. When learning from biomedical data, however, the dimensionality issue is often addressed without jointly considering other critical aspects that may compromise the performance of the induced models. The adverse implications of an imbalanced class distribution, for example, are often neglected in this domain. The aim of this work is to investigate the effectiveness of hybrid learning strategies that incorporate both methods for dimensionality reduction and methods for alleviating class imbalance. Specifically, we combine different feature selection techniques, both univariate and multivariate, with sampling-based class balancing methods and cost-sensitive classification. The performance of the resulting learning schemes is experimentally evaluated on six high-dimensional genomic benchmarks, using different classification algorithms, yielding interesting insights into the best strategies to adopt based on the characteristics of the data at hand.
Keywords: Bioinformatics; Class imbalance; Cost-sensitive classification; Feature selection; High-dimensional data analysis; Random forest; Random under-sampling; SMOTE over-sampling
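
For illustration, the minimal sketch below shows one hybrid scheme of the kind described in the abstract: univariate feature selection (ANOVA F-test) combined with SMOTE over-sampling and a random forest classifier, evaluated with stratified cross-validation. It assumes scikit-learn and imbalanced-learn are available; the synthetic data and all parameter values (number of selected features, classifier settings) are illustrative placeholders, not the paper's actual experimental setup.

# Minimal sketch (assumed setup, not the paper's exact configuration):
# univariate feature selection + SMOTE over-sampling + random forest,
# evaluated with stratified cross-validation.
# Requires scikit-learn and imbalanced-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # applies resampling only to training folds

# Synthetic stand-in for a high-dimensional, imbalanced genomic dataset.
X, y = make_classification(n_samples=200, n_features=2000, n_informative=50,
                           weights=[0.9, 0.1], random_state=42)

scheme = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=100)),  # univariate ranking
    ("balance", SMOTE(random_state=42)),                   # over-sample the minority class
    ("classify", RandomForestClassifier(n_estimators=100, random_state=42)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(scheme, X, y, cv=cv, scoring="roc_auc")
print(f"ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")

A cost-sensitive variant would drop the resampling step and rely on class weighting instead (e.g., RandomForestClassifier(class_weight="balanced")), while random under-sampling could be swapped in via imblearn's RandomUnderSampler.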
File in this product:
File: ACCESS_published_final.pdf (open access)
Description: Main article
Type: publisher's version (VoR)
Size: 6.38 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11584/287430
Citations
  • PMC: not available
  • Scopus: 32
  • Web of Science: 29