UNICA IRIS Institutional Research Information System

In recent years, the wide use of the Android operating system for mobile devices has encouraged a likewise increasing number of cyber-attackers, which exploit related vulnerabilities to create Android Malware. While these represent a major threat in the security landscape, it has been shown how machine learning algorithms, trained over a collection of goodware and malware data, can effectively detect their presence. However, the domain in which such data lies changes over time due to the evolution of applications, such as software updates or deprecation of API calls, and the amount of malware and goodware examples are typically imbalanced. Hence, while machine-learning detectors are effective solutions, their performance must keep up with domain evolution and class imbalance, which can, however, result in frequent expensive retraining. In this work, we perform a preliminary experimental investigation of semi-supervised learning to retrain machine learning-based malware detectors using pseudo-labels along with a small pool of labeled samples. In detail, we account for class imbalance by considering self-training with class-specific thresholds. Our results show that we improve the classification performances by using approximately 10% of pseudo labels in each re-training round.

An Experimental Analysis of Semi-supervised Learning for Malware Detection

Luca Minnei;Giorgio Piras;Angelo Sotgiu;Maura Pintor;Ambra Demontis;Davide Maiorca;Battista Biggio

2025-01-01

Abstract

In recent years, the wide use of the Android operating system for mobile devices has encouraged a likewise increasing number of cyber-attackers, which exploit related vulnerabilities to create Android Malware. While these represent a major threat in the security landscape, it has been shown how machine learning algorithms, trained over a collection of goodware and malware data, can effectively detect their presence. However, the domain in which such data lies changes over time due to the evolution of applications, such as software updates or deprecation of API calls, and the amount of malware and goodware examples are typically imbalanced. Hence, while machine-learning detectors are effective solutions, their performance must keep up with domain evolution and class imbalance, which can, however, result in frequent expensive retraining. In this work, we perform a preliminary experimental investigation of semi-supervised learning to retrain machine learning-based malware detectors using pseudo-labels along with a small pool of labeled samples. In detail, we account for class imbalance by considering self-training with class-specific thresholds. Our results show that we improve the classification performances by using approximately 10% of pseudo labels in each re-training round.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno
	
				2025
			
	Parole chiave
	
				Semi-Supervised Learning; Android Malware Detection; Concept Drift, Active Learning
			
	Tipologia:
	
				4.1 Contributo in Atti di convegno

File in questo prodotto:

File	Dimensione	Formato
paper57.pdf accesso aperto Descrizione: Versione Editoriale Tipologia: versione editoriale (VoR) Dimensione 1.09 MB Formato Adobe PDF Visualizza/Apri	1.09 MB	Adobe PDF	Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11584/443967

Citazioni

ND

ND

ND

social impact