UNICA IRIS Institutional Research Information System

A comparative analysis of active learning strategies for Android malware detection

Manca, Cristian; Minnei, Luca; Pintor, Maura; Brau, Fabio; Biggio, Battista

2025-01-01

Abstract

Android malware detectors increasingly rely on machine learning algorithms trained on datasets containing both benign (goodware) and malicious (malware) applications. These detectors achieve excellent results when the training and test sets are collected over a fixed period. However, recent research indicates that the domain is not static: applications change continuously, which can degrade detector performance over time (concept drift). The most effective way to maintain performance is to retrain the detectors continuously so that their knowledge stays up to date. Labeling data is costly, however, because each sample requires analysis by a specialist. One straightforward approach is active learning (AL), which selects a subset of the most informative samples to be labeled and leaves the rest unlabeled. Despite its potential, there have been few attempts to compare and evaluate existing AL methods. In our study, we test six benchmark strategies to evaluate and compare their effectiveness in the Android malware domain. Our results show that labeling 10% of the data with any of these methods is enough to achieve detector performance closely matching that of a fully supervised model. This confirms that AL can effectively counter concept drift while keeping labeling costs to a minimum.
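The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the six strategies that were compared, so the example below assumes uncertainty sampling, a common AL baseline that may or may not be among them. The function name `select_most_uncertain`, the `budget_fraction` parameter, and the example scores are hypothetical and are not taken from the paper.

```python
def select_most_uncertain(malware_probs, budget_fraction=0.10):
    """Pick the samples a detector is least sure about (uncertainty sampling).

    malware_probs: predicted probability of "malware" for each unlabeled sample.
    budget_fraction: share of the pool to send to a human analyst for labeling
        (0.10 mirrors the 10% budget mentioned in the abstract; the strategy
        itself is an assumption, not the authors' method).
    Returns the indices of the selected samples, in ascending order.
    """
    if not 0.0 < budget_fraction <= 1.0:
        raise ValueError("budget_fraction must be in (0, 1]")

    # Label at least one sample whenever the pool is non-empty.
    n_to_label = max(1, round(budget_fraction * len(malware_probs)))

    # A probability near 0.5 means the binary detector is most uncertain.
    ranked = sorted(
        range(len(malware_probs)),
        key=lambda i: abs(malware_probs[i] - 0.5),
    )
    return sorted(ranked[:n_to_label])


# Hypothetical detector scores for ten unlabeled applications.
scores = [0.02, 0.97, 0.48, 0.10, 0.55, 0.99, 0.01, 0.85, 0.30, 0.05]
to_label = select_most_uncertain(scores, budget_fraction=0.20)
```

In a retraining loop, the selected indices would be sent for expert labeling and added to the training set before the detector is refit, while the remaining samples stay unlabeled.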

Year: 2025
ISBN: 979-8-3315-8736-9
Keywords: Active Learning; Machine Learning; Android; Cybersecurity
Type: 4.1 Contribution in conference proceedings

Files in this item:

File	Access	Description	Type	Size	Format
A_Comparative_Analysis_of_Active_Learning_Strategies_for_Android_Malware_Detection.pdf	Archive managers only	VoR	Publisher's version (VoR)	1.01 MB	Adobe PDF
A+Comparative+Analysis+of+Active+Learning__Iris.pdf	Open access	AAM	Post-print version (AAM)	289.65 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11584/469871
