A comparative analysis of active learning strategies for Android malware detection

Manca, Cristian;Minnei, Luca;Pintor, Maura;Brau, Fabio;Biggio, Battista
2025-01-01

Abstract

Android malware detectors increasingly rely on machine learning algorithms trained on datasets that contain both benign (goodware) and malicious (malware) applications. These detectors show excellent results when the training and testing sets are collected over the same fixed period. However, recent research indicates that the domain is not static: applications change continuously, which can degrade detector performance over time. The most effective way to maintain performance is to retrain the detectors continuously so that their knowledge stays up to date. Labeling data is costly, however, because each sample requires analysis by a specialist. One straightforward approach is active learning (AL), which selects a subset of the most informative samples to be labeled and leaves the rest unlabeled. Despite its potential, there have been few attempts to compare and evaluate existing AL methods. In our study, we test six benchmark strategies and compare their effectiveness in the Android malware domain. Our results show that labeling 10% of the data with any of these methods is enough to achieve detector performance closely matching that of a fully supervised model. This confirms that AL can effectively counter concept drift while keeping labeling costs to a minimum.
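The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the six strategies that were tested, so the example below assumes uncertainty sampling, a common AL baseline, purely for illustration; it is not necessarily one of the authors' methods. The function name `select_for_labeling`, the parameter `budget_fraction`, and the scores in `pool_scores` are all hypothetical.

```python
import math


def select_for_labeling(probabilities, budget_fraction=0.10):
    """Return indices of the most uncertain samples in an unlabeled pool.

    probabilities: a detector's predicted P(malware) for each unlabeled sample.
    budget_fraction: share of the pool that analysts can afford to label
        (hypothetical parameter; 0.10 mirrors the 10% figure in the abstract).
    """
    if not 0.0 < budget_fraction <= 1.0:
        raise ValueError("budget_fraction must be in (0, 1]")
    budget = max(1, math.ceil(budget_fraction * len(probabilities)))
    # Assumed strategy: a score close to 0.5 means the detector is unsure,
    # so those samples are treated as the most informative ones.
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:budget]


# Made-up scores for a pool of ten unlabeled applications.
pool_scores = [0.02, 0.97, 0.51, 0.10, 0.45, 0.88, 0.99, 0.03, 0.60, 0.75]
chosen = select_for_labeling(pool_scores, budget_fraction=0.10)
print(chosen)  # indices to send to a specialist for labeling
```

In a retraining loop, the selected samples would be labeled by a specialist, added to the training set, and the detector retrained before the next batch of applications arrives.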
Year: 2025
ISBN: 979-8-3315-8736-9
Keywords: Active Learning; Machine Learning; Android; Cybersecurity
Files in this item:

A_Comparative_Analysis_of_Active_Learning_Strategies_for_Android_Malware_Detection.pdf
Access: archive managers only
Description: VoR
Type: publisher's version (VoR)
Size: 1.01 MB
Format: Adobe PDF

A+Comparative+Analysis+of+Active+Learning__Iris.pdf
Access: open access
Description: AAM
Type: post-print version (AAM)
Size: 289.65 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11584/469871