A comparative analysis of active learning strategies for Android malware detection
Manca, Cristian; Minnei, Luca; Pintor, Maura; Brau, Fabio; Biggio, Battista
2025-01-01
Abstract
Android malware detectors increasingly rely on machine learning models trained on datasets containing both benign (goodware) and malicious (malware) applications. These detectors achieve excellent results when the training and testing sets are collected over a fixed period. However, recent research shows that the domain is not static: ongoing changes in applications cause detector performance to decline over time. The most effective way to maintain detector performance is continuous retraining, but labeling new data is costly, since each sample must be analyzed by a specialist. A straightforward remedy is active learning (AL): techniques that select only a subset of the most informative samples for labeling and leave the rest unlabeled. Despite its potential, few studies have compared and evaluated existing AL methods. In this study, we test six benchmark strategies to evaluate and compare their effectiveness in the Android malware domain. Our results show that labeling 10% of the data with any of these methods is enough to closely match the performance of a fully supervised model, confirming that AL can effectively counter concept drift while keeping labeling costs to a minimum.

| File | Description | Size | Format | Access |
|---|---|---|---|---|
| A_Comparative_Analysis_of_Active_Learning_Strategies_for_Android_Malware_Detection.pdf | Publisher's version (VoR) | 1.01 MB | Adobe PDF | Archive managers only (View/Open, Request a copy) |
| A+Comparative+Analysis+of+Active+Learning__Iris.pdf | Post-print (AAM) | 289.65 kB | Adobe PDF | Open access (View/Open) |
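As an illustrative sketch of the kind of AL strategy the abstract alludes to, the loop below implements uncertainty sampling (querying the samples whose predicted probability is closest to 0.5) with scikit-learn on synthetic data. The dataset, classifier, seed size, and query budget here are hypothetical placeholders, not the paper's actual experimental setup.

```python
# Hypothetical uncertainty-sampling sketch (not the paper's code or data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Start from a small labeled seed; the rest forms the unlabeled pool.
labeled = list(rng.choice(len(X), size=50, replace=False))
labeled_set = set(labeled)
pool = [i for i in range(len(X)) if i not in labeled_set]

BUDGET_PER_ROUND, ROUNDS = 30, 5
clf = LogisticRegression(max_iter=1000)
for _ in range(ROUNDS):
    clf.fit(X[labeled], y[labeled])
    # Uncertainty: how close P(malware) is to the decision threshold 0.5.
    proba = clf.predict_proba(X[pool])[:, 1]
    uncertainty = -np.abs(proba - 0.5)
    # Query the most uncertain pool samples for expert labeling.
    query = np.argsort(uncertainty)[-BUDGET_PER_ROUND:]
    for i in sorted(query, reverse=True):  # pop from the back to keep indices valid
        labeled.append(pool.pop(i))

print(f"labeled {len(labeled)} of {len(X)} samples "
      f"({100 * len(labeled) / len(X):.1f}%)")  # → labeled 200 of 2000 samples (10.0%)
```

With 50 seed samples plus 5 rounds of 30 queries, the loop ends with 200 labeled samples, i.e. the 10% labeling budget the abstract reports as sufficient; in a real deployment each queried sample would go to a human analyst instead of reading the label from `y`.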
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


