An Experimental Analysis of Semi-supervised Learning for Malware Detection
Luca Minnei;Giorgio Piras;Angelo Sotgiu;Maura Pintor;Ambra Demontis;Davide Maiorca;Battista Biggio
2025-01-01
Abstract
In recent years, the widespread use of the Android operating system on mobile devices has attracted a growing number of cyber-attackers, who exploit related vulnerabilities to create Android malware. While such malware represents a major threat in the security landscape, machine learning algorithms trained on a collection of goodware and malware data have been shown to detect it effectively. However, the domain in which such data lies changes over time as applications evolve, for example through software updates or the deprecation of API calls, and the numbers of malware and goodware examples are typically imbalanced. Hence, while machine-learning detectors are effective solutions, their performance must keep up with domain evolution and class imbalance, which can result in frequent and expensive retraining. In this work, we perform a preliminary experimental investigation of semi-supervised learning to retrain machine learning-based malware detectors using pseudo-labels along with a small pool of labeled samples. In detail, we account for class imbalance by considering self-training with class-specific thresholds. Our results show that we improve classification performance by using approximately 10% of pseudo-labels in each retraining round.
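To make the idea of self-training with class-specific thresholds concrete, the following is a minimal sketch of the pseudo-label selection step only. It is not the authors' implementation: the function name `select_pseudo_labels`, the label encoding (1 for malware, 0 for goodware), and the threshold values are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def select_pseudo_labels(
    malware_probs: Sequence[float],
    malware_threshold: float = 0.90,   # placeholder value, not from the paper
    goodware_threshold: float = 0.95,  # placeholder value, not from the paper
) -> List[Tuple[int, int]]:
    """Return (sample_index, pseudo_label) pairs for confident predictions.

    malware_probs[i] is the detector's estimated probability that unlabeled
    sample i is malware. Assumed encoding: 1 = malware, 0 = goodware.
    """
    selected: List[Tuple[int, int]] = []
    for i, p in enumerate(malware_probs):
        if p >= malware_threshold:
            # Confident enough to pseudo-label as malware.
            selected.append((i, 1))
        elif (1.0 - p) >= goodware_threshold:
            # Confident enough to pseudo-label as goodware.
            selected.append((i, 0))
        # Otherwise the prediction is too uncertain; the sample stays unlabeled.
    return selected


# Hypothetical detector outputs for five unlabeled apps.
probs = [0.97, 0.50, 0.02, 0.91, 0.08]
pseudo = select_pseudo_labels(probs)
print(pseudo)  # [(0, 1), (2, 0), (3, 1)]
```

In a retraining round, the selected pairs would be added to the small labeled pool before the detector is refit. Using a separate threshold per class lets the acceptance rate for each class be tuned independently, which is one way to keep the majority class from dominating the pseudo-labeled set.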
File | Description | Type | Size | Format
---|---|---|---|---
paper57.pdf (open access) | Publisher's version | Version of Record (VoR) | 1.09 MB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.