UNICA IRIS Institutional Research Information System

LLM-Based Auto-Categorization of Android Applications

Pierangelo Loi; Diego Soi; Leonardo Regano; Giorgio Giacinto

2026-01-01

Abstract

Many anomaly-based malware detectors implicitly depend on the automatic categorisation of Android apps. Such tools first group apps by their declared functionality and then learn what constitutes "normal" use of sensitive APIs within each group. If this categorisation is wrong or overly coarse, the entire pipeline suffers: benign apps can be flagged as malicious, while malicious or grayware apps can hide among poorly matched neighbours. This highlights the need for precise, robust, and scalable categorisation methods that operate directly on app stores’ metadata and can be plugged into security pipelines. In this paper, we present a free and fully automatic description-based framework for fine-grained categorisation of Android applications, explicitly designed as a drop-in replacement for the categorisation stage in anomaly-based systems. Our pipeline embeds Google Play Store descriptions with a sentence-transformer model, reduces the embeddings with Uniform Manifold Approximation and Projection, and clusters them using K-means while automatically selecting the number of clusters via the Mean Silhouette Coefficient. Finally, it employs a lightweight Large Language Model to generate concise, human-readable labels for each discovered cluster. We evaluate our approach on AndroCatSet, a manually curated ground-truth dataset of 5000 benign apps organised into 50 fine-grained classes. The resulting categorisation component yields semantically coherent and interpretable functional groups, which can be readily integrated into security pipelines to strengthen the detection of miscategorised, malicious, and grayware apps whose actual behaviour diverges from their declared purpose.
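The cluster-count selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the sentence-transformer embedding and UMAP reduction stages are omitted, hand-made 2-D points stand in for the reduced description embeddings, and the helper names (`kmeans`, `mean_silhouette`, `select_k`) are hypothetical. The deterministic farthest-point initialisation is also an assumption, chosen only to keep the example reproducible. The sketch runs K-means for each candidate number of clusters and keeps the one with the highest Mean Silhouette Coefficient.

```python
import math

def kmeans(points, k, iters=50):
    """Plain Lloyd's K-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Next centre: the point farthest from its nearest existing centre.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centre if a cluster empties
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def mean_silhouette(points, labels):
    """Mean of s = (b - a) / max(a, b) over all points; singletons score 0."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def select_k(points, k_range):
    """Return the k with the highest mean silhouette, plus all scores."""
    scores = {k: mean_silhouette(points, kmeans(points, k)) for k in k_range}
    return max(scores, key=scores.get), scores

# Toy stand-in for reduced embeddings: three compact, well-separated groups.
offsets = [(0, 0), (0.5, 0), (-0.5, 0), (0, 0.5), (0, -0.5)]
blob_centres = [(0, 0), (10, 0), (5, 9)]
points = [(cx + dx, cy + dy) for cx, cy in blob_centres for dx, dy in offsets]

best_k, scores = select_k(points, range(2, 7))
print(best_k, {k: round(s, 3) for k, s in scores.items()})
```

On this toy data the selected value matches the three groups built into the points. With real description embeddings the silhouette curve is usually much flatter, so the candidate range and the initialisation strategy matter more than they do here.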

Year: 2026
Keywords: Android; App Categorisation; LLM; Clustering; Embeddings
Type: 4.1 Contribution in conference proceedings

Files in this item:

File	Access	Type	Size	Format
LLM-Based Auto-Categorization of Android Applications.pdf	Open access	Publisher's version (VoR)	1.23 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11584/479126
