UNICA IRIS Institutional Research Information System

This paper introduces a methodology for software vulnerability detection that combines structural and semantic analysis through software metrics and topic modelling. We evaluate the approach using smart contracts as a case study, focusing on their structural properties and the presence of known security vulnerabilities. We identify the most relevant metrics for vulnerability detection, evaluate multiple machine learning classifiers for both binary and multi-label classification, and improve classification performance by integrating topic modelling techniques. Our analysis shows that metrics such as cyclomatic complexity, nesting depth, and function calls are strongly associated with vulnerability presence. Using these metrics, the Random Forest classifier achieved strong performance in binary classification (AUC: 0.982, accuracy: 0.977, F1-score: 0.808) and multi-label classification (AUC: 0.951, accuracy: 0.729, F1-score: 0.839). The addition of topic modelling using Non-Negative Matrix Factorisation further improved results, increasing the F1-score to 0.881. The evaluation is conducted on Ethereum smart contracts written in Solidity.

A machine learning approach to vulnerability detection combining software metrics and topic modelling: Evidence from smart contracts

Ibba, Giacomo;Neykova, Rumyana;Ortu, Marco;Tonelli, Roberto;Counsell, Steve;Destefanis, Giuseppe

2025-01-01

Abstract

This paper introduces a methodology for software vulnerability detection that combines structural and semantic analysis through software metrics and topic modelling. We evaluate the approach using smart contracts as a case study, focusing on their structural properties and the presence of known security vulnerabilities. We identify the most relevant metrics for vulnerability detection, evaluate multiple machine learning classifiers for both binary and multi-label classification, and improve classification performance by integrating topic modelling techniques. Our analysis shows that metrics such as cyclomatic complexity, nesting depth, and function calls are strongly associated with vulnerability presence. Using these metrics, the Random Forest classifier achieved strong performance in binary classification (AUC: 0.982, accuracy: 0.977, F1-score: 0.808) and multi-label classification (AUC: 0.951, accuracy: 0.729, F1-score: 0.839). The addition of topic modelling using Non-Negative Matrix Factorisation further improved results, increasing the F1-score to 0.881. The evaluation is conducted on Ethereum smart contracts written in Solidity.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno di pubblicazione
	
				2025
			
	Parole chiave
	
				Vulnerability detection
Software metrics
Topic modelling
Machine learning
Source code analysis
Smart contracts
			
	Tipologia:
	
				1.1 Articolo in rivista

File in questo prodotto:

File	Dimensione	Formato
1-s2.0-S2666827025001422-main.pdf accesso aperto Tipologia: versione editoriale (VoR) Dimensione 2.68 MB Formato Adobe PDF Visualizza/Apri	2.68 MB	Adobe PDF	Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11584/462385

Citazioni

ND

ND

0

social impact