Reasoned or Rapid code? Unveiling the strengths and limits of DeepSeek for Solidity development
Gavina Baralla; Giacomo Ibba; Roberto Tonelli
2026-01-01
Abstract
Context: As blockchain systems grow in complexity, secure and efficient smart contract development remains a crucial challenge. Large Language Models (LLMs) like DeepSeek promise significant enhancements in developer productivity through automated code generation, debugging, and testing. This study focuses on Solidity, the dominant language for Ethereum smart contracts, where correctness, gas efficiency, and security are critical to real-world adoption.

Objective: This study evaluates the capabilities of DeepSeek's V3 and R1 models, a non-reasoning Mixture-of-Experts architecture and a reasoning-based model trained via reinforcement learning, respectively, in automating Solidity contract generation and testing, as well as identifying and fixing common vulnerabilities.

Methods: We designed a controlled experimental framework to evaluate both models by generating and analysing a diverse set of smart contracts, including standardised tokens (ERC20, ERC721, ERC1155) and real-world application scenarios (Supply Chain, Token Exchange, Auction). The evaluation is grounded in a multidimensional metric suite covering quality, technical robustness and process characteristics. Vulnerability detection and patching capabilities are tested using predefined vulnerable contracts and guided patch prompts. The analysis spans six levels of prompt complexity and compares the impact of reasoning-based and non-reasoning-based generation strategies.

Results: Findings reveal that R1 delivers more accurate and optimised outputs under high complexity, while V3 performs more consistently on simpler tasks with simpler code structures. However, both models exhibit persistent hallucinations, limited vulnerability coverage, and inconsistencies arising from prompt formulation. The correlation between re-evaluation patterns and output quality suggests that reasoning helps in complex scenarios, although excessive revisions may lead to over-engineered or unstable solutions.

Conclusions: Neither model is robust enough to autonomously generate issue-free smart contracts in complex or security-critical scenarios, underscoring the need for human oversight. These findings highlight best practices for integrating LLMs into blockchain development workflows and emphasise the importance of aligning model selection with task complexity and security requirements.
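As an illustration of the vulnerability detection and patching tasks the abstract describes, the sketch below shows a classic reentrancy flaw and its checks-effects-interactions fix, one of the common Solidity vulnerability patterns such evaluations typically target. The paper's actual vulnerable contracts and patch prompts are not reproduced here; the contract names and code are illustrative assumptions only.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

// Vulnerable pattern (hypothetical example): the external call happens
// before the balance is updated, so a malicious receiver contract can
// re-enter withdraw() and drain funds.
contract VulnerableVault {
    mapping(address => uint256) public balances;

    function deposit() external payable {
        balances[msg.sender] += msg.value;
    }

    function withdraw() external {
        uint256 amount = balances[msg.sender];
        require(amount > 0, "nothing to withdraw");
        // Interaction before effect: reentrancy window.
        (bool ok, ) = msg.sender.call{value: amount}("");
        require(ok, "transfer failed");
        balances[msg.sender] = 0;
    }
}

// Patched version: checks-effects-interactions ordering.
contract PatchedVault {
    mapping(address => uint256) public balances;

    function deposit() external payable {
        balances[msg.sender] += msg.value;
    }

    function withdraw() external {
        uint256 amount = balances[msg.sender];
        require(amount > 0, "nothing to withdraw");
        balances[msg.sender] = 0; // effect before interaction
        (bool ok, ) = msg.sender.call{value: amount}("");
        require(ok, "transfer failed");
    }
}
```

The patch simply zeroes the balance before the external call, so a re-entering receiver finds nothing left to withdraw; a guided patch prompt of the kind the abstract mentions would ask a model to perform exactly this sort of reordering.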
| File | Size | Format | |
|---|---|---|---|
| 1-s2.0-S0950584925002563-main.pdf (open access; description: online article; type: publisher's version, VoR) | 1.96 MB | Adobe PDF | View/Open |