Reasoned or Rapid code? Unveiling the strengths and limits of DeepSeek for Solidity development
Gavina Baralla; Giacomo Ibba; Roberto Tonelli
2026-01-01
Abstract
Context: As blockchain systems grow in complexity, secure and efficient smart contract development remains a crucial challenge. Large Language Models (LLMs) like DeepSeek promise significant enhancements in developer productivity through automated code generation, debugging, and testing. This study focuses on Solidity, the dominant language for Ethereum smart contracts, where correctness, gas efficiency, and security are critical to real-world adoption.

Objective: This study evaluates the capabilities of DeepSeek's V3 and R1 models, a non-reasoning Mixture-of-Experts architecture and a reasoning-based model trained via reinforcement learning, respectively, in automating Solidity contract generation and testing, as well as identifying and fixing common vulnerabilities.

Methods: We designed a controlled experimental framework to evaluate both models by generating and analysing a diverse set of smart contracts, including standardised tokens (ERC20, ERC721, ERC1155) and real-world application scenarios (Supply Chain, Token Exchange, Auction). The evaluation is grounded in a multidimensional metric suite covering quality, technical robustness and process characteristics. Vulnerability detection and patching capabilities are tested using predefined vulnerable contracts and guided patch prompts. The analysis spans six levels of prompt complexity and compares the impact of reasoning-based and non-reasoning-based generation strategies.

Results: Findings reveal that R1 delivers more accurate and optimised outputs under high complexity, while V3 performs more consistently on simpler tasks with simpler code structures. However, both models exhibit persistent hallucinations, limited vulnerability coverage, and inconsistencies arising from prompt formulation. The correlation between re-evaluation patterns and output quality suggests that reasoning helps in complex scenarios, although excessive revisions may lead to over-engineered or unstable solutions.

Conclusions: Neither model is robust enough to autonomously generate issue-free smart contracts in complex or security-critical scenarios, underscoring the need for human oversight. These findings highlight best practices for integrating LLMs into blockchain development workflows and emphasise the importance of aligning model selection with task complexity and security requirements.
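As an illustration of the vulnerability detection and patching tasks the abstract describes, the sketch below shows a classic reentrancy flaw and its checks-effects-interactions fix, one of the common Solidity vulnerability patterns such evaluations typically target. The paper's actual vulnerable contracts and patch prompts are not reproduced here; the contract names and code are illustrative assumptions only.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

// Vulnerable pattern (hypothetical example): the external call happens
// before the balance is updated, so a malicious receiver contract can
// re-enter withdraw() and drain funds.
contract VulnerableVault {
    mapping(address => uint256) public balances;

    function deposit() external payable {
        balances[msg.sender] += msg.value;
    }

    function withdraw() external {
        uint256 amount = balances[msg.sender];
        require(amount > 0, "nothing to withdraw");
        // Interaction before effect: reentrancy window.
        (bool ok, ) = msg.sender.call{value: amount}("");
        require(ok, "transfer failed");
        balances[msg.sender] = 0;
    }
}

// Patched version: checks-effects-interactions ordering.
contract PatchedVault {
    mapping(address => uint256) public balances;

    function deposit() external payable {
        balances[msg.sender] += msg.value;
    }

    function withdraw() external {
        uint256 amount = balances[msg.sender];
        require(amount > 0, "nothing to withdraw");
        balances[msg.sender] = 0; // effect before interaction
        (bool ok, ) = msg.sender.call{value: amount}("");
        require(ok, "transfer failed");
    }
}
```

The patch simply zeroes the balance before the external call, so a re-entering receiver finds nothing left to withdraw; a guided patch prompt of the kind the abstract mentions would ask a model to perform exactly this sort of reordering.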
| File | Size | Format | |
|---|---|---|---|
| 1-s2.0-S0950584925002563-main.pdf (open access; description: online article; type: publisher's version, VoR) | 1.96 MB | Adobe PDF | View/Open |