Efficient translation of HTML to JSON for enhanced web content production

Di Stefano A.; Ramzan F.; Reforgiato Recupero D.
2026-01-01

Abstract

The automated transformation of unstructured HTML into schema-compliant data is a foundational challenge for platform interoperability and for the scalability of modern no-code web editors. Large Language Models (LLMs), while powerful, are often ill-suited to this task: their inherent stochasticity and computational cost prevent them from guaranteeing the deterministic precision required at scale. This paper addresses the challenge by introducing a novel, deterministic pipeline that translates arbitrary HTML emails into the proprietary, grid-based JSON of the Beefree content platform. Our core contributions are: (1) a hybrid methodology that combines Document Object Model (DOM) analysis for semantics with computer vision for geometric layout interpretation; (2) a vision-based abstraction technique that uses visual placeholders for robust row-column detection and is resilient to DOM structural variations; and (3) a rigorous, dual-faceted validation of its real-world viability through a large-scale assessment on over 16,000 HTML emails and qualitative usability studies using the System Usability Scale (SUS) with 16 industry professionals and 10 academic researchers. The results confirm that our deterministic, vision-augmented approach is a highly effective and scalable alternative to generative models for structured content creation in production environments.

Keywords

HTML management
HTML translation
Large language models
Web content production

Use this identifier to cite or link to this document: https://hdl.handle.net/11584/480251