UNICA IRIS Institutional Research Information System

Efficient translation of HTML to JSON for enhanced web content production

Di Stefano A.; Fadda M.; Marini C.; Ramzan F.; Reforgiato Recupero D.

2026-01-01

Abstract

The automated transformation of unstructured HTML into schema-compliant data is a foundational challenge for platform interoperability and the scalability of modern no-code web editors. While powerful, Large Language Models (LLMs) are often ill-suited for this task due to their inherent stochasticity and computational cost, failing to guarantee the deterministic precision required at scale. This paper addresses this challenge by introducing a novel, deterministic pipeline to translate arbitrary HTML emails into the proprietary, grid-based JSON of the Beefree content platform. Our core contributions are: (1) a hybrid methodology that combines Document Object Model (DOM) analysis for semantics with computer vision for geometric layout interpretation; (2) a vision-based abstraction technique using visual placeholders for robust row-column detection, resilient to DOM structural variations; and (3) a rigorous, dual-faceted validation of its real-world viability via a large-scale assessment on over 16,000 HTML emails and qualitative usability studies (SUS) with 16 industry professionals and 10 academic researchers. The results confirm our deterministic, vision-augmented approach is a highly effective and scalable alternative to generative models for structured content creation in production environments.
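The row-column detection idea mentioned in contribution (2) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the placeholders have already been rendered and located as pixel bounding boxes, and it swaps in a simple interval-overlap grouping (boxes whose vertical extents overlap form a row, and within a row, boxes whose horizontal extents overlap form a column). The names `Box`, `detect_grid`, and `to_layout`, as well as the output dictionary shape, are hypothetical and do not reflect the Beefree JSON schema.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of one detected placeholder, in pixels (assumed input)."""
    x: int
    y: int
    w: int
    h: int


def _bands(boxes: List[Box],
           start: Callable[[Box], int],
           end: Callable[[Box], int]) -> List[List[Box]]:
    """Split boxes into bands whose [start, end) intervals overlap transitively."""
    bands: List[List[Box]] = []
    limit = None  # furthest end coordinate seen in the current band
    for box in sorted(boxes, key=start):
        if limit is None or start(box) >= limit:
            bands.append([box])          # gap found: open a new band
            limit = end(box)
        else:
            bands[-1].append(box)        # overlaps the current band
            limit = max(limit, end(box))
    return bands


def detect_grid(boxes: List[Box]) -> List[List[List[Box]]]:
    """Return rows, each a list of columns, each a list of boxes."""
    rows = _bands(boxes, lambda b: b.y, lambda b: b.y + b.h)
    return [_bands(row, lambda b: b.x, lambda b: b.x + b.w) for row in rows]


def to_layout(boxes: List[Box]) -> Dict:
    """Serialize the grid into a hypothetical row/column dictionary."""
    layout: Dict = {"rows": []}
    for row in detect_grid(boxes):
        columns = [
            {"x": min(b.x for b in col), "modules": len(col)}
            for col in row
        ]
        layout["rows"].append({"columns": columns})
    return layout


# Illustrative input: a full-width header, then a left column with two
# stacked blocks next to a single tall block on the right.
example_boxes = [
    Box(0, 0, 600, 80),
    Box(0, 100, 290, 90),
    Box(310, 100, 290, 200),
    Box(0, 200, 290, 100),
]
example_layout = to_layout(example_boxes)
```

Because the grouping relies only on rendered geometry, two emails with different table or div nesting but the same visual arrangement would yield the same grid under this sketch, which is the kind of resilience to DOM variation the abstract describes.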

Publication year: 2026

Keywords: HTML management; HTML translation; Large language models; Web content production

Type: 1.1 Journal article

Files in this item:

File	Access	Version	Size	Format
s11042-026-21232-7-1.pdf	Archive managers only	Publisher's version (VoR)	2.38 MB	Adobe PDF
s11042-026-21232-7-1 (1) (1) (1).pdf	Embargoed until 23/01/2027	Post-print version (AAM)	2.22 MB	Adobe PDF

The metadata in IRIS UNICA are released under the Creative Commons CC0 1.0 Universal license, while the publication files are protected by copyright unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11584/480251
