UNICA IRIS Institutional Research Information System

WBC-CLIP: A multimodal vision-language framework for morphology aware white blood cell analysis

Zedda L.; Mura D. A.; Manzo A.; Di Ruberto C.; Loddo A.

2026-01-01

Abstract

Can the integration of vision and language representations advance artificial intelligence methods for automated white blood cell (WBC) analysis across heterogeneous clinical conditions? Motivated by this question, we present WBC-CLIP, a dual-encoder framework that enhances WBC classification and analysis by combining image data with rich textual descriptions derived from quantitative morphological features. Our method leverages multiple large language models to convert numerical and categorical cell attributes into diverse, semantically enriched textual descriptions. These captions are jointly embedded with their corresponding WBC images using a contrastive learning strategy inspired by the CLIP architecture, enabling the model to learn stable and meaningful cross-modal associations. We evaluate WBC-CLIP through zero-shot classification and image–text retrieval tasks across both in-distribution and out-of-distribution datasets. The framework advances automated WBC analysis while providing improved explainability by explicitly grounding visual representations in morphology-aware textual descriptors, addressing key challenges in computer-aided diagnostics.
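The contrastive image-text alignment and zero-shot classification mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a generic CLIP-style symmetric cross-entropy objective over cosine similarities, and the function names, the temperature value of 0.07, the toy two-dimensional embeddings, and the class labels are all hypothetical choices made for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_logits(image_embs, text_embs, temperature=0.07):
    """Cosine similarity of every image with every caption, divided by a temperature.
    The temperature value is an assumption, not taken from the paper."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]

def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct target for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: image-to-text plus text-to-image, averaged."""
    logits = similarity_logits(image_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(transposed))

def zero_shot_classify(image_emb, class_text_embs):
    """Return the label whose caption embedding is most similar to the image."""
    img = l2_normalize(image_emb)
    scores = {
        label: sum(a * b for a, b in zip(img, l2_normalize(t)))
        for label, t in class_text_embs.items()
    }
    return max(scores, key=scores.get)

# Toy embeddings (hypothetical): matched pairs point in the same direction.
images = [[1.0, 0.0], [0.0, 1.0]]
captions_matched = [[1.0, 0.0], [0.0, 1.0]]
captions_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = clip_style_loss(images, captions_matched)
loss_swapped = clip_style_loss(images, captions_swapped)

# Hypothetical class prompts, already embedded as toy vectors.
prediction = zero_shot_classify(
    [0.9, 0.1],
    {"neutrophil": [1.0, 0.0], "lymphocyte": [0.0, 1.0]},
)
```

In this toy setting the loss is lower when each image is paired with its own caption than when the captions are swapped, which is the behaviour a contrastive objective of this kind rewards. In a real system the embeddings would come from trained image and text encoders rather than hand-written vectors.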

Year of publication: 2026

Keywords: Contrastive learning; Deep learning; Image-text retrieval; LLM; White blood cell analysis; Zero-shot classification

Type: 1.1 Journal article

Files in this record:

File	Access	Description	Type	Size	Format
2026_IMAVIS_WBC-CLIP_A multimodal vision-language framework for morphology aware white blood cell analysis.pdf	Open access	Full article	Publisher's version (VoR)	2.69 MB	Adobe PDF

Metadata in IRIS UNICA are released under the Creative Commons CC0 1.0 Universal license, while publication files are protected by copyright unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11584/482225
