WBC-CLIP: A multimodal vision-language framework for morphology aware white blood cell analysis

Zedda L.; Mura D. A.; Di Ruberto C.; Loddo A.
2026-01-01

Abstract

Can the integration of vision and language representations advance artificial intelligence methods for automated white blood cell (WBC) analysis across heterogeneous clinical conditions? Motivated by this question, we present WBC-CLIP, a dual-encoder framework that enhances WBC classification and analysis by combining image data with rich textual descriptions derived from quantitative morphological features. Our method leverages multiple large language models to convert numerical and categorical cell attributes into diverse, semantically enriched textual descriptions. These captions are jointly embedded with their corresponding WBC images using a contrastive learning strategy inspired by the CLIP architecture, enabling the model to learn stable and meaningful cross-modal associations. We evaluate WBC-CLIP through zero-shot classification and image–text retrieval tasks across both in-distribution and out-of-distribution datasets. The framework advances automated WBC analysis while providing improved explainability by explicitly grounding visual representations in morphology-aware textual descriptors, addressing key challenges in computer-aided diagnostics.
2026
Contrastive learning; Deep learning; Image-text retrieval; LLM; White blood cell analysis; Zero-shot classification
Files in this item:
File: 2026_IMAVIS_WBC-CLIP_A multimodal vision-language framework for morphology aware white blood cell analysis.pdf
Access: open access
Description: Full article
Type: publisher's version (VoR)
Size: 2.69 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11584/482225