WBC-CLIP: A multimodal vision-language framework for morphology aware white blood cell analysis
Zedda L.; Mura D. A.; Di Ruberto C.; Loddo A.
2026-01-01
Abstract
Can the integration of vision and language representations advance artificial intelligence methods for automated white blood cell (WBC) analysis across heterogeneous clinical conditions? Motivated by this question, we present WBC-CLIP, a dual-encoder framework that enhances WBC classification and analysis by combining image data with rich textual descriptions derived from quantitative morphological features. Our method leverages multiple large language models to convert numerical and categorical cell attributes into diverse, semantically enriched textual descriptions. These captions are jointly embedded with their corresponding WBC images using a contrastive learning strategy inspired by the CLIP architecture, enabling the model to learn stable and meaningful cross-modal associations. We evaluate WBC-CLIP through zero-shot classification and image–text retrieval tasks across both in-distribution and out-of-distribution datasets. The framework advances automated WBC analysis while providing improved explainability by explicitly grounding visual representations in morphology-aware textual descriptors, addressing key challenges in computer-aided diagnostics.

| File | Size | Format |
|---|---|---|
| 2026_IMAVIS_WBC-CLIP_A multimodal vision-language framework for morphology aware white blood cell analysis.pdf (open access; description: full article; type: publisher's version, VoR) | 2.69 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


