Abstract
Vision-language foundation models (VLMs) have shown impressive performance in guiding image generation through text, with emerging applications in medical imaging. In this work, we are the first to investigate the question: 'Can fine-tuned foundation models help identify critical, and possibly unknown, data properties?' Evaluating our proposed method on a chest X-ray dataset, we show that these models generate high-resolution, precisely edited images that outperform methods relying on Structural Causal Models (SCMs) across numerous metrics. For the first time, we demonstrate that fine-tuned VLMs can reveal hidden data relationships that were previously obscured by limited metadata granularity and model capacity. Our experiments demonstrate both the potential of these models to reveal underlying dataset properties and their limitations, namely imprecise image edits and susceptibility to biases and spurious correlations.
1. Introduction
Dataset biases present a significant challenge to developing trustworthy machine learning models for healthcare applications. These biases can manifest as spurious correlations, i.e., misleading patterns between attributes of the data, which cause a model to learn associations that do not reflect true clinical relationships. For instance, algorithms trained on chest X-rays have learned to classify pneumothorax based on the presence of a chest drain; since the drain is inserted after pneumothorax is diagnosed, in order to treat it, it does not reflect true pathological markers of the disease.
Causality-based approaches aim to answer causal queries and consequently overcome spurious correlations by explicitly enforcing attribute relationships within a causal framework. These methods incorporate causal principles by embedding Structural Causal Models (SCMs) into the generation process and adhering to Pearl's Causal Hierarchy Theorem (CHT). However, many SCM-based methods still require external classifiers to function effectively. Moreover, despite their theoretical rigor, the assumptions these methods require for validity often become unrealistic in real-world applications.
Foundation models, trained on vast and diverse datasets, have demonstrated remarkable generative capabilities in both computer vision and medical imaging. Two breakthroughs are particularly relevant: (i) enhanced generation capabilities that enable images of unprecedented resolution, and (ii) precise, targeted editing of specific image attributes.
In this work, we explore the strengths and limitations of fine-tuned VLMs in generating high-resolution, realistic medical images while analyzing their dependence on, and ability to reveal, underlying dataset properties.

2. Methodology
2.1 Background
Structural Causal Models & Counterfactual Generation: Our work investigates how fine-tuned vision-language foundation models can generate high-resolution, faithful counterfactual images. To provide context, we first discuss counterfactual generation using Structural Causal Models (SCMs). SCMs enable counterfactual reasoning for answering questions such as: what would image X look like if attribute A_i = a had instead been A_i = a'?
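For reference, a minimal sketch of Pearl's three-step counterfactual procedure that SCM-based generators follow; the notation here is generic rather than tied to any particular baseline:

```latex
% Counterfactual query in an SCM M with exogenous noise U,
% observed image X, and attributes A:
\begin{align*}
\textbf{Abduction:}\;  & \text{infer the posterior } P(U \mid X = x,\, A = a), \\
\textbf{Action:}\;     & \text{apply the intervention } do(A_i = a'),
                         \text{ obtaining the modified model } M_{a'}, \\
\textbf{Prediction:}\; & \text{generate } x' \text{ under } M_{a'}
                         \text{ using } P(U \mid X = x,\, A = a).
\end{align*}
```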
Stable Diffusion, our chosen generative model for counterfactual generation, comprises four components: (i) an image encoder that transforms input images into low-dimensional latent representations; (ii) a CLIP text encoder that provides text conditioning to guide generation; (iii) a U-Net denoiser that forms the core of the reverse diffusion process; and (iv) an image decoder that converts denoised latents back to the original image space.
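These four components can be inspected directly in the open-source diffusers library; the sketch below uses a public checkpoint ID as a stand-in for the fine-tuned model used in this work:

```python
# Illustrative only: the model ID and dtype are assumptions,
# not the exact fine-tuned checkpoint from this paper.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
vae = pipe.vae                    # (i) image encoder and (iv) image decoder (VAE)
text_encoder = pipe.text_encoder  # (ii) CLIP text encoder for conditioning
unet = pipe.unet                  # (iii) U-Net denoiser (reverse diffusion core)
```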
2.2 Training & Inference using Stable Diffusion
Unlike SCM-based approaches that explicitly enforce causal relationships through predefined graphs, our method operates without imposing assumptions about attribute dependencies. We leverage language guidance as our conditioning mechanism, converting attribute labels into textual descriptions.
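As an illustration, a hypothetical templating function for this conversion; the field names and caption phrasing are our own assumptions, not the exact scheme used in this work:

```python
# Hypothetical label-to-caption template for text conditioning.
def attributes_to_prompt(age: int, sex: str, findings: list[str]) -> str:
    """Turn metadata labels into a caption used to condition generation."""
    finding_str = ", ".join(findings) if findings else "no finding"
    return f"chest x-ray of a {age}-year-old {sex} patient with {finding_str}"

# Example:
# attributes_to_prompt(64, "male", ["pleural effusion", "pacemaker"])
# -> "chest x-ray of a 64-year-old male patient with pleural effusion, pacemaker"
```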
During inference, the fine-tuned model generates counterfactual medical images in a zero-shot manner through a slight modification of the text prompt. In line with interventions in SCMs, counterfactual image generation requires specifying only the attributes we explicitly wish to change.
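A minimal sketch of such prompt-based editing, assuming an img2img variant of the fine-tuned checkpoint via the diffusers library; the checkpoint path, file names, and the strength and guidance values are illustrative, not the paper's exact settings:

```python
# Sketch under stated assumptions; not the paper's exact inference pipeline.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "path/to/fine-tuned-checkpoint", torch_dtype=torch.float16
).to("cuda")

xray = Image.open("factual_xray.png").convert("RGB")
# Intervene only on the attribute we wish to change (e.g. remove a finding).
cf = pipe(
    prompt="chest x-ray of a 64-year-old male patient with no finding",
    image=xray,
    strength=0.5,        # lower strength preserves more of the original image
    guidance_scale=7.5,  # how strongly the edited prompt steers generation
).images[0]
cf.save("counterfactual_xray.png")
```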
2.3 Evaluation of Synthesized Counterfactuals
We aim to validate the effectiveness of our method based on several key criteria: (i) Precision in High-Fidelity Counterfactual Image Editing; (ii) Comparison with SOTA SCM methods; and (iii) Unveiling Hidden Data Patterns.
For quantitative evaluation, we use metrics such as Perceptual Similarity, Identity Preservation, and Effectiveness to ensure that counterfactual (CF) images meet these criteria.
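A sketch of how two of these checks could be computed, assuming the open-source lpips package for Perceptual Similarity and a pretrained finding classifier clf as a stand-in for Effectiveness; neither is necessarily the exact implementation used here:

```python
# Illustrative evaluation sketch; `clf` is a hypothetical pretrained classifier.
import lpips
import torch

lpips_fn = lpips.LPIPS(net="alex")  # perceptual similarity (lower = closer)

def evaluate_cf(x: torch.Tensor, x_cf: torch.Tensor, clf, target: int) -> dict:
    """x, x_cf: (1, 3, H, W) tensors scaled to [-1, 1]."""
    perceptual = lpips_fn(x, x_cf).item()
    # Effectiveness: does a classifier agree the intervened attribute changed?
    effective = clf(x_cf).argmax(dim=1).item() == target
    return {"perceptual_similarity": perceptual, "effectiveness": effective}
```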
3. Results
Our high-resolution results not only agree with the baseline method on which interventions should be made in the image, but also maintain high fidelity to the original image. The baseline method's pre-defined SCM includes an edge between age and pleural effusion, suggesting a potential causal relationship. Interestingly, without the need for an explicit SCM, the fine-tuned VLM likewise demonstrates an effect on pleural effusion when age is modified.

Notably, the fine-tuned vision-language model associates cardiomegaly specifically with pacemakers when applying prompt-based modifications: prompting the model to remove cardiomegaly also removes the pacemaker. This relationship is supported by the literature, which indicates a bidirectional association between cardiomegaly and pacemaker implantation.

4. Conclusion
Our work highlights the potential of fine-tuned vision-language foundation models in identifying critical data biases and spurious correlations. By leveraging their ability to generate high-resolution, precisely edited images through language prompts, these models reveal hidden data patterns that were previously undetectable. This has significant implications for the development of VLM-based methods in healthcare, where understanding dataset biases is crucial for building robust, trustworthy, and clinically deployable models. Our findings also underscore the limitations of fine-tuned VLMs, including their susceptibility to spurious correlations. Future work will focus on integrating causal reasoning to mitigate the reliance on biases and enhance the reliability of these models in clinical applications.