PRISM: Paper Walkthrough

Audio Paper Walkthrough

Listen to our researchers explain key concepts and findings from the PRISM paper. The transcript will highlight automatically as the audio plays. You can also click on any part of the transcript to jump to that section.

Note: This audio was generated using NotebookLM, an AI tool. The content represents our research, but the voices are synthetically generated. The transcript was also generated using a voice-to-text AI tool.


Speaker 1 [0:00]
When we talk about AI in medicine, you know, looking at all those scans and X-rays, it's easy to get really excited about the potential. Oh, absolutely. But when you start to peel back the layers, yeah, there are some, some real challenges. And I think what it boils down to is that the AI, right now at least, could be getting tricked in a way, right, mistaking a medical device, something that's commonly seen in an image, as a sign of a disease. Yeah,

Speaker 2 [0:29]
it's all about patterns, right? It's learning these patterns, but sometimes those patterns, they're not actually medically relevant. It's really fascinating, though, how these, what we call spurious correlations, can actually throw a wrench into the whole reliability of AI in healthcare. Yeah. I mean, imagine an AI model that repeatedly sees a certain device in X-rays of patients with, let's say, a specific heart condition. Okay, it might, it might start to make this incorrect connection and think, oh, the presence of that device means that heart condition, right? Even if there's no real biological link there, yeah,

Speaker 1 [1:03]
it's just, it's just what it's seen a bunch of times. Yeah, exactly. So that leads us to this deep dive that we're doing today. Absolutely. We're looking at a research paper, okay, titled "PRISM: High-Resolution and Precise Counterfactual Medical Image Generation using Language-guided Stable Diffusion." Wow, that's a mouthful. It is a mouthful. Yeah, but this is coming out of MIDL 2025. Oh, very cool. And the researchers are Amar Kumar, Anita Kriz, Mohammad Havaei and Tal Arbel from McGill University, Mila and Google Research. Okay, so our mission for you listening today is to really understand how this new AI framework, PRISM, yeah, is kind of tackling these, these really tricky issues of bias and reliability in medical AI, right, right? And it really all hinges on this idea of generating these really realistic medical images, but with specific alterations. Okay?

Speaker 2 [1:58]
So it's almost like, um, almost like training the AI to, sort of, like, I don't know, like, put on these special glasses, yes, that help it, to see the actual important medical information and ignore all the irrelevant stuff the noise. Yeah, like that analogy. So,

Speaker 1 [2:13]
so let's unpack this term a little bit. Okay, counterfactual images. I think, in the simplest terms, yeah, what we're talking about are images that show, you know, what would have been there if something was different. Oh, okay. You have an X-ray and it has a pacemaker in it, right? A counterfactual image would be like, what would that X-ray look like without the pacemaker? Or what if, you know, a patient didn't have a certain disease? Yeah, you could generate an image of, you know, that

Speaker 1 [2:41]
scenario, like a

Speaker 2 [2:42]
what if scenario. It is a what if scenario for medical imaging. That's really cool. Yeah. So where are we getting all this information from? What's the source material for our deep dive today?

Speaker 1 [2:53]
Yeah. So the basis for this entire deep dive is a full research paper. It was submitted to the Medical Imaging with Deep Learning conference. That's MIDL. MIDL, okay, in 2025, so this is, we are right on

Speaker 1 [3:07]
the cutting edge here. So

Speaker 2 [3:09]
fresh off the press, so fresh, yeah. So let's really kind of dig into this problem, yeah, that PRISM is trying to address. I know we touched on it briefly, but understanding kind of the core challenges is, is key to really appreciating why this is so significant. Okay, so deep learning in medical imaging, right? Incredibly powerful, sure, but it's not perfect. It has its limits,

Speaker 1 [3:38]
yeah, and one of

Speaker 1 [3:40]
the limitations that we bump into a lot is, as we've been talking about these spurious correlations, yeah, where the AI, you know, it grabs onto these patterns in the data, but they're just coincidences, right? They don't actually tell us about the medical condition, yeah? Like, for instance, let's say there's a hospital system, right, that always uses a specific brand of, I don't know, a heart valve, right, on patients with a certain type of heart disease, okay, well, the AI might mistakenly learn that that heart valve brand, like the device itself, is somehow linked to the disease, right? Even though there's no real medical connection there,

Speaker 2 [4:13]
it's just that that hospital happens to use that brand, yeah, so that's one challenge. Then you've also got the issue of data imbalances, okay, some medical conditions are, well, thankfully they're less common, right? Which means the AI doesn't have as many examples to learn from, yeah, which makes its job a lot harder. Makes sense. And then on top of that, many medical image data sets, they just don't have these detailed text descriptions. They might just say, you know, disease present, disease absent very basic stuff, right? But they lack that rich contextual information that could really help an AI understand what's going on in the images, right?

Speaker 2 [4:54]
The nuance, exactly the nuance. So

Speaker 1 [4:57]
now, researchers have looked into this idea of generating counterfactual images before, right? Yeah, that's right, but, but there's a catch. There's a catch. There's a bit of a paradox here. A paradox? Tell me more. Because a lot of those approaches in the past, things like the classifier-guided methods or methods based on structural causal models, yeah, often they rely on the same potentially flawed data, right, that they're trying

Speaker 1 [5:21]
to correct for in the first place

Speaker 2 [5:22]
I see the issue. So if your initial AI, the one that's guiding this whole counterfactual generation process, already has a bias, right? Let's say it's associating that heart valve with a specific heart condition, yeah. Even when you ask it to create an image without the heart condition, right, it might still remove the heart valve, right? Because it thinks those two things are always connected, yeah, even though, in reality, they might not be. It's learned the wrong thing exactly. And then there's the technical side of things, yeah. It's not easy to build these AI systems, these end-to-end systems for medical imaging. They face this constant balancing act, because to generate a really detailed, you know, high quality image, they need to look at the tiny little details, the pixels, right? But to actually classify what's in that image, yeah, they need to understand these bigger, more abstract concepts, right, and juggling those two things at the same time, yeah, that's not easy. That's

Speaker 1 [6:22]
really tough. And, on top of that, all right, you have the challenge of creating a high-resolution image, yeah, but also making sure it's precise.

Speaker 2 [6:30]
Precise. What does that mean in this context? Yeah, so precise

Speaker 1 [6:34]
means that it only alters the specific thing that you want to change without messing with the rest of the image. Got it, and that has been a big technical hurdle. Makes sense? Okay, so we've talked about the challenges. Let's talk about PRISM and their approach. Okay, yeah, let's hear it. So they are harnessing what they call foundation models. Foundation models. What are those? So foundation models are these very large vision-language AI models that have been trained on, like, massive amounts of data, oh, wow, images and text, and so that gives them this really broad understanding of visual concepts, okay, and how those relate to language. So what they do is they take stable diffusion, okay, yeah, I've heard of that, which is a really advanced model for image generation, all right, and they adapt it to medical imaging, okay. So you can kind of think of stable diffusion as, like, this really talented artist, okay, that can create photorealistic images from text descriptions, yeah? And PRISM is basically saying, okay, artist, now focus on medical images. Oh,

Speaker 2 [7:38]
I like that. It's like specializing, yeah? Specializing that

Speaker 1 [7:41]
artist. Yeah. So the key innovation, the breakthrough here, is that it can use language, okay, to guide the generation, okay, of these high-resolution 512 by 512 pixel medical counterfactual images. Wow. With a level of precision we haven't really seen before,

Speaker 2 [8:00]
that's pretty amazing, especially at that resolution. Yeah, thinking back to what you were saying about previous attempts, yeah, things like, what was it? BiomedJourney. Yeah, BiomedJourney. They could generate counterfactuals using language, but they were kind of stuck with these lower resolution images, and they weren't really built to deal with these big, clunky artifacts, like medical devices, yeah, we've been talking about, right? And then there was RadEdit, right, which also used language for editing, but it relied on these manually drawn masks to define where the edits should happen, right? And that limited its ability to create truly free-flowing, you know, unconstrained counterfactuals, yeah? Because

Speaker 1 [8:41]
you have to, you're kind of pre-defining where the changes can happen. Exactly, yeah. But

Speaker 2 [8:45]
PRISM, it sidesteps all of that. It lets you make precise, targeted edits without needing those masks, yeah,

Speaker 1 [8:53]
so going back to this, this issue of limited text in these datasets, right, right? How does PRISM get around that?

Speaker 2 [8:59]
Well, what it does is, it's actually really clever. Okay? It takes those basic binary labels that you typically see in medical datasets, you know, yes or no for a disease, right? And it turns them into actual descriptive text captions. Oh, wow, yeah. It's like giving the AI something it can actually read. So walk me through an example. Okay, so let's say you have an X-ray, and it's labeled as showing pleural effusion, okay, that's fluid around the lungs, right? And cardiomegaly, which means an enlarged heart, okay? PRISM would take those labels and transform them into a caption like "chest X-ray of a patient showing pleural effusion, cardiomegaly," okay, or if it was an image with no problems, it might say "normal chest X-ray with no significant findings," okay, so

Speaker 1 [9:48]
it's creating more descriptive, yeah, almost human readable, yeah, labels

Speaker 2 [9:51]
exactly, and that allows it to really tap into those language understanding capabilities of this foundation model. Okay,
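The label-to-caption templating just described can be sketched in a few lines of Python (a hypothetical illustration; the exact templates used in the paper may differ):

```python
def labels_to_caption(findings):
    """Turn binary finding labels into a descriptive caption.

    `findings` lists the condition names marked positive for an image;
    an empty list is treated as a normal study. The wording mirrors the
    examples above, not necessarily the paper's exact templates.
    """
    if not findings:
        return "Normal chest X-ray with no significant findings."
    return "Chest X-ray of a patient showing " + ", ".join(findings) + "."

print(labels_to_caption(["pleural effusion", "cardiomegaly"]))
# Chest X-ray of a patient showing pleural effusion, cardiomegaly.
```

The resulting caption is then what gets encoded into the language embedding that conditions image generation.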

Speaker 1 [9:58]
so that's really neat, how it kind of translates that, yeah. So how does it actually work? How does PRISM actually generate those images? All right, so

Speaker 2 [10:05]
let's break it down into three main steps. Okay, first, as we just talked about, it takes that tabular data, like the disease labels, and converts it into those text captions, right? It uses templates for this and also something called CLIP. CLIP, yeah,

Speaker 1 [10:19]
C-L-I-P, which stands for Contrastive Language-Image Pre-training, okay? And CLIP helps to create what are called language embeddings, okay, which are basically numerical representations of the text that the image generation model can understand.

Speaker 2 [10:33]
Okay, they need to translate that into numbers, right? Exactly. Needs

Speaker 1 [10:37]
to speak the language of math, yeah. Okay, so that's step one. Step two is fine-tuning. Okay, we take stable diffusion and we train it on a massive collection of medical images, right? In this particular study, they use the CheXpert dataset, okay, which has tons of chest X-rays, okay. Now, within stable diffusion, there are a few key parts working together. Okay, there's the VAE, the variational autoencoder, right, which basically compresses the image information and it can reconstruct it later. Okay. Then you've got the U-Net, which operates on that compressed information, and it learns how to remove noise,

Speaker 2 [11:14]
and that noise removal process, that's the heart of generating new images, okay. And then, of course, you've got that CLIP encoder we talked about earlier, which brings in the text understanding. Right now, the important thing here is that during this fine-tuning process, only the U-Net is actually trained. Oh, okay. The CLIP model, it's already pre-trained, right? It already has this general understanding of how images and text relate. It's coming in with some knowledge. Exactly. So they're not starting from scratch, okay, they're building upon this foundation. Makes sense. So step three. Step three is the magic moment. Okay, this is where you actually generate the counterfactual image. You use that fine-tuned model, and you give it a modified text prompt, okay. So let's say you have an X-ray with a medical device, right? And the original prompt was something like "chest X-ray showing a central line." You could change that prompt to "chest X-ray without a central line," okay? And then PRISM uses a technique called DDIM, yeah, which is short for denoising diffusion implicit models, okay, along with something called null-text inversion. Interesting. Now, without getting too technical, the main goal of these techniques is to make sure that the new image that's generated, right, it stays true to the original, okay. It only changes the specific thing you asked for in the new prompt. Yeah. It's all about controlled editing.
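For a flavor of the DDIM sampling mentioned here, the deterministic update at one timestep can be sketched with NumPy (a toy illustration under standard DDIM notation, not the paper's implementation; in practice `eps` would come from the fine-tuned U-Net):

```python
import numpy as np

def ddim_step(x_t, eps, alpha_t, alpha_prev):
    """One deterministic DDIM denoising step (eta = 0).

    x_t: the current noisy latent; eps: the noise predicted for it;
    alpha_t / alpha_prev: cumulative noise-schedule terms for the
    current and previous timesteps.
    """
    # Estimate the clean latent implied by the noise prediction.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_t) * eps) / np.sqrt(alpha_t)
    # Step deterministically toward the less-noisy timestep.
    return np.sqrt(alpha_prev) * x0_pred + np.sqrt(1.0 - alpha_prev) * eps
```

Because the update is deterministic, it can also be run in reverse to recover the noise that reproduces a given image, which is what makes faithful prompt-based editing of an existing X-ray possible.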

Speaker 1 [12:37]
So how does the text prompt, you know, how does it actually guide that image generation?

Speaker 2 [12:43]
Okay, so think about it this way. Okay, inside the U-Net, there are these things called cross-attention modules. Cross-attention modules, yeah. These modules allow the model to focus on different parts of the text prompt, okay, as it's generating different parts of the image. Okay, so if the prompt says "no pleural effusion," right, the attention mechanism will zero in on the lung areas in the image, okay, and make sure that there's no fluid in those regions. Okay? It's like the prompt is highlighting what to pay attention to. It's like a conductor,

Speaker 1 [13:12]
yeah.
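The cross-attention described here boils down to image features querying text-token embeddings. A minimal single-head sketch (projection matrices omitted for brevity; all names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_feats, text_embeds):
    """Toy single-head cross-attention: spatial image features (queries)
    attend over text-token embeddings (keys and values)."""
    d = image_feats.shape[-1]
    scores = image_feats @ text_embeds.T / np.sqrt(d)  # (pixels, tokens)
    weights = softmax(scores, axis=-1)                 # where each pixel "looks"
    return weights @ text_embeds                       # text-conditioned features
```

The attention weights are exactly the "highlighting" mentioned above: for a prompt token like "effusion," they concentrate on the image regions that token should influence.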

Speaker 1 [13:14]
Okay, so this is really helpful. We're kind of getting under the hood of how this all works, yeah. So the big question,

Speaker 2 [13:21]
did it work, right? Does it actually do what it claims to do? What were the results? Okay, so they did a lot of testing. Okay, their main experiments were done on that big CheXpert dataset we talked about, right, which, as you mentioned, has over 200,000 chest X-rays and labels for 14 different medical conditions, right? But they didn't stop there, okay. They also did experiments on dermoscopic images. Okay, those are images of skin lesions, okay? And they got these from the ISIC dataset, okay, just to show that this approach isn't just limited to chest X-rays, right? It's more general. Exactly. It's more broadly applicable.

Speaker 1 [13:58]
So you mentioned earlier that they compared PRISM to other existing methods.

Speaker 2 [14:02]
Oh, yeah, that's super important, right? You got to see how it stacks up against the competition, yeah. So they used GANterfactual as a benchmark,

Speaker 1 [14:11]
okay? And remind us, what is GANterfactual?

Speaker 2 [14:15]
So GANterfactual is a model that's based on generative adversarial networks, or GANs for short, okay. And it uses a classifier to guide the process of generating these counterfactual images, okay. But there were a couple of problems they found with GANterfactual, okay. First, it needs those lower resolution images, okay. It can only handle images that are 224 by 224 pixels, okay. And second, it wasn't very good at cleanly removing medical devices. Okay? You can actually see this in Figure 3b of the paper. So

Speaker 1 [14:47]
how did they measure, like, objectively measure, how well PRISM was doing, right? You

Speaker 2 [14:52]
need some solid metrics to compare them. Yeah. So they used a couple of key metrics. Okay? One was subject identity preservation, okay. Subject identity preservation, yeah. And this is measured using what's called the L1 distance, okay. Basically, you're calculating the difference between the original image, right, and the generated counterfactual image, okay, and a lower L1 distance means that the generated image looks more like the original, with only those intended modifications, right? So when it came to removing devices, yeah, PRISM got a much lower L1 distance than the baseline. Okay,

Speaker 1 [15:24]
so it was better at preserving the identity of the original image. Yeah, it was

Speaker 2 [15:29]
making those targeted changes without distorting the rest of the image. Okay, which is something the previous methods struggled with, okay? And what was the other metric? Okay, the other big one was counterfactual prediction gain. Counterfactual prediction gain, yeah, or CPG for short, okay. And CPG, it measures how much the AI's classification of the image changes, okay, after you make that counterfactual modification, right? So a higher CPG score means that the generated image is more likely to be classified correctly as the counterfactual case. Okay. So for example, if you remove a device, right, a higher CPG means the AI is more likely to recognize that the device is gone. Oh, okay. And again, PRISM did better here too. It got a higher CPG score than the baseline. Okay,
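The two metrics as described can be written down in a few lines (simplified sketches; the paper's exact formulations may differ, and `classifier` here is a hypothetical stand-in returning per-class probabilities):

```python
import numpy as np

def l1_distance(original, counterfactual):
    """Mean absolute pixel difference.
    Lower = better subject identity preservation."""
    return np.mean(np.abs(original - counterfactual))

def cpg(classifier, original, counterfactual, target_class):
    """Counterfactual prediction gain: how much the classifier's
    probability for the target (counterfactual) class rises after the
    edit. Higher = the edit is more clearly reflected in the image."""
    return (classifier(counterfactual)[target_class]
            - classifier(original)[target_class])
```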

Speaker 1 [16:16]
so it was more effective at generating images that reflected the change you wanted to make. Exactly. Yeah, that's pretty cool. Yeah. So those are the numbers. What about, you know, subjectively, what the images actually look like,

Speaker 2 [16:27]
right? The numbers are great, yeah, but seeing is believing, yeah, exactly. So some of the visual examples in the paper are really striking, okay. Like Figure 3, for instance, it clearly shows how PRISM can cleanly remove these complex medical devices, yeah, things like wires, pacemakers, from X-rays, without affecting any signs of disease that might be present. Okay, and you can really see it, the devices just vanish, but the actual medical condition in the image remains untouched. So it's not, it's not like blurring it out? No, no, no, it's like they were never there. Yeah, and it's not just removing things, okay. PRISM can also add devices back into the images. Oh, wow, yeah, it's pretty impressive. Okay? And the baseline method, it really struggled to do these device edits effectively, okay?

Speaker 1 [17:16]
So this really addresses that issue of spurious correlations.

Speaker 2 [17:20]
Exactly. It gets right to the heart of that problem, yeah. What about

Speaker 1 [17:23]
modifying, like the disease itself, like the appearance of the disease,

Speaker 2 [17:27]
yeah. So for that, you want to look at Figure 4. Okay. This figure shows how PRISM can generate these realistic changes related to specific diseases, okay, like pleural effusion and cardiomegaly, right? They even provide these difference maps, okay, which highlight the areas in the image where the modification happened. Okay, so you can see the fluid from pleural effusion being removed, okay, or the heart size decreasing in the cardiomegaly examples, okay. And the key thing here is that all the surrounding anatomy, yeah, all the other stuff in the image, it stays consistent.

Speaker 1 [18:03]
So it's, it's changing it in a way that makes sense. Exactly. It's not

Speaker 2 [18:06]
just like cutting and pasting, right? It's making changes that are anatomically plausible, yeah, which is super important for medical applications, absolutely. And one of the most exciting findings, in my opinion, is in Table 3, okay? Table 3. What they did was they took an existing medical image classifier, okay, and they added counterfactual images generated by PRISM to the training data. And the result? Yeah, the classifier got significantly better at diagnosing several conditions. Oh, wow. Things like pleural effusion, cardiomegaly, right? And even the presence of those support devices. So

Speaker 1 [18:45]
it's almost like, by training on these what-if images, yeah, exactly, it's able to pick out the real features of the disease.

Speaker 2 [18:51]
Yes, it's learning the true signal, yeah, independent of all that noise and confounding factors. Yeah, they even compared it to just adding generic images generated by stable diffusion to the training data, and the improvements were much smaller. Oh, wow. So that really highlights the specific value of PRISM's targeted approach.
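The augmentation experiment just discussed can be sketched at a high level (every name here is a hypothetical placeholder, not the paper's API):

```python
def augment_with_counterfactuals(train_set, generate, edits):
    """Illustrative augmentation loop: for each (image, label) pair,
    add counterfactual versions generated under edited prompts, paired
    with the labels those edits imply."""
    augmented = list(train_set)
    for image, label in train_set:
        for prompt, new_label in edits(label):
            augmented.append((generate(image, prompt), new_label))
    return augmented
```

The classifier is then retrained on the augmented set; because each counterfactual differs from its original only in the targeted finding, the extra pairs push the classifier toward true disease features rather than confounders.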

Speaker 1 [19:11]
So it's not just about creating cool images, it's about actually improving the AI systems downstream, making

Speaker 2 [19:17]
those diagnoses more accurate. So any other kind of deeper insights that came out of this research? Oh, there are a few more interesting things. Okay, so in Appendix F, they talk about how they validated the image modifications using something called VQA models. VQA, what is that? VQA stands for Visual Question Answering. Oh, okay, so these are models like, for example, Claude 3.5 Sonnet and LLaVA-Med, okay. These are state-of-the-art AI systems. Oh, yeah. And what they found was that these VQA models could correctly identify the presence or absence of medical conditions in both the original images and the counterfactual images generated by PRISM. So it's almost like getting confirmation from another AI. Yes, like a second opinion from a really smart colleague. And then in Appendix G, there's this fascinating observation about cardiomegaly and pacemakers. Okay? So they noticed that when PRISM generated a counterfactual image without cardiomegaly, it often also removed the pacemaker, but it didn't do this when generating an image without pleural effusion, which suggests that the model is actually learning these complex relationships, right, that exist in the medical data.

Speaker 1 [20:28]
It's understanding that those two things are often seen together exactly.

Speaker 2 [20:31]
It's not just blindly following the text prompts. It's picking up on these patterns, yeah. And finally, in Appendix H, they talk about some of the really tough scenarios, okay, like trying to remove devices that are hidden by shadows, yeah, or that are overlapping with bones, yeah? And they found that PRISM was surprisingly good at handling these tricky cases. Oh, okay. It was able to fill in those missing parts of the image with anatomically realistic structures. Really? Yeah, it's pretty remarkable.

Unknown Speaker [21:00]
Okay, so of course, no technology is perfect, yeah, there's

Speaker 1 [21:03]
always room for improvement. What are some of the limitations? Okay,

Speaker 2 [21:07]
so in Appendix I, they outline a few limitations. Okay, one is that PRISM, like other models that use stable diffusion, can have a bit of trouble accurately reproducing those tiny text labels, okay, that you sometimes see on medical radiographs, especially if they're in the corners, right. And then in very complex cases where there's a lot of overlap between devices and anatomy, yeah, or if the original image is already distorted in some way, right, the generated edits might not always be exactly what you expect. But overall, it's a very promising technology. Yeah,

Speaker 1 [21:42]
it's important to, you know, present both sides of it. Oh, absolutely gotta be realistic. So for our listener, yeah, what is the kind of the key takeaway from this deep dive? Well,

Speaker 2 [21:51]
I think the big takeaway is that PRISM is a major step forward in how we can use AI to understand and address bias in medical imaging data, yeah. And by creating these high-resolution, precise counterfactual images using language as a guide, right, it's giving us this powerful new tool for developing more reliable AI systems for use in healthcare, yeah,

Speaker 1 [22:15]
so what are the implications of this for the real world? Oh, the implications

Speaker 2 [22:18]
are huge. Okay. I mean, PRISM could be a game changer for developing better diagnostic AI models, okay, models that aren't so easily fooled by those misleading correlations, right? It can also help researchers understand diseases better by letting them isolate and manipulate specific features in the images. Yeah, and in the future, you know, down the line? Yeah, I could even see it being used for personalized medicine. Oh, how so? Well, imagine clinicians being able to explore these what-if scenarios, right, related to a patient's specific medical images, to help them make more informed treatment

Speaker 1 [22:56]
decisions? Yeah, that's

Speaker 1 [22:58]
That's really incredible to think about. It's exciting stuff. So for you know, our listeners out there who are in this field, yeah, in the medical imaging AI field, yeah. Want to learn more.