Abstract
Medical imaging systems increasingly rely on large vision-language foundation models (VLFMs) trained on diverse biomedical corpora, yet these models remain difficult to adapt to new clinical tasks without costly fine-tuning and large annotated datasets. We present PIKACHU, a lightweight, generalizable framework that enables rapid few-shot adaptation of frozen medical foundation models from only a handful of labeled examples. PIKACHU performs all task adaptation directly in the foundation model's feature space through in-context prototypical reasoning: it constructs class prototypes by averaging normalized embeddings and predicts via temperature-scaled cosine similarity. Only a single temperature parameter is learned.
Method
PIKACHU operates entirely in the frozen foundation model feature space. Given a small support set of labeled images, the framework classifies a query image through three sequential stages:
Feature Extraction
Query and support images are encoded by a frozen pretrained backbone. Embeddings are L2-normalized for geometric stability.
Prototype Computation
Class prototypes are formed by mean-pooling the normalized support embeddings for each class, then re-normalizing.
Similarity & Prediction
Temperature-scaled cosine similarity between query and prototypes yields class probabilities via softmax. Predicted label is the argmax.
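The three stages above can be sketched directly in PyTorch. This is a minimal illustration, not the released implementation; the function name and shapes are our own, and the embeddings are assumed to come from any frozen backbone:

```python
import torch
import torch.nn.functional as F

def prototypical_classify(query_emb, support_embs, support_labels, n_classes, tau=0.07):
    """Classify query embeddings against class prototypes from a support set.

    query_emb:      (Q, D) raw backbone embeddings of query images
    support_embs:   (S, D) raw backbone embeddings of support images
    support_labels: (S,) integer class labels for the support set
    tau:            temperature scaling the cosine similarities
    """
    # Stage 1: L2-normalize all embeddings for geometric stability
    q = F.normalize(query_emb, dim=-1)
    s = F.normalize(support_embs, dim=-1)

    # Stage 2: mean-pool normalized support embeddings per class, then re-normalize
    protos = torch.stack([s[support_labels == c].mean(dim=0) for c in range(n_classes)])
    protos = F.normalize(protos, dim=-1)

    # Stage 3: temperature-scaled cosine similarity -> softmax -> argmax
    logits = (q @ protos.T) / tau
    probs = logits.softmax(dim=-1)
    return probs.argmax(dim=-1), probs
```

Because normalized vectors have unit norm, the matrix product in stage 3 is exactly the cosine similarity between each query and each prototype.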
The only trainable parameter is the log-temperature τ, optimized with Adam (lr = 1×10⁻⁴) over few-shot episodes. Backbone weights are never modified, which eliminates catastrophic forgetting and makes inference feasible on CPU.
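A single episodic update then looks like the following sketch. The synthetic prototypes and queries stand in for frozen-backbone embeddings (dimension and class count are illustrative); parametrizing the log of τ keeps the temperature positive without constraints:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# The single trainable parameter: log-temperature (exp() keeps tau > 0)
log_tau = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.Adam([log_tau], lr=1e-4)

# Synthetic episode standing in for frozen-backbone embeddings (D=16, 2 classes)
protos = F.normalize(torch.randn(2, 16), dim=-1)   # class prototypes
queries = F.normalize(torch.randn(8, 16), dim=-1)  # query embeddings
labels = torch.randint(0, 2, (8,))                 # query labels

# One episodic step: cross-entropy over temperature-scaled cosine logits
logits = (queries @ protos.T) / log_tau.exp()
loss = F.cross_entropy(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

No gradient ever reaches the backbone, so the optimizer state and the backward pass touch exactly one scalar.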
Evaluated Backbones
PIKACHU is backbone-agnostic. We evaluate four pretrained vision encoders to assess generality across pretraining objectives:
PubMedCLIP
Vision-language model pretrained on large-scale biomedical image–text pairs. Strong domain-specific baseline.
SigLIP
Contrastive VLM trained on massive web-scale data using sigmoid loss. Strong semantic generalization.
DINOv2
Self-supervised vision transformer trained with self-distillation. Highly transferable features across diverse downstream tasks.
ViT-Base
Standard vision transformer trained on generic image datasets. Neutral baseline without domain-specific biases.
Datasets
Three heterogeneous medical imaging datasets are used for evaluation, each formulated as balanced binary classification:
Melanoma vs. Nevi
4,522 test samples per class. Dermatoscopic images with high intra-class variability in color, texture, and morphology.
CNV vs. Normal
1,000 test samples per class. Cross-sectional retinal scans for diagnosing choroidal neovascularization.
No DR vs. Proliferative DR
1,466 test samples per class. Fundus photographs graded for diabetic retinopathy severity.
Results — Baseline vs. ICL
ICL consistently and substantially outperforms the zero-shot baseline across every backbone and dataset. Improvements are most dramatic on OCT and DR.
| Model | Strategy | ISIC (Acc.) | OCT (Acc.) | DR (Acc.) |
|---|---|---|---|---|
| SigLIP | Baseline | 0.49 | 0.50 | 0.50 |
| SigLIP | PIKACHU (ICL) | 0.73 | 0.83 | 0.77 |
| PubMedCLIP | Baseline | 0.50 | 0.50 | 0.40 |
| PubMedCLIP | PIKACHU (ICL) | 0.69 | 0.72 | 0.79 |
| DINOv2 | Baseline | 0.50 | 0.51 | 0.76 |
| DINOv2 | PIKACHU (ICL) | 0.74 | 0.82 | 0.80 |
| ViT | Baseline | 0.50 | 0.52 | 0.39 |
| ViT | PIKACHU (ICL) | 0.69 | 0.83 | 0.81 |
- Largest accuracy gain on OCT: SigLIP, 0.50 → 0.83
- Largest accuracy gain on DR: ViT, 0.39 → 0.81 (PubMedCLIP: 0.40 → 0.79)
- Support set size: steepest gains occur between K = 1 and K = 5
- Trainable parameters: one (the temperature τ)
Comparison with PEFT Methods (SigLIP backbone)
PIKACHU matches the best-performing PEFT methods on all three datasets while training orders of magnitude fewer parameters.
| Method | ISIC | OCT | DR | Trainable Params |
|---|---|---|---|---|
| Tip-Adapter | 0.71 | 0.78 | 0.74 | 1 |
| Proto-Adapter | 0.73 | 0.82 | 0.76 | 99,137 |
| LoRA (rank=8) | 0.72 | 0.83 | 0.76 | 12,289 |
| LoRA (rank=16) | 0.73 | 0.82 | 0.77 | 24,577 |
| PIKACHU | 0.73 | 0.83 | 0.77 | 1 |
Code & Reproducibility
All experiments are implemented in PyTorch. Pretrained weights are loaded from public model hubs and cached locally. All experiments run on a single NVIDIA H100 GPU, though CPU inference is feasible.
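Freezing the pretrained backbone is the only setup step the framework requires. The sketch below shows one way to do this in PyTorch, using a stand-in `nn.Linear` encoder so the snippet is self-contained; in practice the backbone would be loaded from a public hub (e.g. `torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")`, hub name shown for illustration):

```python
import torch
from torch import nn

def freeze(backbone: nn.Module) -> nn.Module:
    """Disable gradients and switch to eval mode so the backbone is feature-only."""
    for p in backbone.parameters():
        p.requires_grad_(False)
    backbone.eval()
    return backbone

# Stand-in encoder; any pretrained nn.Module works the same way
encoder = freeze(nn.Linear(32, 16))

# Inference under no_grad: no autograd graph is built, so CPU inference is cheap
with torch.no_grad():
    feats = encoder(torch.randn(4, 32))
```

Since no backbone gradients are ever computed, the memory footprint at adaptation time is dominated by the forward pass alone.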
→ github.com/Amarkr1/pikachu