Abstract
Medical imaging systems increasingly rely on large vision-language foundation models (VLFMs) trained on diverse biomedical corpora, yet these models remain difficult to adapt to new clinical tasks without costly fine-tuning and large annotated datasets. We present PIKACHU, a lightweight, generalizable framework that enables rapid few-shot adaptation of frozen medical foundation models from only a handful of labeled examples. PIKACHU performs all task adaptation directly in the foundation model's feature space through in-context prototypical reasoning: it constructs class prototypes by averaging normalized embeddings and predicts via temperature-scaled cosine similarity. Only a single temperature parameter is learned.
Method
PIKACHU operates entirely in the frozen foundation model feature space. Given a small support set of labeled images, the framework classifies a query image through three sequential stages:
Feature Extraction
Query and support images are encoded by a frozen pretrained backbone. Embeddings are L2-normalized for geometric stability.
Prototype Computation
Class prototypes are formed by mean-pooling the normalized support embeddings for each class, then re-normalizing.
Similarity & Prediction
Temperature-scaled cosine similarity between query and prototypes yields class probabilities via softmax. Predicted label is the argmax.
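The three stages above can be sketched directly in PyTorch. This is a minimal illustration, not the released implementation; the function name and shapes are our own, and the embeddings are assumed to come from any frozen backbone:

```python
import torch
import torch.nn.functional as F

def prototypical_classify(query_emb, support_embs, support_labels, n_classes, tau=0.07):
    """Classify query embeddings against class prototypes from a support set.

    query_emb:      (Q, D) raw backbone embeddings of query images
    support_embs:   (S, D) raw backbone embeddings of support images
    support_labels: (S,) integer class labels for the support set
    tau:            temperature scaling the cosine similarities
    """
    # Stage 1: L2-normalize all embeddings for geometric stability
    q = F.normalize(query_emb, dim=-1)
    s = F.normalize(support_embs, dim=-1)

    # Stage 2: mean-pool normalized support embeddings per class, then re-normalize
    protos = torch.stack([s[support_labels == c].mean(dim=0) for c in range(n_classes)])
    protos = F.normalize(protos, dim=-1)

    # Stage 3: temperature-scaled cosine similarity -> softmax -> argmax
    logits = (q @ protos.T) / tau
    probs = logits.softmax(dim=-1)
    return probs.argmax(dim=-1), probs
```

Because normalized vectors have unit norm, the matrix product in stage 3 is exactly the cosine similarity between each query and each prototype.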
The only trainable parameter is the log-temperature τ, optimized with Adam (lr = 1×10⁻⁴) over few-shot episodes. Backbone weights are never modified, which eliminates catastrophic forgetting and makes inference feasible on CPU.
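A single episodic update then looks like the following sketch. The synthetic prototypes and queries stand in for frozen-backbone embeddings (dimension and class count are illustrative); parametrizing the log of τ keeps the temperature positive without constraints:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# The single trainable parameter: log-temperature (exp() keeps tau > 0)
log_tau = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.Adam([log_tau], lr=1e-4)

# Synthetic episode standing in for frozen-backbone embeddings (D=16, 2 classes)
protos = F.normalize(torch.randn(2, 16), dim=-1)   # class prototypes
queries = F.normalize(torch.randn(8, 16), dim=-1)  # query embeddings
labels = torch.randint(0, 2, (8,))                 # query labels

# One episodic step: cross-entropy over temperature-scaled cosine logits
logits = (queries @ protos.T) / log_tau.exp()
loss = F.cross_entropy(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

No gradient ever reaches the backbone, so the optimizer state and the backward pass touch exactly one scalar.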
Evaluated Backbones
PIKACHU is backbone-agnostic. We evaluate four pretrained vision encoders to assess generality across pretraining objectives:
PubMedCLIP
Vision-language model pretrained on large-scale biomedical image–text pairs. Strong domain-specific baseline.
SigLIP
Contrastive VLM trained on massive web-scale data using sigmoid loss. Strong semantic generalization.
DINOv2
Self-supervised vision transformer trained with self-distillation. Highly transferable features across diverse downstream tasks.
ViT-Base
Standard vision transformer trained on generic image datasets. Neutral baseline without domain-specific biases.
Datasets
Three heterogeneous medical imaging datasets are used for evaluation, each formulated as balanced binary classification:
Melanoma vs. Nevi
4,522 test samples per class. Dermatoscopic images with high intra-class variability in color, texture, and morphology.
CNV vs. Normal
1,000 test samples per class. Cross-sectional retinal scans for diagnosing choroidal neovascularization.
No DR vs. Proliferative DR
1,466 test samples per class. Fundus photographs graded for diabetic retinopathy severity.
Results — Baseline vs. ICL
ICL consistently and substantially outperforms the zero-shot baseline across every backbone and dataset. Improvements are most dramatic on OCT and DR.
| Model | Strategy | ISIC (Acc.) | OCT (Acc.) | DR (Acc.) |
|---|---|---|---|---|
| SigLIP | Baseline | 0.49 | 0.50 | 0.50 |
| SigLIP | PIKACHU (ICL) | 0.73 | 0.83 | 0.77 |
| PubMedCLIP | Baseline | 0.50 | 0.50 | 0.40 |
| PubMedCLIP | PIKACHU (ICL) | 0.69 | 0.72 | 0.79 |
| DINOv2 | Baseline | 0.50 | 0.51 | 0.76 |
| DINOv2 | PIKACHU (ICL) | 0.74 | 0.82 | 0.80 |
| ViT | Baseline | 0.50 | 0.52 | 0.39 |
| ViT | PIKACHU (ICL) | 0.69 | 0.83 | 0.81 |
- Largest accuracy gain on OCT: SigLIP, 0.50 → 0.83
- Largest accuracy gain on DR: ViT, 0.39 → 0.81 (PubMedCLIP: 0.40 → 0.79)
- Support set size: steepest gains occur between K = 1 and K = 5
- Trainable parameters: one (the temperature τ)
Comparison with PEFT Methods (SigLIP backbone)
PIKACHU matches the best-performing PEFT methods on all three datasets while training orders of magnitude fewer parameters.
| Method | ISIC | OCT | DR | Trainable Params |
|---|---|---|---|---|
| Tip-Adapter | 0.71 | 0.78 | 0.74 | 1 |
| Proto-Adapter | 0.73 | 0.82 | 0.76 | 99,137 |
| LoRA (rank=8) | 0.72 | 0.83 | 0.76 | 12,289 |
| LoRA (rank=16) | 0.73 | 0.82 | 0.77 | 24,577 |
| PIKACHU | 0.73 | 0.83 | 0.77 | 1 |
Code & Reproducibility
All experiments are implemented in PyTorch. Pretrained weights are loaded from public model hubs and cached locally. All experiments run on a single NVIDIA H100 GPU, though CPU inference is feasible.
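Freezing the pretrained backbone is the only setup step the framework requires. The sketch below shows one way to do this in PyTorch, using a stand-in `nn.Linear` encoder so the snippet is self-contained; in practice the backbone would be loaded from a public hub (e.g. `torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")`, hub name shown for illustration):

```python
import torch
from torch import nn

def freeze(backbone: nn.Module) -> nn.Module:
    """Disable gradients and switch to eval mode so the backbone is feature-only."""
    for p in backbone.parameters():
        p.requires_grad_(False)
    backbone.eval()
    return backbone

# Stand-in encoder; any pretrained nn.Module works the same way
encoder = freeze(nn.Linear(32, 16))

# Inference under no_grad: no autograd graph is built, so CPU inference is cheap
with torch.no_grad():
    feats = encoder(torch.randn(4, 32))
```

Since no backbone gradients are ever computed, the memory footprint at adaptation time is dominated by the forward pass alone.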
→ github.com/Amarkr1/pikachu