SEER: Skill-Evolving Grounded Reasoning for Free-Text Promptable 3D Medical Image Segmentation

¹Fudan University, ²University of Science and Technology of China
Overview of the SEER framework

Different words. Same clinical intent. Stable 3D masks.
SEER makes free-text promptable 3D medical segmentation robust by grounding clinical language in image evidence and evolving reusable reasoning skills.

Abstract

Free-text promptable 3D medical image segmentation offers an intuitive and clinically flexible interaction paradigm. However, current methods are highly sensitive to linguistic variability: minor changes in phrasing can cause substantial performance degradation despite identical clinical intent. Existing approaches attempt to improve robustness through stronger vision-language fusion or larger vocabularies, yet they lack mechanisms to consistently align ambiguous free-form expressions with anatomically grounded representations.

We propose Skill-Evolving groundEd Reasoning (SEER), a novel framework for free-text promptable 3D medical image segmentation that explicitly bridges linguistic variability and anatomical precision through a reasoning-driven design. First, we curate the SEER-Trace dataset, which pairs raw clinical requests with image-grounded, skill-tagged reasoning traces, establishing a reproducible benchmark. Second, SEER constructs an evidence-aligned target representation via a vision-language reasoning chain that verifies clinical intent against image-derived anatomical evidence, thereby enforcing semantic consistency before voxel-level decoding. Third, we introduce SEER-Loop, a dynamic skill-evolving strategy that distills high-reward reasoning trajectories into reusable skill artifacts and progressively integrates them into subsequent inference, enabling structured self-refinement and improved robustness to diverse linguistic expressions.

Extensive experiments demonstrate superior performance of SEER over state-of-the-art baselines. Under linguistic perturbations, SEER reduces performance variance by 81.94% and improves worst-case Dice by 18.60%.

Experiments and Findings

We evaluate SEER on two out-of-distribution benchmarks and compare it with strong 3D medical segmentation baselines under both native label prompting and realistic free-text clinical prompting.

SOTA free-text prompting performance. SEER achieves the best free-text Dice on both OOD benchmarks: 53.83 on BrainMetShare and 97.39 on PENGWIN.
−81.94% performance variance. Under linguistic perturbations, SEER substantially reduces prediction instability across prompt variants.
+18.60% worst-case Dice. SEER improves worst-case segmentation quality, showing stronger robustness to adverse prompt wording.

Evaluation datasets

BrainMetShare (MRI)

Partial-OOD / domain-shift evaluation. Brain anatomy remains within the seen anatomical domain, while institutional sources and target labels are outside the training coverage.

PENGWIN (CT)

Strict OOD evaluation. Both pelvic anatomy and pelvic bone target labels are absent from SEER-Trace reasoning supervision and target coverage.

Prompting protocols

1. Label prompting mode

Baselines are evaluated using their natively supported, predefined label prompting interfaces.

2. Free-text prompting mode

Baselines are evaluated using realistic, linguistically diverse clinical requests introduced in SEER-Trace.

Quantitative results

Performance comparison under label and free-text prompting modes. Higher Dice and Worst Dice are better; lower standard deviation indicates better robustness across free-text prompt variants.

Dataset        Method         Label Dice ↑  Free-text Dice ↑  Free-text Worst Dice ↑  Free-text Std. ↓
BrainMetShare  SAT            22.16         0.69              0.00                    2.53
BrainMetShare  BiomedParseV2  18.66         2.53              0.00                    7.27
BrainMetShare  Text3DSAM      0.10          0.41              0.00                    0.93
BrainMetShare  MedSAM3        11.33         16.62             10.56                   5.17
BrainMetShare  VoxTell        48.19         52.15             46.71                   3.35
BrainMetShare  SEER (Ours)    51.70         53.83             51.44                   1.67
PENGWIN        SAT            96.05         0.01              0.00                    0.13
PENGWIN        BiomedParseV2  1.35          8.53              0.00                    7.50
PENGWIN        Text3DSAM      24.75         0.01              0.00                    0.16
PENGWIN        MedSAM3        18.26         5.75              3.67                    6.40
PENGWIN        VoxTell        97.59         92.26             79.34                   7.49
PENGWIN        SEER (Ours)    97.56         97.39             95.47                   0.98
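
The three free-text metrics reported above (Dice, Worst Dice, Std.) can be illustrated with a minimal sketch. This page does not spell out the aggregation, so the code assumes that each test case is segmented once per prompt variant, that the worst Dice and the standard deviation are taken per case across variants, and that both are then averaged over cases. The function names `dice_score` and `prompt_robustness`, the population standard deviation, and the convention that two empty masks score 100 are all assumptions made for illustration, not the authors' evaluation code.

```python
import numpy as np


def dice_score(pred, target):
    """Dice coefficient, in percent, between two binary 3D masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Assumption: two empty masks count as a perfect match.
        return 100.0
    intersection = np.logical_and(pred, target).sum()
    return float(100.0 * 2.0 * intersection / denom)


def prompt_robustness(dice_per_variant):
    """Summarize Dice scores of shape (n_cases, n_prompt_variants).

    Assumed aggregation: worst case and spread are computed per case
    across prompt variants, then averaged over cases.
    """
    d = np.asarray(dice_per_variant, dtype=float)
    return {
        "dice": float(d.mean()),                    # mean over cases and variants
        "worst_dice": float(d.min(axis=1).mean()),  # worst variant per case
        "std": float(d.std(axis=1).mean()),         # population std per case
    }


if __name__ == "__main__":
    # Toy volumes: the target is a 2x2x2 cube, the prediction covers half of it.
    target = np.zeros((4, 4, 4), dtype=bool)
    target[:2, :2, :2] = True
    pred = np.zeros_like(target)
    pred[:2, :2, :1] = True
    print(dice_score(pred, target))

    # Two cases, three prompt variants each (made-up scores).
    print(prompt_robustness([[90.0, 80.0, 70.0], [50.0, 50.0, 50.0]]))
```

Under this reading, a method that is accurate on average but collapses for one phrasing shows up as a low Worst Dice and a high Std. even when its mean Dice looks competitive, which is the pattern the table is meant to expose.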

Qualitative comparisons

Under three clinically plausible free-text prompt variants, SEER produces consistent, on-target masks across all variants, reflecting evidence-aligned disambiguation from the image. In contrast, baseline outputs are highly prompt-sensitive in free-text prompting mode, frequently drifting to off-target regions or collapsing to degenerate predictions, such as empty or fragmented masks, under minor wording changes.

Qualitative comparison of SEER and baseline methods under three free-text prompt variants

BibTeX

@InProceedings{zhang2026seer,
      author    = { Zhang, Tongrui and Wang, Chenhui and Li, Yongming and Chen, Zhihao and Zhan, Xufeng and Shan, Hongming},
      title     = { Skill-Evolving Grounded Reasoning for Free-Text Promptable 3D Medical Image Segmentation },
      booktitle = { Medical Image Computing and Computer Assisted Intervention },
      year      = { 2026 }
}