SEER: Skill-Evolving Grounded Reasoning for Free-Text Promptable 3D Medical Image Segmentation

Abstract

Free-text promptable 3D medical image segmentation offers an intuitive and clinically flexible interaction paradigm. However, current methods are highly sensitive to linguistic variability: minor changes in phrasing can cause substantial performance degradation despite identical clinical intent. Existing approaches attempt to improve robustness through stronger vision-language fusion or larger vocabularies, yet they lack mechanisms to consistently align ambiguous free-form expressions with anatomically grounded representations.

We propose Skill-Evolving groundEd Reasoning (SEER), a novel framework for free-text promptable 3D medical image segmentation that explicitly bridges linguistic variability and anatomical precision through a reasoning-driven design. First, we curate the SEER-Trace dataset, which pairs raw clinical requests with image-grounded, skill-tagged reasoning traces, establishing a reproducible benchmark. Second, SEER constructs an evidence-aligned target representation via a vision-language reasoning chain that verifies clinical intent against image-derived anatomical evidence, thereby enforcing semantic consistency before voxel-level decoding. Third, we introduce SEER-Loop, a dynamic skill-evolving strategy that distills high-reward reasoning trajectories into reusable skill artifacts and progressively integrates them into subsequent inference, enabling structured self-refinement and improved robustness to diverse linguistic expressions.

Extensive experiments demonstrate superior performance of SEER over state-of-the-art baselines. Under linguistic perturbations, SEER reduces performance variance by 81.94% and improves worst-case Dice by 18.60%.

Dataset	Method	Label Prompting	Free-text Prompting
BrainMetShare	SAT	22.16	0.69	0.00	2.53
BiomedParseV2	18.66	2.53	0.00	7.27
Text3DSAM	0.10	0.41	0.00	0.93
MedSAM3	11.33	16.62	10.56	5.17
VoxTell	48.19	52.15	46.71	3.35
SEER (Ours)	51.70	53.83	51.44	1.67
PENGWIN	SAT	96.05	0.01	0.00	0.13
BiomedParseV2	1.35	8.53	0.00	7.50
Text3DSAM	24.75	0.01	0.00	0.16
MedSAM3	18.26	5.75	3.67	6.40
VoxTell	97.59	92.26	79.34	7.49
SEER (Ours)	97.56	97.39	95.47	0.98

Dataset

Method

Label Prompting

Free-text Prompting

Dice ↑

Worst Dice ↑

Std. ↓

BrainMetShare

SAT

22.16

0.69

0.00

2.53

BiomedParseV2

18.66

2.53

0.00

7.27

Text3DSAM

0.10

0.41

0.00

0.93

MedSAM3

11.33

16.62

10.56

5.17

VoxTell

48.19

52.15

46.71

3.35

SEER (Ours)

51.70

53.83

51.44

1.67

PENGWIN

SAT

96.05

0.01

0.00

0.13

BiomedParseV2

1.35

8.53

0.00

7.50

Text3DSAM

24.75

0.01

0.00

0.16

MedSAM3

18.26

5.75

3.67

6.40

VoxTell

97.59

92.26

79.34

7.49

SEER (Ours)

97.56

97.39

95.47

0.98

BibTeX

@InProceedings{zhang2026seer, author = { Zhang, Tongrui and Wang, Chenhui and Li, Yongming and Chen, Zhihao and Zhan, Xufeng and Shan, Hongming}, title = { Skill-Evolving Grounded Reasoning for Free-Text Promptable 3D Medical Image Segmentation }, booktitle = { Medical Image Computingand Computer Assisted Intervention }, year = { 2026 } }

SEER: Skill-Evolving Grounded Reasoning for Free-Text Promptable 3D Medical Image Segmentation

Different words. Same clinical intent. Stable 3D masks.
SEER makes free-text promptable 3D medical segmentation robust by grounding clinical language in image evidence and evolving reusable reasoning skills.

Abstract

Experiments and Findings

Evaluation datasets

Prompting protocols

Quantitative results

Qualitative comparisons

BibTeX

SEER: Skill-Evolving Grounded Reasoning for Free-Text Promptable 3D Medical Image Segmentation

Different words. Same clinical intent. Stable 3D masks. SEER makes free-text promptable 3D medical segmentation robust by grounding clinical language in image evidence and evolving reusable reasoning skills.

Abstract

Experiments and Findings

Evaluation datasets

Prompting protocols

Quantitative results

Qualitative comparisons

BibTeX

Different words. Same clinical intent. Stable 3D masks.
SEER makes free-text promptable 3D medical segmentation robust by grounding clinical language in image evidence and evolving reusable reasoning skills.