From Pixels to Concepts: Do Segmentation Models Understand What They Segment?

Shuang Liang1,3,†, Zeqing Wang2,†, Yuxian Li1,†, Xihui Liu1, Han Wang1,3,*
1Department of Electrical and Computer Engineering, The University of Hong Kong
2School of Computer Science and Engineering, Sun Yat-sen University
3CASIC, The University of Hong Kong

†Equal contribution. *Corresponding author.
Diverse CAFE test cases across SM, CC, and OC

A slice of the benchmark: counterfactual edits keep the evaluation mask fixed while varying attributes so positive and misleading negative prompts stress concept grounding, not just mask quality.

Abstract

Segmentation is a fundamental vision task underlying numerous downstream applications. Recent promptable segmentation models, such as Segment Anything Model 3 (SAM3), extend segmentation from category-agnostic mask prediction to concept-guided localization conditioned on high-level textual prompts. However, existing benchmarks primarily evaluate mask accuracy or object presence, leaving it unclear whether these models faithfully ground the queried concept or instead rely on visually salient but semantically misleading cues. We introduce CAFE: Counterfactual Attribute Factuality Evaluation, a benchmark for evaluating concept-faithful segmentation in promptable concept segmentation models. CAFE is built on attribute-level counterfactual manipulation: the target region and ground-truth mask are preserved, while attributes such as surface appearance, surrounding context, or material composition are modified to introduce misleading semantic cues. The benchmark contains 2,146 paired test samples across Superficial Mimicry (SM), Context Conflict (CC), and Ontological Conflict (OC). Experiments reveal a systematic gap between localization quality and concept discrimination: models often generate accurate masks even for misleading prompts, suggesting that strong mask prediction does not necessarily imply faithful semantic grounding.

Three counterfactual settings

Superficial Mimicry (SM)

Surface appearance is edited so the target resembles another category while identity stays the same, stressing pattern vs. object.

Context Conflict (CC)

Surrounding environment suggests a plausible but wrong category (e.g., teddy bear in snow vs. polar bear) while the object identity is unchanged.

Ontological Conflict (OC)

Material or substance changes (e.g., airplane shape rendered as cloud) so the valid concept shifts; the misleading prompt exploits leftover global shape cues.
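The three settings above share one paired-sample structure: an edited image, an unchanged ground-truth mask, and a positive/negative prompt pair. A minimal sketch of that structure (field names are illustrative, not the released dataset schema):

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class CafeSample:
    """One paired CAFE test case. Field names are illustrative."""
    image_path: str      # counterfactually edited image
    mask_path: str       # ground-truth mask, unchanged by the edit
    positive_prompt: str # names the concept actually present
    negative_prompt: str # names the misleading concept the edit suggests
    setting: Literal["SM", "CC", "OC"]

# Example: the Context Conflict case mentioned above
# (teddy bear placed in snow, misleadingly suggesting "polar bear").
sample = CafeSample(
    image_path="images/teddy_snow.png",   # hypothetical path
    mask_path="masks/teddy_snow.png",     # hypothetical path
    positive_prompt="teddy bear",
    negative_prompt="polar bear",
    setting="CC",
)
```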

Task overview

CAFE task and evaluation overview

CAFE focuses on attribute-level counterfactual evaluation: the mask is fixed while attributes (appearance, context, or material) change to induce positive vs. misleading negative prompts, testing whether models ground concepts or follow visual shortcuts.

Leaderboard

Promptable Concept Segmentation (PCS) performance on CAFE. cgF1 is the main metric combining concept discrimination (IL_MCC) with mask quality (pmF1).

# Model Type SM CC OC Overall

SM = Superficial Mimicry (1,111), CC = Context Conflict (593), OC = Ontological Conflict (442). SAM3 uses default threshold 0.5; other models use thresholds calibrated via LVIS cgF1 sweep.
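Assuming cgF1 follows the SAM3-style combination, where mask quality (pmF1) is gated by an image-level Matthews correlation coefficient over presence/absence decisions, the metric can be sketched as:

```python
import math

def il_mcc(tp, tn, fp, fn):
    """Image-level Matthews correlation coefficient over
    presence/absence decisions (did the model segment when it should?)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def cg_f1(pm_f1, mcc):
    """Mask quality gated by concept discrimination (sketch, assuming
    a multiplicative combination scaled to 0-100)."""
    return 100.0 * pm_f1 * mcc

# A model with perfect masks (pm_f1 = 1.0) but chance-level concept
# discrimination (mcc = 0.0) scores 0: mask quality alone is not enough.
```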

SAM3 + Agent Demo

Cases where SAM3 alone fails under counterfactual cues. The CAFE-SAM3 agent corrects these through multi-step reasoning, zoom-in inspection, and concept verification.
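The agent loop described above (reasoning over proposals, zoom-in inspection, concept verification) might look roughly like this; the helper names are placeholders, not the released implementation:

```python
def agent_segment(image, prompt, propose_masks, crop_around, verify_concept):
    """Hypothetical CAFE-SAM3 agent sketch: keep only proposals whose
    zoomed-in crop is verified to actually depict the prompted concept."""
    verified = []
    for mask in propose_masks(image, prompt):   # step 1: SAM3 proposals
        crop = crop_around(image, mask)         # step 2: zoom-in inspection
        if verify_concept(crop, prompt):        # step 3: concept verification
            verified.append(mask)
    return verified  # may be empty: a correct rejection of a misleading prompt
```

The key design point is that an empty result is a valid answer: the verifier can veto every proposal when the prompt names a concept that is not actually present.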

Test case image · Counterfactual edited image

BibTeX

@misc{liang2026cafe,
  title={From Pixels to Concepts: Do Segmentation Models Understand What They Segment?},
  author={Shuang Liang and Zeqing Wang and Yuxian Li and Xihui Liu and Han Wang},
  year={2026},
  url={https://github.com/T-S-Liang/CAFE},
  note={Hugging Face dataset teemosliang/CAFE},
}