Segmentation is a fundamental vision task underlying numerous downstream applications. Recent promptable segmentation models, such as the Segment Anything Model 3 (SAM3), extend segmentation from category-agnostic mask prediction to concept-guided localization conditioned on high-level textual prompts. However, existing benchmarks primarily evaluate mask accuracy or object presence, leaving it unclear whether these models faithfully ground the queried concept or instead rely on visually salient but semantically misleading cues. We introduce CAFE (Counterfactual Attribute Factuality Evaluation), a benchmark for evaluating concept-faithful segmentation in promptable concept segmentation models. CAFE is built on attribute-level counterfactual manipulation: the target region and ground-truth mask are preserved, while attributes such as surface appearance, surrounding context, or material composition are modified to introduce misleading semantic cues. The benchmark contains 2,146 paired test samples across three categories: Superficial Mimicry (SM), Context Conflict (CC), and Ontological Conflict (OC). Experiments reveal a systematic gap between localization quality and concept discrimination: models often generate accurate masks even for misleading prompts, suggesting that strong mask prediction does not necessarily imply faithful semantic grounding.
- **Superficial Mimicry (SM):** surface appearance is edited so the target resembles another category while its identity stays the same, stressing pattern vs. object cues.
- **Context Conflict (CC):** the surrounding environment suggests a plausible but wrong category (e.g., a teddy bear in snow vs. a polar bear) while the object identity is unchanged.
- **Ontological Conflict (OC):** material or substance changes (e.g., an airplane shape rendered as a cloud) shift which concept is valid; the misleading prompt exploits leftover global shape cues.
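The paired protocol above can be scored per sample: the model should recover the ground-truth mask for the faithful prompt and predict nothing for the misleading one. The sketch below is illustrative, not the official evaluation code; the `segment` callable and the IoU pass criterion are assumptions.

```python
import numpy as np

def iou(pred, gt):
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0

def evaluate_pair(segment, image, gt_mask, faithful_prompt, misleading_prompt,
                  iou_thresh=0.5):
    """Score one CAFE pair.

    segment(image, prompt) is assumed to return a boolean mask, or None
    when the model judges the concept absent. The pair passes only if the
    faithful prompt yields an accurate mask AND the misleading prompt
    yields no prediction.
    """
    pred_f = segment(image, faithful_prompt)
    pred_m = segment(image, misleading_prompt)
    hit = pred_f is not None and iou(pred_f, gt_mask) >= iou_thresh
    false_alarm = pred_m is not None and pred_m.any()
    return bool(hit and not false_alarm)
```

A model with perfect masks but no concept discrimination fails every pair, which is exactly the gap the benchmark is designed to expose.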
Promptable Concept Segmentation (PCS) performance on CAFE. The main metric, cgF1, combines concept discrimination (IL_MCC) with mask quality (pmF1).
| # | Model | Type | SM | CC | OC | Overall |
|---|---|---|---|---|---|---|
SM = Superficial Mimicry (1,111 samples), CC = Context Conflict (593), OC = Ontological Conflict (442). SAM3 uses its default threshold of 0.5; the other models use thresholds calibrated via a cgF1 sweep on LVIS.
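One plausible reading of that calibration step: sweep confidence thresholds and keep the one maximizing cgF1, taking cgF1 = pmF1 × IL_MCC per the table caption. The function names and interfaces below are illustrative assumptions, not the benchmark's released tooling.

```python
import numpy as np

def matthews_corr(tp, fp, fn, tn):
    """Matthews correlation coefficient over image-level presence decisions."""
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

def calibrate_threshold(scores, presence_gt, pmf1_at,
                        thresholds=np.linspace(0.05, 0.95, 19)):
    """Pick the confidence threshold maximizing cgF1 = pmF1 * IL_MCC.

    scores      -- per-image max detection confidence for the queried concept
    presence_gt -- per-image boolean: is the concept actually present?
    pmf1_at     -- callable returning positive-set mask F1 at a threshold
    """
    best_t, best_cgf1 = None, -1.0
    for t in thresholds:
        pred = scores >= t
        tp = np.sum(pred & presence_gt)
        fp = np.sum(pred & ~presence_gt)
        fn = np.sum(~pred & presence_gt)
        tn = np.sum(~pred & ~presence_gt)
        cgf1 = pmf1_at(t) * matthews_corr(tp, fp, fn, tn)
        if cgf1 > best_cgf1:
            best_t, best_cgf1 = t, cgf1
    return best_t, best_cgf1
```

Because IL_MCC collapses to zero when the model predicts presence on every image, this sweep penalizes exactly the always-segment behavior the benchmark probes.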
Cases where SAM3 alone fails under counterfactual cues. The CAFE-SAM3 agent corrects these through multi-step reasoning, zoom-in inspection, and concept verification.
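The agent's correction loop can be read as propose-then-verify. The sketch below assumes hypothetical `propose_masks`, `crop_around`, and `verify_concept` interfaces; it is not the released CAFE-SAM3 implementation.

```python
def cafe_agent(image, prompt, propose_masks, crop_around, verify_concept):
    """Hedged sketch of a propose-verify loop. Assumed interfaces:

    propose_masks(image, prompt)  -- yields candidate masks, best first
    crop_around(image, mask)      -- zoom-in inspection around a candidate
    verify_concept(crop, prompt)  -- does the crop actually show the concept?
    """
    for mask in propose_masks(image, prompt):
        crop = crop_around(image, mask)     # zoom-in inspection
        if verify_concept(crop, prompt):    # concept verification
            return mask                     # accept the verified candidate
    return None                             # reject: concept not present
```

The key design point is the explicit rejection path: unlike a single forward pass, the agent may return no mask at all when verification fails on every candidate.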
@misc{liang2026cafe,
  title={From Pixels to Concepts: Do Segmentation Models Understand What They Segment?},
  author={Shuang Liang and Zeqing Wang and Yuxian Li and Xihui Liu and Han Wang},
  year={2026},
  url={https://github.com/T-S-Liang/CAFE},
  note={Hugging Face dataset teemosliang/CAFE},
}