PerceptionComp: A Video Benchmark for
Complex Perception-Centric Reasoning

1Tsinghua University    2University of Washington    3Nanyang Technological University
* Equal contribution     # Project co-lead     Equal advising
1,114 Five-Choice Questions
279 High-Complexity Videos
7 Video Categories
10–20 min Per-Question Annotation
100% Manual Annotation

Benchmark Overview

PerceptionComp benchmark overview figure

Leaderboard

Comprehensive evaluation across video categories and difficulty levels  ·  Chance level = 20%


Abstract

Deep video understanding requires long-horizon, perception-centric reasoning that repeatedly revisits a video to gather temporally distributed evidence. However, existing benchmarks are either relatively easy (perception-centric but often solvable after a single view) or logic-heavy with simplified visuals, and thus do not faithfully measure multimodal test-time thinking that depends on repeated perception.

We introduce PerceptionComp, a fully manually annotated benchmark designed so that no single moment is sufficient: answering requires evidence from multiple temporally separated segments under compositional constraints. PerceptionComp contains 1,114 five-choice questions over 279 high-scene-complexity videos spanning diverse domains. Videos are selected using automatic proxies for scene complexity (SAM2 instance counts and optical-flow magnitude), and annotating each question takes 10–20 minutes.

Human evaluation confirms the intended difficulty. PerceptionComp requires substantially longer response times than prior benchmarks; under a single-view setting (no rewatching), human accuracy drops to near chance (18.97%), while experts reach 100% accuracy given unrestricted rewatching and sufficient time. State-of-the-art MLLMs perform notably worse: the best model (Gemini-3-Flash) reaches only 45.96% accuracy, and open-source MLLMs remain below 40%.

Key Findings & Results

Important insights from PerceptionComp evaluation

Language Reasoning ≠ Perceptual Reasoning

Stronger language-side thinking does not automatically improve perception-driven video reasoning. Qwen3-VL thinking variants sometimes underperform their instruction-tuned counterparts when perceptual evidence is misread.

Test-Time Reasoning Helps (+11%)

GPT-o3 surpasses GPT-4o by 11.04%; Gemini-2.5-Pro exceeds Gemini-2.5-Flash by 6.19%. Reasoning is beneficial but far from closing the gap to human-level performance.

Spatial Understanding is the Bottleneck

60% of mid-chain failures are attributed to violated spatial subconditions. Models anchor on objects matching identity keywords but violating critical spatial or temporal constraints.

Mid-Chain Collapse (Peak at Step 3: 40%)

Once an intermediate entity is wrong, the remaining chain drifts from ground truth while staying internally coherent — especially severe for sequential questions where later conditions depend on earlier results.

Frontier Models Cluster in the Mid-40s

Gemini-3 variants and GPT-o3 all cluster around 43–46% despite different architectures, suggesting a fundamental bottleneck in perception-centric long-horizon reasoning.

Scale Alone Does Not Solve It

Qwen3-VL 235B ≈ Qwen3-VL 8B ≈ 34% overall. The benchmark requires reliable perceptual extraction under clutter, not merely larger generic capacity.

Example Demonstrations

Sample analysis from PerceptionComp videos

Demo Example 1

Demo Example 2

Demo Example 3

Demo Example 4

Benchmark Design

Video Selection

SAM2 + optical flow complexity

Subcondition Design

Semantic · Spatial · Temporal · Correspondence

Question Assembly

Conjunctive or Sequential

Annotation & Verification

Dual-annotator check · 89% agreement
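The dual-annotator check above reduces to a simple per-question agreement rate: the fraction of questions where both annotators select the same answer option. A minimal sketch, using hypothetical answer keys (the real annotation data is not shown here):

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of questions where two annotators chose the same answer."""
    assert len(labels_a) == len(labels_b), "annotators must label the same questions"
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Hypothetical five-choice answer keys (A–E) from two annotators.
annotator_1 = ["A", "C", "B", "E", "D", "A", "B", "C", "E"]
annotator_2 = ["A", "C", "B", "E", "D", "A", "B", "D", "E"]
print(round(agreement_rate(annotator_1, annotator_2), 2))  # -> 0.89
```

Questions where the two annotators disagree would then be flagged for adjudication rather than dropped silently.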

Video Categories

We select videos with high scene and object complexity across seven diverse real-world categories (2–10 min clips). Complexity is measured via SAM2 instance counts and optical-flow magnitude.

City Walk Tours

Dense pedestrian traffic, complex street scenes, frequent camera motion

Shopping in Malls

Crowded indoor environments with stores, signs, and many objects

Sports Competitions

Fast motion, multiple athletes, dynamic scenes and scene transitions

Indoor Villa Tours

Large indoor spaces with rich object arrangements across many rooms

Variety Shows

Entertainment shows with multiple people, events, and stage changes

Movie Clips

Film excerpts with complex narratives and rich visual detail

Game Livestreams

Screen-captured games with naturally occurring clutter and dynamics
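The complexity filtering described above combines SAM2 instance counts with optical-flow magnitude; both require heavyweight models or libraries. As a self-contained stand-in for the motion half of that proxy, the sketch below scores clips by mean frame-to-frame intensity change — a crude substitute for dense optical flow, with a hypothetical threshold:

```python
import numpy as np

def motion_score(frames: np.ndarray) -> float:
    """Mean per-pixel intensity change between consecutive frames.

    A crude stand-in for optical-flow magnitude: clips with more
    motion score higher. `frames` has shape (T, H, W), grayscale,
    values in [0, 1].
    """
    diffs = np.abs(np.diff(frames, axis=0))  # (T-1, H, W)
    return float(diffs.mean())

def select_complex_clips(clips, threshold=0.05):
    """Indices of clips whose motion score exceeds the threshold (assumed value)."""
    return [i for i, clip in enumerate(clips) if motion_score(clip) > threshold]

# Toy check: a static clip vs. a clip with changing content.
rng = np.random.default_rng(0)
static = np.full((8, 32, 32), 0.5)
dynamic = rng.random((8, 32, 32))
print(select_complex_clips([static, dynamic]))  # -> [1]
```

In practice, a dense optical-flow estimator (e.g. OpenCV's Farneback method) would replace the frame difference, and the SAM2 instance count would be averaged over sampled frames to capture scene clutter as well as motion.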

Perceptual Skills

Each question combines subconditions probing distinct skills. Solving a question requires their coordinated use, not any single narrow competence.

Semantic Understanding

Recognize object categories, attributes (shape, color, material), and higher-level relations such as roles or interactions.

Spatial Understanding

Reason about scene layout and relative geometry — left/right, front/behind, near/far, occlusion, and 3D spatial relations.

Temporal Understanding

Follow motion patterns, localize events in time (before/after a reference event), and reason about event ordering.

Correspondence

Match instances across time and views — tracking objects across shots, part–whole matching, re-identification after occlusion.

BibTeX

@article{li2026perceptioncomp,
  title   = {{PerceptionComp}: A Video Benchmark for Complex Perception-Centric Reasoning},
  author  = {Li, Shaoxuan and Zhao, Zhixuan and Deng, Hanze and Ma, Zirun and
             Tian, Shulin and Liu, Zuyan and Hu, Yushi and Wu, Haoning and
             Dong, Yuhao and Liu, Benlin and Liu, Ziwei and Krishna, Ranjay},
  journal = {arXiv preprint arXiv:2603.26653},
  year    = {2026}
}