Video Understanding by Design: How Datasets Shape Video Models

Wang, Lei; Li, Syuan-Hao; Koniusz, Piotr; Gao, Yongsheng

Lei Wang¹ Syuan-Hao Li¹ Piotr Koniusz² Yongsheng Gao¹

¹Griffith University ²University of New South Wales

Dataset-induced representational biases across pretraining datasets

The same MotionFormer architecture develops different attention patterns when pretrained on datasets with different structures: context-heavy scenes, motion-centric transformations, and hand-object interactions.

The showcased videos are from OWM.
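As a rough illustration of what an attention heatmap of this kind contains, the sketch below computes the CLS-to-patch attention of a single head and reshapes it onto a space-time token grid. It assumes plain joint scaled dot-product attention with a leading CLS token; MotionFormer and TimeSformer use their own attention variants, and the function name `cls_attention_map`, the token layout, and the random inputs are hypothetical rather than taken from the authors' code.

```python
import numpy as np

def cls_attention_map(q, k, grid):
    """Return one head's CLS-to-patch attention as a (T, H, W) array.

    q, k: arrays of shape (1 + T*H*W, d); row 0 is assumed to be the CLS token.
    grid: (T, H, W) token grid (an assumption about the tokenization).
    """
    t, h, w = grid
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                  # scaled dot-product scores
    logits -= logits.max(axis=-1, keepdims=True)   # subtract row max for stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)       # row-wise softmax
    cls_to_patches = attn[0, 1:]                   # CLS query, patch keys only
    cls_to_patches = cls_to_patches / cls_to_patches.sum()  # renormalize over patches
    return cls_to_patches.reshape(t, h, w)

# Toy inputs standing in for one block's queries and keys (hypothetical data).
rng = np.random.default_rng(0)
grid = (2, 4, 4)
n_tokens = 1 + grid[0] * grid[1] * grid[2]
q = rng.standard_normal((n_tokens, 8))
k = rng.standard_normal((n_tokens, 8))
heatmap = cls_attention_map(q, k, grid)
```

Each temporal slice `heatmap[i]` could then be upsampled and overlaid on the corresponding frame; running the same routine on models pretrained on different datasets is one way to produce side-by-side comparisons like those shown here.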

Abstract

Research in video understanding has advanced rapidly, driven by increasingly diverse datasets and more powerful model architectures. While existing surveys often organize progress by tasks, benchmarks, or model families, they provide limited insight into why particular architectures emerged and succeeded.

This survey adopts a dataset-centric perspective: dataset structure shapes model design. Motion complexity, temporal span, compositional hierarchy, multi-agent interaction, and multimodal richness impose distinct learning challenges. These pressures naturally give rise to inductive biases for viewpoint robustness, temporal ordering, long-range dependency modeling, relational reasoning, and cross-modal alignment.

Dataset Properties

Motion, duration, interaction, composition, and modality define the learning signal.

Inductive Biases

Models favor different invariances and evidence patterns depending on the data regime.

Architectural Response

Model families can be understood as responses to evolving dataset requirements.

MotionFormer

Dataset-induced representational biases (final block)

Panels: Original Frames, Kinetics-400, Kinetics-600, Something-Something V2, EPIC-KITCHENS-100

Dataset-induced biases across network depth (pre-training on Something-Something V2)

TimeSformer

Dataset-induced representational biases (final block)

Panels: Original Frames, Kinetics-400, Kinetics-600, Something-Something V2

Dataset-induced biases across network depth (pre-training on Something-Something V2)
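One possible way to summarize how attention changes across network depth is to reduce each block's attention map to a single number. The sketch below uses Shannon entropy, which is high for diffuse attention and zero for attention concentrated on one token. Entropy is a statistic chosen here for illustration, not a measure reported on this page, and the helper names `attention_entropy` and `entropy_by_depth` are hypothetical.

```python
import numpy as np

def attention_entropy(attn_map):
    """Shannon entropy (in nats) of a non-negative attention map."""
    p = np.asarray(attn_map, dtype=float).ravel()
    p = p / p.sum()                 # normalize to a probability distribution
    nz = p[p > 0]                   # skip zeros, since 0 * log(0) is taken as 0
    return float(-(nz * np.log(nz)).sum())

def entropy_by_depth(maps_per_block):
    """Return one entropy value per block, ordered from first to last block."""
    return [attention_entropy(m) for m in maps_per_block]

# Toy maps standing in for two blocks (hypothetical data):
# a uniform map and a map focused on a single token.
uniform = np.full((2, 4, 4), 1.0)
peaked = np.zeros((2, 4, 4))
peaked[0, 1, 1] = 1.0
profile = entropy_by_depth([uniform, peaked])
```

Plotting such a profile for each pretraining dataset would give a compact, if coarse, companion to the per-block visualizations.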

Takeaway

Aligning model design with dataset structure provides a practical way to explain past video-understanding progress and to design future systems that better handle motion, long-range temporal reasoning, interactions, and multimodal evidence.

BibTeX

Please use the following BibTeX entry to cite our work if you find it useful in your research.

@article{wang2026video,
  title={Video Understanding by Design: How Datasets Shape Video Models},
  author={Wang, Lei and Li, Syuan-Hao and Koniusz, Piotr and Gao, Yongsheng},
  journal={arXiv preprint arXiv:2509.09151},
  year={2026}
}