Video Understanding by Design:
How Datasets Shape Video Models

¹Griffith University, ²University of New South Wales
Dataset-induced representational biases across pretraining datasets

The same MotionFormer architecture develops different attention patterns when pretrained on datasets with different structures: context-heavy scenes, motion-centric transformations, and hand-object interactions.

The showcased videos are from OWM.

Abstract

Research in video understanding has advanced rapidly, driven by increasingly diverse datasets and more powerful model architectures. While existing surveys often organize progress by tasks, benchmarks, or model families, they provide limited insight into why particular architectures emerged and succeeded.

This survey adopts a dataset-centric perspective: dataset structure shapes model design. Motion complexity, temporal span, compositional hierarchy, multi-agent interaction, and multimodal richness impose distinct learning challenges. These pressures naturally give rise to inductive biases for viewpoint robustness, temporal ordering, long-range dependency modeling, relational reasoning, and cross-modal alignment.

Dataset Properties

Motion, duration, interaction, composition, and modality define the learning signal.

Inductive Biases

Models favor different invariances and evidence patterns depending on the data regime.

Architectural Response

Model families can be understood as responses to evolving dataset requirements.

MotionFormer

Dataset-induced representational biases (final block)

Figure: MotionFormer block 12 attention on the "write" video. Panels: Original Frames, Kinetics-400, Kinetics-600, Something-Something V2, EPIC-KITCHENS-100.

Dataset-induced biases across network depth (pretraining on Something-Something V2)
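
The page does not say how the per-block attention visualizations were produced. As a rough illustration only, the sketch below shows one common way to turn transformer attention weights into per-frame heatmaps for every block. It assumes that each block exposes softmax attention of shape (heads, N, N) with N = 1 + T*H*W, that a class token sits at index 0, and that patch tokens are ordered frame by frame. The function names `cls_attention_heatmaps` and `heatmaps_by_depth` are hypothetical, and the random tensors merely stand in for real model outputs.

```python
import numpy as np


def cls_attention_heatmaps(attn, num_frames, grid_h, grid_w):
    """Convert one block's attention into per-frame heatmaps.

    attn: array of shape (heads, N, N), rows summing to 1, where
          N = 1 + num_frames * grid_h * grid_w and token 0 is the class token.
          (Assumed layout; real models may order tokens differently.)
    Returns an array of shape (num_frames, grid_h, grid_w) scaled to [0, 1].
    """
    _, n, _ = attn.shape
    expected = 1 + num_frames * grid_h * grid_w
    if n != expected:
        raise ValueError(f"expected {expected} tokens, got {n}")

    # Attention from the class token to every patch token, averaged over heads.
    cls_to_patches = attn[:, 0, 1:].mean(axis=0)
    maps = cls_to_patches.reshape(num_frames, grid_h, grid_w)

    # Min-max normalise each frame so heatmaps are comparable when displayed.
    lo = maps.min(axis=(1, 2), keepdims=True)
    hi = maps.max(axis=(1, 2), keepdims=True)
    return (maps - lo) / (hi - lo + 1e-8)


def heatmaps_by_depth(per_block_attn, num_frames, grid_h, grid_w):
    """Apply cls_attention_heatmaps to every block, shallowest first."""
    return [
        cls_attention_heatmaps(a, num_frames, grid_h, grid_w)
        for a in per_block_attn
    ]


# Stand-in data: 12 blocks, 2 heads, 2 frames on a 3x4 patch grid.
rng = np.random.default_rng(0)
T, H, W, heads, blocks = 2, 3, 4, 2, 12
N = 1 + T * H * W
logits = rng.normal(size=(blocks, heads, N, N))
attn = np.exp(logits)
attn /= attn.sum(axis=-1, keepdims=True)  # softmax over the key dimension

maps = heatmaps_by_depth(attn, T, H, W)
final_block_maps = maps[-1]  # the "final block" view used in the galleries
```

Under these assumptions, comparing `maps` from the same architecture pretrained on different datasets is one way to inspect where each pretraining regime concentrates attention at a given depth.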

TimeSformer

Dataset-induced representational biases (final block)

Figure: TimeSformer block 12 attention on the "pour" video. Panels: Original Frames, Kinetics-400, Kinetics-600, Something-Something V2.

Dataset-induced biases across network depth (pretraining on Something-Something V2)

Takeaway

Aligning model design with dataset structure provides a practical way to explain past video-understanding progress and to design future systems that better handle motion, long-range temporal reasoning, interactions, and multimodal evidence.

BibTeX

Please use the following BibTeX entry to cite our work if you find it useful in your research.

@article{wang2026video,
  title={Video Understanding by Design: How Datasets Shape Video Models},
  author={Wang, Lei and Li, Syuan-Hao and Koniusz, Piotr and Gao, Yongsheng},
  journal={arXiv preprint arXiv:2509.09151},
  year={2026}
}