Christoph Feichtenhofer

Ann Lee

Wei-Ning Hsu

CoRR, December, 2025

SAM Audio: Segment Anything in Audio.

Santhosh Kumar Ramakrishnan

Piotr Dollár

Wei-Ning Hsu

Ann Lee

CoRR, December, 2025

Ego4D: Around the World in 3,600 Hours of Egocentric Video.

Kiran K. Somasundaram

Giovanni Maria Farinella

IEEE Trans. Pattern Anal. Mach. Intell., November, 2025

SAM 3: Segment Anything with Concepts.

CoRR, November, 2025

Perception Encoder: The best visual embeddings are not at the output of the network.

CoRR, April, 2025

PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding.

CoRR, April, 2025

Gaussian Masked Autoencoders.

Xinlei Chen

Ruilong Li

Shiry Ginosar

CoRR, January, 2025

SAM 2: Segment Anything in Images and Videos.

Kalyan Vasudev Alwala

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

An Empirical Study of Autoregressive Pre-Training from Videos.

Ilija Radosavovic

Rahul Ravishankar

Yossi Gandelsman

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

2024

Window Attention is Bugged: How not to Interpolate Position Embeddings.

Daniel Bolya

Chaitanya Ryali

Judy Hoffman

Proceedings of the Twelfth International Conference on Learning Representations, 2024

Demystifying CLIP Data.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

Altogether: Image Captioning via Re-aligning Alt-text.

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

2023

MAViL: Masked Audio-Video Learners.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles.

Proceedings of the International Conference on Machine Learning, 2023

Token Merging: Your ViT But Faster.

Judy Hoffman

Proceedings of the Eleventh International Conference on Learning Representations, 2023

The effectiveness of MAE pre-pretraining for billion-scale pretraining.

Mannat Singh

Quentin Duval

Kalyan Vasudev Alwala

Ross B. Girshick

Rohit Girdhar

Ishan Misra

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Diffusion Models as Masked Autoencoders.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

CiT: Curation in Training for Effective Vision-Language Data.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Multiview Compressive Coding for 3D Reconstruction.

Justin Johnson

Georgia Gkioxari

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

On the Benefits of 3D Pose and Tracking for Human Action Recognition.

Georgios Pavlakos

Angjoo Kanazawa

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Scaling Language-Image Pre-Training via Masking.

Yanghao Li

Ronghang Hu

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

2022

Masked Autoencoders As Spatiotemporal Learners.

Yanghao Li

Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

Masked Autoencoders that Listen.

Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

TrackFormer: Multi-Object Tracking with Transformers.

Tim Meinhardt

Alexander Kirillov

Laura Leal-Taixé

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

Reversible Vision Transformers.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

MViTv2: Improved Multiscale Vision Transformers for Classification and Detection.

Santhosh Kumar Ramakrishnan

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

Ego4D: Around the World in 3,000 Hours of Egocentric Video.

Kiran K. Somasundaram

Giovanni Maria Farinella

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

Masked Feature Prediction for Self-Supervised Visual Pre-Training.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

A ConvNet for the 2020s.

Zhuang Liu

Hanzi Mao

Trevor Darrell

Saining Xie

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2021

Improved Multiscale Vision Transformers for Classification and Detection.

Santhosh Kumar Ramakrishnan

CoRR, 2021

Ego4D: Around the World in 3,000 Hours of Egocentric Video.

Kiran K. Somasundaram

Giovanni Maria Farinella

CoRR, 2021

Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers.

Andrea Vedaldi

João F. Henriques

Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

PyTorchVideo: A Deep Learning Library for Video Understanding.

Tullie Murrell

Heng Wang

Kalyan Vasudev Alwala

Proceedings of the MM '21: ACM Multimedia Conference, Virtual Event, China, October 20, 2021

Multiview Pseudo-Labeling for Semi-supervised Learning from Video.

Bo Xiong

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Multiscale Vision Transformers.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding.

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021

A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding.

Florian Metze

Luke Zettlemoyer

Proceedings of the Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, 2021

2020

Modeling Human Motion with Quaternion-Based Neural Networks.

Dario Pavllo

Michael Auli

David Grangier

Int. J. Comput. Vis., 2020

Deep Insights into Convolutional Networks for Video Recognition.

Int. J. Comput. Vis., 2020

Feature Pyramid Grids.

CoRR, 2020

Audiovisual SlowFast Networks for Video Recognition.

CoRR, 2020

A Multigrid Method for Efficiently Training Video Models.

Ross B. Girshick

Philipp Krähenbühl

Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

Ego-Topo: Environment Affordances From Egocentric Video.

Tushar Nagarajan

Yanghao Li

Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

X3D: Expanding Architectures for Efficient Video Recognition.

Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

2019

Grounded Human-Object Interaction Hotspots from Video (Extended Abstract).

Tushar Nagarajan

CoRR, 2019

Learning Temporal Pose Estimation from Sparsely-Labeled Videos.

Gedas Bertasius

Du Tran

Jianbo Shi

Lorenzo Torresani

Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, 2019

Grounded Human-Object Interaction Hotspots From Video.

Tushar Nagarajan

Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019

SlowFast Networks for Video Recognition.

Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019

Long-Term Feature Banks for Detailed Video Understanding.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

3D Human Pose Estimation in Video With Temporal Convolutions and Semi-Supervised Training.

Dario Pavllo

David Grangier

Michael Auli

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

2018

Learning Discriminative Motion Features Through Detection.

Gedas Bertasius

Du Tran

Jianbo Shi

Lorenzo Torresani

CoRR, 2018

Camera-based vehicle velocity estimation from monocular video.

Moritz Kampelmühler

Michael G. Müller

CoRR, 2018

What Have We Learned From Deep Representations for Action Recognition?

Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018

2017

Detect to Track and Track to Detect.

Proceedings of the IEEE International Conference on Computer Vision, 2017

Spatiotemporal Multiplier Networks for Video Action Recognition.

Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017

Temporal Residual Networks for Dynamic Scene Recognition.

Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017

2016

Dynamic Scene Recognition with Complementary Spatiotemporal Features.

IEEE Trans. Pattern Anal. Mach. Intell., 2016

Spatiotemporal Residual Networks for Video Action Recognition.

Proceedings of the Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, 2016

Convolutional Two-Stream Network Fusion for Video Action Recognition.

Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016

2015

Dynamically encoded actions based on spacetime saliency.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015

2014

Fusing RFID and computer vision for probabilistic tag localization.

Michael Goller

Proceedings of the IEEE International Conference on RFID, 2014

Bags of Spacetime Energies for Dynamic Scene Recognition.

Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014

2013

A Perceptual Image Sharpness Metric Based on Local Edge Gradient Analysis.

Hannes Fassold

Peter Schallauer

IEEE Signal Process. Lett., 2013

Spatio-temporal Good Features to Track.

Proceedings of the 2013 IEEE International Conference on Computer Vision Workshops, 2013

Spacetime Forests with Complementary Features for Dynamic Scene Recognition.
