Rohit Girdhar

Kiran K. Somasundaram

Giovanni Maria Farinella

IEEE Trans. Pattern Anal. Mach. Intell., November, 2025

Toward Diffusible High-Dimensional Latent Spaces: A Frequency Perspective.

CoRR, November, 2025

Diffusion Autoencoders are Scalable Image Tokenizers.

CoRR, January, 2025

LLMs can see and hear without any training.

Proceedings of the Forty-second International Conference on Machine Learning, 2025

MotiF: Making Text Count in Image Animation with Motion Focal Loss.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2024

Human Action Anticipation: A Survey.

CoRR, 2024

Factorizing Text-to-Video Generation by Explicit Image Conditioning.

Proceedings of the Computer Vision - ECCV 2024, 2024

Generating Illustrated Instructions.

Sachit Menon

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

InstanceDiffusion: Instance-Level Control for Image Generation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2023

Motion-Conditioned Image Animation for Video Editing.

CoRR, 2023

Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning.

CoRR, 2023

Learning to Substitute Ingredients in Recipes.

Adriana Romero-Soriano

CoRR, 2023

What You Say Is What You Show: Visual Narration Detection in Instructional Videos.

CoRR, 2023

The effectiveness of MAE pre-pretraining for billion-scale pretraining.

Mannat Singh

Quentin Duval

Kalyan Vasudev Alwala

Ross B. Girshick

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

OmniMAE: Single Model Masked Pretraining on Images and Videos.

Alaaeldin El-Nouby

Mannat Singh

Kalyan Vasudev Alwala

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

ImageBind: One Embedding Space To Bind Them All.

Kalyan Vasudev Alwala

Santhosh Kumar Ramakrishnan

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

HierVL: Learning Hierarchical Video-Language Embeddings.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Cut and Learn for Unsupervised Object Detection and Instance Segmentation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Learning Video Representations from Large Language Models.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

2022

Detecting Twenty-Thousand Classes Using Image-Level Supervision.

Proceedings of the Computer Vision - ECCV 2022, 2022

Ego4D: Around the World in 3,000 Hours of Egocentric Video.

Kiran K. Somasundaram

Giovanni Maria Farinella

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

Omnivore: A Single Model for Many Visual Modalities.

Mannat Singh

Nikhila Ravi

Laurens van der Maaten

Santhosh Kumar Ramakrishnan

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

Masked-attention Mask Transformer for Universal Image Segmentation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2021

Mask2Former for Video Instance Segmentation.

CoRR, 2021

Ego4D: Around the World in 3,000 Hours of Egocentric Video.

Kiran K. Somasundaram

Giovanni Maria Farinella

CoRR, 2021

Physical Reasoning Using Dynamics-Aware Models.

Eltayeb Ahmed

Anton Bakhtin

Laurens van der Maaten

CoRR, 2021

Self-Supervised Pretraining of 3D Features on any Point-Cloud.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

An End-to-End Transformer Model for 3D Object Detection.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Anticipative Video Transformer.

Kristen Grauman

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

3D Spatial Recognition Without Spatially Labeled 3D.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

2020

Forward Prediction for Physical Reasoning.

Laura Gustafson

Aaron Adcock

Laurens van der Maaten

CoRR, 2020

MetaPix: Few-Shot Video Retargeting.

Jessica Lee

Proceedings of the 8th International Conference on Learning Representations, 2020

CATER: A diagnostic dataset for Compositional Actions & TEmporal Reasoning.

Proceedings of the 8th International Conference on Learning Representations, 2020

2019

Learning to Understand People via Local, Global and Temporal Reasoning.

PhD thesis, 2019

CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning.

CoRR, 2019

Are we Asking the Right Questions in MovieQA?

Bhavan Jasani

Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshops, 2019

DistInit: Learning Video Representations Without a Single Labeled Video.

Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019

Video Action Transformer Network.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

2018

A Better Baseline for AVA.

CoRR, 2018

Detect-and-Track: Efficient Pose Estimation in Videos.

Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018

2017

Attentional Pooling for Action Recognition.

Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 2017

ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification.

Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017

Binge Watching: Scaling Affordance Learning from Sitcoms.

Xiaolong Wang

Abhinav Gupta

Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017

2016

Cutting through the clutter: Task-relevant features for image matching.

Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision, 2016

Learning a Predictable and Generative Vector Representation for Objects.

Proceedings of the Computer Vision - ECCV 2016, 2016

2014

Optimizing Storage Intensive Vision Applications to Device Capacity.
