Chen Sun

CoRR, July, 2025

MS4UI: A Dataset for Multi-modal Summarization of User Interface Instructional Videos.

CoRR, June, 2025

Self-Adapting Improvement Loops for Robotic Learning.

CoRR, June, 2025

Unified Autoregressive Visual Generation and Understanding with Continuous Tokens.

CoRR, March, 2025

Pre-trained Vision-Language Models Learn Discoverable Visual Concepts.

Trans. Mach. Learn. Res., 2025

Learning Visual Grounding from Generative Vision and Language Model.

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2025

Force Prompting: Video Generation Models Can Learn And Generalize Physics-based Control Signals.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, 2025

Dense Video Object Captioning from Disjoint Supervision.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Solving New Tasks by Adapting Internet Video Knowledge.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Fourier Head: Helping Large Language Models Learn Complex Probability Distributions.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

How Can Objects Help Video-Language Understanding?

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

MotiF: Making Text Count in Image Animation with Motion Focal Loss.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Motion Prompting: Controlling Video Generation with Motion Trajectories.

Tatiana Lopez-Guevara

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2024

Motion Prompting: Controlling Video Generation with Motion Trajectories.

Tatiana Lopez-Guevara

CoRR, 2024

$100K or 100 Days: Trade-offs when Pre-Training with Academic Resources.

CoRR, 2024

Fourier Head: Helping Large Language Models Learn Complex Probability Distributions.

CoRR, 2024

Do Pre-trained Vision-Language Models Encode Object States?

CoRR, 2024

Object-centric Video Representation for Long-term Action Anticipation.

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024

Text-Aware Diffusion for Policy Learning.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

Do Music Generation Models Encode Music Theory?

Proceedings of the 25th International Society for Music Information Retrieval Conference, 2024

Potential Based Diffusion Motion Planning.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

Self-Correcting Self-Consuming Loops for Generative Model Training.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?

Proceedings of the Twelfth International Conference on Learning Representations, 2024

Vamos: Versatile Action Models for Video Understanding.

Proceedings of the Computer Vision - ECCV 2024, 2024

Pixel Aligned Language Models.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

End-to-End Spatio-Temporal Action Localisation with Video Transformers.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2023

Do Vision-Language Pretrained Models Learn Composable Primitive Concepts?

Trans. Mach. Learn. Res., 2023

Towards A Unified Neural Architecture for Visual Recognition and Reasoning.

CoRR, 2023

AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?

CoRR, 2023

Goal-Conditioned Predictive Coding as an Implicit Planner for Offline Reinforcement Learning.

CoRR, 2023

AVIS: Autonomous Visual Information Seeking with Large Language Models.

CoRR, 2023

Comparing Trajectory and Vision Modalities for Verb Representation.

Dylan Ebert

CoRR, 2023

Steerable Equivariant Representation Learning.

CoRR, 2023

Goal-Conditioned Predictive Coding for Offline Reinforcement Learning.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

AVIS: Autonomous Visual Information Seeking with Large Language Model Agent.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Does Visual Pretraining Help End-to-End Reasoning?

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Emergence of Abstract State Representations in Embodied Sequence Modeling.

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

Analyzing Modular Approaches for Visual Question Decomposition.

Apoorv Khandelwal

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

How can objects help action recognition?

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Reveal: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

2022

TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency.

CoRR, 2022

Beyond Transfer Learning: Co-finetuning for Action Localisation.

CoRR, 2022

Do Vision-Language Pretrained Models Learn Primitive Concepts?

CoRR, 2022

Masking Modalities for Cross-modal Video Retrieval.

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022

Do Trajectories Encode Verb Meaning?

Dylan Ebert

Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022

AVATAR: Unconstrained Audiovisual Speech Recognition.

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

TL;DW? Summarizing Instructional Videos with Task Relevance and Cross-Modal Saliency.

Proceedings of the Computer Vision - ECCV 2022, 2022

Learning Audio-Video Modalities from Image Captions.

Proceedings of the Computer Vision - ECCV 2022, 2022

Multiview Transformers for Video Recognition.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2021

Local Metrics for Multi-Object Tracking.

Cristian Sminchisescu

Cordelia Schmid

CoRR, 2021

Attention Bottlenecks for Multimodal Fusion.

Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

Episodic Transformer for Vision-and-Language Navigation.

Alexander Pashevich

Cordelia Schmid

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

DenseTNT: End-to-end Trajectory Prediction from Dense Goal Sets.

Junru Gu

Hang Zhao

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Learning Temporal Dynamics from Cycles in Narrated Video.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Unified Graph Structured Models for Video Understanding.

Anurag Arnab

Cordelia Schmid

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

ViViT: A Video Vision Transformer.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Composable Augmentation Encoding for Video Representation Learning.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Does Vision-and-Language Pretraining Improve Lexical Grounding?

Tian Yun

Krishnamurthy Viswanathan

Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, 2021

HDMapGen: A Hierarchical Graph Generative Model of High Definition Maps.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

2020

The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020).

CoRR, 2020

Learning Video Representations from Textual Web Supervision.

CoRR, 2020

D3D: Distilled 3D Networks for Video Action Recognition.

Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2020

What Makes for Good Views for Contrastive Learning?

Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, 2020

Multi-modal Transformer for Video Retrieval.

Proceedings of the Computer Vision - ECCV 2020, 2020

Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos.

Proceedings of the Computer Vision - ECCV 2020, 2020

Speech2Action: Cross-Modal Supervision for Action Recognition.

Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

VectorNet: Encoding HD Maps and Agent Dynamics From Vectorized Representation.

Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

TNT: Target-driven Trajectory Prediction.

Balakrishnan Varadarajan

Proceedings of the 4th Conference on Robot Learning, 2020

2019

Contrastive Bidirectional Transformer for Temporal Representation Learning.

CoRR, 2019

Unsupervised learning of object structure and dynamics from videos.

Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, 2019

Unsupervised Discovery of Parts, Structure, and Dynamics.

Proceedings of the 7th International Conference on Learning Representations, 2019

Stochastic Prediction of Multi-Agent Interactions from Partial Observations.

Proceedings of the 7th International Conference on Learning Representations, 2019

VideoBERT: A Joint Model for Video and Language Representation Learning.

Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019

Composing Text and Image for Image Retrieval - an Empirical Odyssey.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

Relational Action Forecasting.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

Inferring Context from Pixels for Multimodal Image Classification.

Manan Shah

Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019

2018

DiscrimNet: Semi-Supervised Action Recognition from Videos using Generative Adversarial Networks.

Unaiza Ahsan

Sudheendra Vijayanarasimhan

Irfan A. Essa

CoRR, 2018

Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification.

Proceedings of the Computer Vision - ECCV 2018, 2018

Actor-Centric Relation Network.

Proceedings of the Computer Vision - ECCV 2018, 2018

The iNaturalist Species Classification and Detection Dataset.

Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018

AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions.

Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018

Large Scale Fine-Grained Categorization and Domain-Specific Transfer Learning.

Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018

2017

Rethinking Spatiotemporal Feature Learning For Video Understanding.

CoRR, 2017

AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions.

Chunhui Gu

Sudheendra Vijayanarasimhan

CoRR, 2017

Complex Event Recognition from Images with Few Training Examples.

Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision, 2017

Revisiting Unreasonable Effectiveness of Data in Deep Learning Era.

Proceedings of the IEEE International Conference on Computer Vision, 2017

TURN TAP: Temporal Unit Regression Network for Temporal Action Proposals.

Proceedings of the IEEE International Conference on Computer Vision, 2017

TALL: Temporal Activity Localization via Language Query.

Proceedings of the IEEE International Conference on Computer Vision, 2017

VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation.

Proceedings of the IEEE International Conference on Computer Vision, 2017

Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors.

Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017

DECK: Discovering Event Composition Knowledge from Web Images for Zero-Shot Event Detection and Recounting in Videos.

Chuang Gan

Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017

2016

ACD: Action Concept Discovery from Image-Sentence Corpora.

Jiyang Gao

Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, 2016

Webly-Supervised Video Recognition by Mutually Voting for Relevant Web Images and Web Video Frames.

Proceedings of the Computer Vision - ECCV 2016, 2016

ProNet: Learning to Propose Object-Specific Boxes for Cascaded Neural Networks.

Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016

2015

Temporal Localization of Fine-Grained Actions in Videos by Domain Transfer from Web Images.

Proceedings of the 23rd Annual ACM Conference on Multimedia Conference, MM '15, Brisbane, Australia, October 26, 2015

Automatic Concept Discovery from Parallel Text and Visual Corpora.

Chuang Gan

Proceedings of the 2015 IEEE International Conference on Computer Vision, 2015

2014

Evaluating Multimedia Features and Fusion for Example-Based Event Detection.

Koen E. A. van de Sande

Arnold W. M. Smeulders

Proceedings of the Fusion in Computer Vision - Understanding Complex Visual Content, 2014

Evaluating multimedia features and fusion for example-based event detection.

Koen E. A. van de Sande

Arnold W. M. Smeulders

Cees G. M. Snoek

Mach. Vis. Appl., 2014

ISOMER: Informative Segment Observations for Multimedia Event Recounting.

Proceedings of the International Conference on Multimedia Retrieval, 2014

Late fusion and calibration for multimedia event detection using few examples.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2014

Semantic Aware Video Transcription Using Random Forest Classifiers.

Proceedings of the Computer Vision - ECCV 2014, 2014

DISCOVER: Discovering Important Segments for Classification of Video Events and Recounting.

Ramakant Nevatia

Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014

2013

Large-scale web video event classification by use of Fisher Vectors.

Proceedings of the 2013 IEEE Workshop on Applications of Computer Vision, 2013

The 2013 SESAME Multimedia Event Detection and Recounting System.

Proceedings of the 2013 TREC Video Retrieval Evaluation, 2013

ACTIVE: Activity Concept Transitions in Video Event Classification.
