Yale Song

CoRR, April, 2026

PaperOrchestra: A Multi-Agent Framework for Automated AI Research Paper Writing.

CoRR, April, 2026

GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks.

CoRR, March, 2026

VQQA: An Agentic Approach for Video Evaluation and Quality Improvement.

Yiwen Song

Tomas Pfister

CoRR, March, 2026

PaperBanana: Automating Academic Illustration for AI Scientists.

CoRR, January, 2026

Enhancing Visual Planning with Auxiliary Tasks and Multi-token Prediction.

Ce Zhang

Ruta Desai

Michael Louis Iuzzolino

Joseph Tighe

Gedas Bertasius

Satwik Kottur

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2026

2025

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives.

Santhosh Kumar Ramakrishnan

Oluwatumininu Oguntola

Kiran K. Somasundaram

Giovanni Maria Farinella

Int. J. Comput. Vis., December, 2025

Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding.

CoRR, April, 2025

PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding.

CoRR, April, 2025

Enrich and Detect: Video Temporal Grounding with Multimodal LLMs.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

Streaming VideoLLMs for Real-Time Procedural Video Understanding.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

VITED: Video Temporal Evidence Distillation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2024

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives.

Santhosh Kumar Ramakrishnan

Oluwatumininu Oguntola

Kiran K. Somasundaram

Giovanni Maria Farinella

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2023

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives.

Santhosh Kumar Ramakrishnan

et al.

CoRR, 2023

Egocentric Video Task Translation @ Ego4D Challenge 2022.

CoRR, 2023

Scaling Novel Object Detection with Weakly Supervised Detection Transformers.

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023

Ego4D Goal-Step: Toward Hierarchical Understanding of Procedural Activities.

Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Egocentric Video Task Translation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

2022

Video Summarization Overview.

Mayu Otani

Yang Wang

Found. Trends Comput. Graph. Vis., 2022

PatchBlender: A Motion Prior for Video Transformers.

Gabriele Prato

Caio César Teodoro Mendes

Janarthanan Rajendran

R. Devon Hjelm

Neel Joshi

Sarath Chandar

CoRR, 2022

One Network Doesn't Rule Them All: Moving Beyond Handcrafted Architectures in Self-Supervised Learning.

Abhinav Shrivastava

CoRR, 2022

COMPASS: Contrastive Multimodal Pretraining for Autonomous Systems.

Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2022

Visual Attention Emerges from Recurrent Sparse Reconstruction.

Proceedings of the International Conference on Machine Learning, 2022

Anomaly Detection in Time Series with Robust Variational Quasi-Recurrent Autoencoders.

Tung Kieu

Bin Yang

Chenjuan Guo

Razvan-Gabriel Cirstea

Yan Zhao

Christian S. Jensen

Proceedings of the 38th IEEE International Conference on Data Engineering, 2022

Neural-Sim: Learning to Generate Training Data with NeRF.

Proceedings of Computer Vision - ECCV 2022, 2022

Robust Contrastive Learning against Noisy Views.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

CausalCity: Complex Simulations with Agency for Causal Discovery and Reasoning.

Proceedings of the 1st Conference on Causal Learning and Reasoning, 2022

DOC2PPT: Automatic Presentation Slides Generation from Scientific Documents.

Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, 2022

2021

On the Virality of Animated GIFs on Tumblr.

Yunseok Jang

Gunhee Kim

CoRR, 2021

Contrastive Learning of Global and Local Audio-Visual Representations.

CoRR, 2021

Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning.

CoRR, 2021

Contrastive Learning of Global and Local Video Representations.

Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

Self-Supervised Learning of Compressed Video Representations.

Proceedings of the 9th International Conference on Learning Representations, 2021

Active Contrastive Learning of Audio-Visual Video Representations.

Proceedings of the 9th International Conference on Learning Representations, 2021

Parameter Efficient Multimodal Transformers for Video Representation Learning.

Proceedings of the 9th International Conference on Learning Representations, 2021

ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

2020

Learning to Transfer Visual Effects from Videos to Images.

Christopher Thomas

Adriana Kovashka

CoRR, 2020

Learning Audio-Visual Representations with Active Contrastive Coding.

CoRR, 2020

Phans, Stans and Cishets: Self-Presentation Effects on Content Propagation in Tumblr.

Proceedings of WebSci '20: 12th ACM Conference on Web Science, 2020

Image to Video Domain Adaptation Using Web Supervision.

Andrew Kae

Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2020

Multi-Reference Neural TTS Stylization with Adversarial Cycle Consistency.

Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Attention-Based Deep Metric Learning for Near-Duplicate Video Retrieval.

Proceedings of the 25th International Conference on Pattern Recognition, 2020

2019

Video Question Answering with Spatio-Temporal Reasoning.

Int. J. Comput. Vis., 2019

M3D-GAN: Multi-Modal Multi-Domain Translation with Universal Attention.

Shuang Ma

Daniel McDuff

CoRR, 2019

Characterizing Bias in Classifiers using Generative Models.

Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, 2019

Neural TTS Stylization with Adversarial and Collaborative Games.

Shuang Ma

Daniel McDuff

Proceedings of the 7th International Conference on Learning Representations, 2019

Unpaired Image-to-Speech Synthesis With Multimodal Information Bottleneck.

Shuang Ma

Daniel McDuff

Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019

Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval.

Mohammad Soleymani

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

2018

Cross-Modal Retrieval with Implicit Concept Association.

Mohammad Soleymani

CoRR, 2018

Image2GIF: Generating Cinemagraphs Using Recurrent Deep Q-Networks.

Yipin Zhou

Tamara L. Berg

Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision, 2018

Video Prediction with Appearance and Motion Conditions.

Yunseok Jang

Gunhee Kim

Proceedings of the 35th International Conference on Machine Learning, 2018

2017

Learning from Noisy Labels with Distillation.

CoRR, 2017

ElasticPlay: Interactive Video Summarization with Dynamic Time Budgets.

Haojian Jin

Koji Yatani

Proceedings of the 2017 ACM on Multimedia Conference, 2017

Learning from Noisy Labels with Distillation.

Proceedings of the IEEE International Conference on Computer Vision, 2017

Improving Pairwise Ranking for Multi-label Image Classification.

Yuncheng Li

Jiebo Luo

Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017

TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering.

Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017

2016

Real-Time Video Highlights for Yahoo Esports.

CoRR, 2016

Mouse Activity as an Indicator of Interestingness in Video.

Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, 2016

Balancing Appearance and Context in Sketch Interpretation.

Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, 2016

TGIF: A New Dataset and Benchmark on Animated GIF Description.

Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016

Video2GIF: Automatic Generation of Animated GIFs from Video.

Michael Gygli

Liangliang Cao

Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016

To Click or Not To Click: Automatic Selection of Beautiful Thumbnails from Videos.

Proceedings of the 25th ACM International Conference on Information and Knowledge Management, 2016

Fast, Cheap, and Good: Why Animated GIFs Engage Us.

Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, 2016

2015

Continuous Body and Hand Gesture Recognition for Natural Human-Computer Interaction: Extended Abstract.

Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015

Exploiting sparsity and co-occurrence structure for action unit recognition.

Proceedings of the 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, 2015

TVSum: Summarizing web videos using titles.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015

Video co-summarization: Video summarization by visual co-occurrence.

Wen-Sheng Chu

Alejandro Jaimes

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015

2014

Structured video content analysis: learning spatio-temporal and multimodal structures.

PhD thesis, 2014

#FluxFlow: Visual Analysis of Anomalous Information Spreading on Social Media.

IEEE Trans. Vis. Comput. Graph., 2014

2013

One-Class Conditional Random Fields for Sequential Anomaly Detection.

Proceedings of IJCAI 2013, 2013

Learning a sparse codebook of facial and body microexpressions for emotion recognition.

Proceedings of the 2013 International Conference on Multimodal Interaction, 2013

Distribution-sensitive learning for imbalanced datasets.

Proceedings of the 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, 2013

Action Recognition by Hierarchical Sequence Summarization.

Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013

2012

Continuous body and hand gesture recognition for natural human-computer interaction.

David Demirdjian

ACM Trans. Interact. Intell. Syst., 2012

Multimodal human behavior analysis: learning correlation and interaction across modalities.

Proceedings of the International Conference on Multimodal Interaction, 2012

Multi-view latent variable discriminative models for action recognition.

Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012

2011

Tracking body and hands for gesture recognition: NATOPS aircraft handling signals database.

David Demirdjian

Proceedings of the Ninth IEEE International Conference on Automatic Face and Gesture Recognition (FG 2011), 2011

Multi-signal gesture recognition using temporal smoothing hidden conditional random fields.

David Demirdjian