Difei Gao

ORCID: 0000-0001-8494-3492

According to our database, Difei Gao authored at least 58 papers between 2015 and 2026.

Collaborative distances:

Dijkstra number of four.
Erdős number of four.

Bibliography

2026

CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing.

CoRR, May, 2026

ShowUI-Aloha: Human-Taught GUI Agent.

CoRR, January, 2026

EmoAgent: A Multi-Agent Framework for Diverse Affective Image Manipulation.

IEEE Trans. Affect. Comput., 2026

2025

AUTO-Explorer: Automated Data Collection for GUI Agent.

Xiangwu Guo

Difei Gao

Mike Zheng Shou

CoRR, November, 2025

Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation.

Int. J. Comput. Vis., April, 2025

EmoAgent: Multi-Agent Collaboration of Plan, Edit, and Critic, for Affective Image Manipulation.

CoRR, March, 2025

WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation.

Henry Hengyuan Zhao

Difei Gao

Mike Zheng Shou

CoRR, February, 2025

HOVER: Hyperbolic Video-Text Retrieval.

IEEE Trans. Image Process., 2025

GUI-Narrator: Detecting and Captioning Computer GUI Actions.

Proceedings of the 33rd ACM International Conference on Multimedia, 2025

Can I Trust You? Advancing GUI Task Automation with Action Trust Score.

Proceedings of the 33rd ACM International Conference on Multimedia, 2025

Grounding Multimodal Large Language Model in GUI World.

Weixian Lei

Difei Gao

Mike Zheng Shou

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Factorized Learning for Temporally Grounded Video-Language Models.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

ShowUI: One Vision-Language-Action Model for GUI Visual Agent.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2024

Event Graph Guided Compositional Spatial-Temporal Reasoning for Video Question Answering.

IEEE Trans. Image Process., 2024

ShowUI: One Vision-Language-Action Model for GUI Visual Agent.

CoRR, 2024

The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use.

CoRR, 2024

GUI Action Narrator: Where and When Did That Action Take Place?

CoRR, 2024

LOVA3: Learning to Visual Question Answering, Asking and Assessment.

CoRR, 2024

LOVA3: Learning to Visual Question Answering, Asking and Assessment.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

VideoGUI: A Benchmark for GUI Automation from Instructional Videos.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

AssistEditor: Multi-Agent Collaboration for GUI Workflow Automation in Video Creation.

Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Delocate: Detection and Localization for Deepfake Videos with Randomly-Located Tampered Traces.

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024

AssistGPT: Towards Multi-modal Agent for Human-Centric AI Assistant.

Proceedings of the 5th International Workshop on Human-centric Multimedia Analysis, 2024

Learning Video Context as Interleaved Multimodal Sequences.

Proceedings of the Computer Vision - ECCV 2024, 2024

VIT-LENS: Towards Omni-modal Representations.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

AssistGUI: Task-Oriented PC Graphical User Interface Automation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

VideoLLM-online: Online Video Large Language Model for Streaming Video.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2023

CRIC: A VQA Dataset for Compositional Reasoning on Vision and Commonsense.

IEEE Trans. Pattern Anal. Mach. Intell., May, 2023

ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation.

CoRR, 2023

ViT-Lens-2: Gateway to Omni-modal Intelligence.

CoRR, 2023

CVPR 2023 Text Guided Video Editing Competition.

CoRR, 2023

Recap: Detecting Deepfake Video with Unpredictable Tampered Traces via Recovering Faces and Mapping Recovered Faces.

CoRR, 2023

GroundNLQ @ Ego4D Natural Language Queries Challenge 2023.

CoRR, 2023

AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn.

CoRR, 2023

Mover: Mask and Recovery based Facial Part Consistency Aware Method for Deepfake Video Detection.

CoRR, 2023

DeepfakeMAE: Facial Part Consistency Aware Masked Autoencoder for Deepfake Video Detection.

CoRR, 2023

Learning to Learn: How to Continuously Teach Humans and Machines.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

UniVTG: Towards Unified Video-Language Temporal Grounding.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

GazeVQA: A Video Question Answering Dataset for Multiview Eye-Gaze Task-Oriented Collaborations.

Muhammet Furkan Ilaslan

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Affordance Grounding from Demonstration Video to Target Image.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding.

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task.

Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

2022

An Efficient COarse-to-fiNE Alignment Framework @ Ego4D Natural Language Queries Challenge 2022.

CoRR, 2022

Egocentric Video-Language Pretraining @ Ego4D Challenge 2022.

CoRR, 2022

Egocentric Video-Language Pretraining.

CoRR, 2022

GEB+: A benchmark for generic event boundary captioning, grounding and text-based retrieval.

CoRR, 2022

Egocentric Video-Language Pretraining.

Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant.

Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, 2022

AssistQ: Affordance-Centric Question-Driven Task Completion for Egocentric Assistant.

Proceedings of the Computer Vision - ECCV 2022, 2022

GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval.

Proceedings of the Computer Vision - ECCV 2022, 2022

2021

AssistSR: Affordance-centric Question-driven Video Segment Retrieval.

CoRR, 2021

Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

2020

Learning to Recognize Visual Concepts for Visual Question Answering With Structural Label Space.

IEEE J. Sel. Top. Signal Process., 2020

Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text.

Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

2019

From Two Graphs to N Questions: A VQA Dataset for Compositional Reasoning on Vision and Commonsense.

CoRR, 2019

2017

Visual Textbook Network: Watch Carefully before Answering Visual Questions.

Proceedings of the British Machine Vision Conference 2017, 2017

2015

Correlated warped Gaussian processes for gender-specific age estimation.

Proceedings of the 2015 IEEE International Conference on Image Processing, 2015
