Difei Gao

Orcid: 0000-0001-8494-3492

According to our database, Difei Gao authored at least 56 papers between 2015 and 2026.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2026
ShowUI-Aloha: Human-Taught GUI Agent.
CoRR, January, 2026

2025
AUTO-Explorer: Automated Data Collection for GUI Agent.
CoRR, November, 2025

Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation.
Int. J. Comput. Vis., April, 2025

EmoAgent: Multi-Agent Collaboration of Plan, Edit, and Critic, for Affective Image Manipulation.
CoRR, March, 2025

WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation.
CoRR, February, 2025

HOVER: Hyperbolic Video-Text Retrieval.
IEEE Trans. Image Process., 2025

GUI-Narrator: Detecting and Captioning Computer GUI Actions.
Proceedings of the 33rd ACM International Conference on Multimedia, 2025

Can I Trust You? Advancing GUI Task Automation with Action Trust Score.
Proceedings of the 33rd ACM International Conference on Multimedia, 2025

Grounding Multimodal Large Language Model in GUI World.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Factorized Learning for Temporally Grounded Video-Language Models.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

ShowUI: One Vision-Language-Action Model for GUI Visual Agent.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2024
Event Graph Guided Compositional Spatial-Temporal Reasoning for Video Question Answering.
IEEE Trans. Image Process., 2024

ShowUI: One Vision-Language-Action Model for GUI Visual Agent.
CoRR, 2024

The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use.
CoRR, 2024

GUI Action Narrator: Where and When Did That Action Take Place?
CoRR, 2024

LOVA3: Learning to Visual Question Answering, Asking and Assessment.
CoRR, 2024

LOVA3: Learning to Visual Question Answering, Asking and Assessment.
Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

VideoGUI: A Benchmark for GUI Automation from Instructional Videos.
Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

AssistEditor: Multi-Agent Collaboration for GUI Workflow Automation in Video Creation.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Delocate: Detection and Localization for Deepfake Videos with Randomly-Located Tampered Traces.
Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024

AssistGPT: Towards Multi-modal Agent for Human-Centric AI Assistant.
Proceedings of the 5th International Workshop on Human-centric Multimedia Analysis, 2024

Learning Video Context as Interleaved Multimodal Sequences.
Proceedings of the Computer Vision - ECCV 2024, 2024

VIT-LENS: Towards Omni-modal Representations.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

AssistGUI: Task-Oriented PC Graphical User Interface Automation.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

VideoLLM-online: Online Video Large Language Model for Streaming Video.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2023
CRIC: A VQA Dataset for Compositional Reasoning on Vision and Commonsense.
IEEE Trans. Pattern Anal. Mach. Intell., May, 2023

ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation.
CoRR, 2023

ViT-Lens-2: Gateway to Omni-modal Intelligence.
CoRR, 2023

CVPR 2023 Text Guided Video Editing Competition.
CoRR, 2023

Recap: Detecting Deepfake Video with Unpredictable Tampered Traces via Recovering Faces and Mapping Recovered Faces.
CoRR, 2023

GroundNLQ @ Ego4D Natural Language Queries Challenge 2023.
CoRR, 2023

AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn.
CoRR, 2023

Mover: Mask and Recovery based Facial Part Consistency Aware Method for Deepfake Video Detection.
CoRR, 2023

DeepfakeMAE: Facial Part Consistency Aware Masked Autoencoder for Deepfake Video Detection.
CoRR, 2023

Learning to Learn: How to Continuously Teach Humans and Machines.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

UniVTG: Towards Unified Video-Language Temporal Grounding.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

GazeVQA: A Video Question Answering Dataset for Multiview Eye-Gaze Task-Oriented Collaborations.
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Affordance Grounding from Demonstration Video to Target Image.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task.
Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

2022
An Efficient COarse-to-fiNE Alignment Framework @ Ego4D Natural Language Queries Challenge 2022.
CoRR, 2022

Egocentric Video-Language Pretraining @ Ego4D Challenge 2022.
CoRR, 2022

Egocentric Video-Language Pretraining.
CoRR, 2022

GEB+: A benchmark for generic event boundary captioning, grounding and text-based retrieval.
CoRR, 2022

Egocentric Video-Language Pretraining.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, 2022

AssistQ: Affordance-Centric Question-Driven Task Completion for Egocentric Assistant.
Proceedings of the Computer Vision - ECCV 2022, 2022

GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval.
Proceedings of the Computer Vision - ECCV 2022, 2022

2021
AssistSR: Affordance-centric Question-driven Video Segment Retrieval.
CoRR, 2021

Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments.
Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

2020
Learning to Recognize Visual Concepts for Visual Question Answering With Structural Label Space.
IEEE J. Sel. Top. Signal Process., 2020

Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text.
Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

2019
From Two Graphs to N Questions: A VQA Dataset for Compositional Reasoning on Vision and Commonsense.
CoRR, 2019

2017
Visual Textbook Network: Watch Carefully before Answering Visual Questions.
Proceedings of the British Machine Vision Conference 2017, 2017

2015
Correlated warped Gaussian processes for gender-specific age estimation.
Proceedings of the 2015 IEEE International Conference on Image Processing, 2015

