Qing Li

ORCID: 0000-0003-1185-5365

Affiliations:

Beijing Institute for General Artificial Intelligence (BIGAI), National Key Laboratory of General Artificial Intelligence, Beijing, China
University of California, Los Angeles, CA, USA (former)
University of Science and Technology of China, Hefei, China (former)

According to our database, Qing Li authored at least 56 papers between 2016 and 2025.

Collaborative distances:

Dijkstra number of four.
Erdős number of four.

Bibliography

2025

GUI Knowledge Bench: Revealing the Knowledge Gap Behind VLM Failures in GUI Tasks.

CoRR, October, 2025

KORE: Enhancing Knowledge Injection for Large Multimodal Models via Knowledge-Oriented Augmentations and Constraints.

CoRR, October, 2025

NEP: Autoregressive Image Editing via Next Editing Token Prediction.

CoRR, August, 2025

Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation.

CoRR, July, 2025

LEO-VL: Towards 3D Vision-Language Generalists via Data Scaling with Efficient Representation.

CoRR, June, 2025

From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes.

CoRR, June, 2025

When Large Multimodal Models Confront Evolving Knowledge: Challenges and Pathways.

CoRR, May, 2025

Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL.

CoRR, May, 2025

FlowDreamer: A RGB-D World Model with Flow-based Motion Representations for Robot Manipulation.

CoRR, May, 2025

Iterative Trajectory Exploration for Multimodal Agents.

CoRR, April, 2025

TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials.

CoRR, April, 2025

Building LLM Agents by Incorporating Insights from Computer Systems.

CoRR, April, 2025

LongViTU: Instruction Tuning for Long-Form Video Understanding.

CoRR, January, 2025

Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding.

CoRR, January, 2025

The AI Hippocampus: How Far are We From Human Memory?

Trans. Mach. Learn. Res., 2025

SYNERGAI: Perception Alignment for Human-Robot Collaboration.

Proceedings of the IEEE International Conference on Robotics and Automation, 2025

Falcon: Fast Visuomotor Policies via Partial Denoising.

Proceedings of the Forty-second International Conference on Machine Learning, 2025

Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

MMKE-Bench: A Multimodal Editing Benchmark for Diverse Visual Knowledge.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

METASCENES: Towards Automated Replica Creation for Real-world 3D Scans.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2024

Task-oriented Sequential Grounding in 3D Scenes.

CoRR, 2024

Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting.

CoRR, 2024

INSIGHT: End-to-End Neuro-Symbolic Visual Reinforcement Learning with Language Explanations.

CoRR, 2024

VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding.

CoRR, 2024

Parameter-Efficient Fine-Tuning for Pre-Trained Vision Models: A Survey.

CoRR, 2024

UltraEdit: Instruction-based Fine-Grained Image Editing at Scale.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

End-to-End Neuro-Symbolic Reinforcement Learning with Textual Explanations.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

An Embodied Generalist Agent in 3D World.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

Neural-Symbolic Recursive Machine for Systematic Generalization.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

Unifying 3D Vision-Language Understanding via Promptable Queries.

Proceedings of the Computer Vision - ECCV 2024, 2024

SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding.

Proceedings of the Computer Vision - ECCV 2024, 2024

VideoAgent: A Memory-Augmented Multimodal Agent for Video Understanding.

Proceedings of the Computer Vision - ECCV 2024, 2024

CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2023

Learning non-Markovian Decision-Making from State-only Sequences.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

SQA3D: Situated Question Answering in 3D Scenes.

Proceedings of the Eleventh International Conference on Learning Representations, 2023

A Minimalist Dataset for Systematic Generalization of Perception, Syntax, and Semantics.

Proceedings of the Eleventh International Conference on Learning Representations, 2023

3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

2022

Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation.

CoRR, 2022

2021

A HINT from Arithmetic: On Systematic Generalization of Perception, Syntax, and Semantics.

CoRR, 2021

VLGrammar: Grounded Grammar Induction of Vision and Language.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

YouRefIt: Embodied Reference Understanding with Language and Gesture.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

SMART: A Situation Model for Algebra Story Problems via Attributed Grammar.

Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021

Learning by Fixing: Solving Math Word Problems with Weak Supervision.

Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021

2020

Closed Loop Neural-Symbolic Learning via Integrating Neural Perception, Grammar Parsing, and Symbolic Reasoning.

Proceedings of the 37th International Conference on Machine Learning, 2020

A Competence-Aware Curriculum for Visual Concepts Learning via Question Answering.

Proceedings of the Computer Vision - ECCV 2020, 2020

2019

Why Does a Visual Question Have Different Answers?

Nilavra Bhattacharya, Qing Li, Danna Gurari

Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019

VizWiz-Priv: A Dataset for Recognizing the Presence and Purpose of Private Visual Information in Images Taken by Blind People.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

2018

Tell-and-Answer: Towards Explainable Visual Question Answering using Attributes and Captions.

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31, 2018

VQA-E: Explaining, Elaborating, and Enhancing Your Answers for Visual Questions.

Proceedings of the Computer Vision - ECCV 2018, 2018

VizWiz Grand Challenge: Answering Visual Questions From Blind People.

Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018

2017

Learning hierarchical video representation for action recognition.

Int. J. Multim. Inf. Retr., 2017

2016

Action Recognition by Learning Deep Multi-Granular Spatio-Temporal Video Representation.

Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, 2016
