Meng Cao

ORCID: 0000-0002-8946-4228

Affiliations:
  • Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
  • Peking University, School of Electronic and Computer Engineering, Shenzhen, China (PhD 2023)


According to our database, Meng Cao authored at least 64 papers between 2019 and 2026.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2026
A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model.
CoRR, April, 2026

ManipArena: Comprehensive Real-world Evaluation of Reasoning-Oriented Generalist Robot Manipulation.
CoRR, March, 2026

COLT: Enhancing Video Large Language Models with Continual Tool Usage.
Trans. Mach. Learn. Res., 2026

Order from Chaos: Physical World Understanding from Glitchy Gameplay Videos.
Trans. Mach. Learn. Res., 2026

Video Spatial Reasoning with Object-Centric 3D Rollout.
Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, 2026

Bring Your Dreams to Life: Continual Text-to-Video Customization.
Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, 2026

Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models.
Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, 2026

2025
CARE What Fails: Contrastive Anchored-REflection for Verifiable Multimodal.
CoRR, December, 2025

GLaD: Geometric Latent Distillation for Vision-Language-Action Models.
CoRR, December, 2025

SpatialDreamer: Incentivizing Spatial Reasoning via Active Mental Imagery.
CoRR, December, 2025

Seeing through Imagination: Learning Scene Geometry via Implicit Spatial World Modeling.
CoRR, December, 2025

Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding.
CoRR, December, 2025

C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning.
CoRR, July, 2025

ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding.
CoRR, May, 2025

Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning.
CoRR, May, 2025

Cross-Modal Conditioned Reconstruction for Language-Guided Medical Image Segmentation.
IEEE Trans. Medical Imaging, April, 2025

BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese.
CoRR, April, 2025

IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs.
CoRR, April, 2025

A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation.
CoRR, April, 2025

Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models.
CoRR, March, 2025

TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba.
CoRR, February, 2025

ChineseSimpleVQA - "See the World, Discover Knowledge": A Chinese Factuality Evaluation for Large Vision Language Models.
CoRR, February, 2025

When Large Vision Language Models Meet Multimodal Sequential Recommendation: An Empirical Study.
Proceedings of the ACM on Web Conference 2025, 2025

PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, 2025

ClimateIQA: A New Dataset and Benchmark to Advance Vision-Language Models in Meteorology Anomalies Analysis.
Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, V.2, 2025

Evagaussians: Event Stream Assisted Gaussian Splatting from Blurry Images.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

Token Preference Optimization with Self-Calibrated Visual-Anchored Rewards for Hallucination Mitigation.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2025, 2025

See the World, Discover Knowledge: A Chinese Factuality Evaluation for Large Vision Language Models.
Proceedings of the Findings of the Association for Computational Linguistics, 2025

MUSE: Mamba Is Efficient Multi-scale Learner for Text-video Retrieval.
Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence, 2025

2024
Improving Reference-Based Distinctive Image Captioning with Contrastive Rewards.
ACM Trans. Multim. Comput. Commun. Appl., December, 2024

Visual Grounding With Dual Knowledge Distillation.
IEEE Trans. Circuits Syst. Video Technol., October, 2024

EACO: Enhancing Alignment in Multimodal LLMs via Critical Observation.
CoRR, 2024

PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos.
CoRR, 2024

Continual LLaVA: Continual Instruction Tuning in Large Vision-Language Models.
CoRR, 2024

ING-VP: MLLMs cannot Play Easy Vision-based Games Yet.
CoRR, 2024

Vision-Language Models Meet Meteorology: Developing Models for Extreme Weather Events Detection with Heatmaps.
CoRR, 2024

How to Continually Adapt Text-to-Image Diffusion Models for Flexible Customization?
Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter.
Proceedings of the Findings of the Association for Computational Linguistics, 2024

Exploiting Auxiliary Caption for Video Grounding.
Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

2023
Concept-Aware Video Captioning: Describing Videos With Effective Prior Information.
IEEE Trans. Image Process., 2023

Exploring Recommendation Capabilities of GPT-4V(ision): A Preliminary Case Study.
CoRR, 2023

Video Referring Expression Comprehension via Transformer with Content-conditioned Query.
CoRR, 2023

Generating Templated Caption for Video Grounding.
CoRR, 2023

Video Referring Expression Comprehension via Transformer with Content-conditioned Query.
Proceedings of the 1st International Workshop on Deep Multimodal Learning for Information Retrieval, 2023

G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Iterative Proposal Refinement for Weakly-Supervised Video Grounding.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

2022
Deep Motion Prior for Weakly-Supervised Temporal Action Localization.
IEEE Trans. Image Process., 2022

RR-Net: Relation Reasoning for End-to-End Human-Object Interaction Detection.
IEEE Trans. Circuits Syst. Video Technol., 2022

All You Need Is a Second Look: Towards Arbitrary-Shaped Text Detection.
IEEE Trans. Circuits Syst. Video Technol., 2022

Video Referring Expression Comprehension via Transformer with Content-aware Query.
CoRR, 2022

Correspondence Matters for Video Referring Expression Comprehension.
Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022

Visual Relation-Aware Unsupervised Video Captioning.
Proceedings of the Artificial Neural Networks and Machine Learning - ICANN 2022, 2022

LocVTP: Video-Text Pre-training for Temporal Localization.
Proceedings of the Computer Vision - ECCV 2022, 2022

Unsupervised Pre-training for Temporal Action Localization Tasks.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2021
UniFaceGAN: A Unified Framework for Temporally Consistent Facial Video Editing.
IEEE Trans. Image Process., 2021

Synergic learning for noise-insensitive webly-supervised temporal action localization.
Image Vis. Comput., 2021

RR-Net: Injecting Interactive Semantics in Human-Object Interaction Detection.
Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, 2021

On Pursuit of Designing Multi-modal Transformer for Video Grounding.
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021

CoLA: Weakly-Supervised Temporal Action Localization With Snippet Contrastive Learning.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

2020
Task-agnostic Temporally Consistent Facial Video Editing.
CoRR, 2020

Weakly Labelled Audio Tagging Via Convolutional Networks with Spatial and Channel-Wise Attention.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

All You Need is a Second Look: Towards Tighter Arbitrary Shape Text Detection.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

2019
GISCA: Gradient-Inductive Segmentation Network With Contextual Attention for Scene Text Detection.
IEEE Access, 2019

