Meng Cao

Orcid: 0000-0002-8946-4228

Affiliations:

Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
Peking University, School of Electronic and Computer Engineering, Shenzhen, China (PhD 2023)

According to our database¹, Meng Cao authored at least 64 papers between 2019 and 2026.

Collaborative distances:

Dijkstra number² of four.
Erdős number³ of four.

Bibliography

2026

A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model.

CoRR, April, 2026

ManipArena: Comprehensive Real-world Evaluation of Reasoning-Oriented Generalist Robot Manipulation.

CoRR, March, 2026

COLT: Enhancing Video Large Language Models with Continual Tool Usage.

Trans. Mach. Learn. Res., 2026

Order from Chaos: Physical World Understanding from Glitchy Gameplay Videos.

Trans. Mach. Learn. Res., 2026

Video Spatial Reasoning with Object-Centric 3D Rollout.

Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, 2026

Bring Your Dreams to Life: Continual Text-to-Video Customization.

Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, 2026

Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models.

Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, 2026

2025

CARE What Fails: Contrastive Anchored-REflection for Verifiable Multimodal.

CoRR, December, 2025

GLaD: Geometric Latent Distillation for Vision-Language-Action Models.

CoRR, December, 2025

SpatialDreamer: Incentivizing Spatial Reasoning via Active Mental Imagery.

CoRR, December, 2025

Seeing through Imagination: Learning Scene Geometry via Implicit Spatial World Modeling.

CoRR, December, 2025

Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding.

CoRR, December, 2025

C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning.

CoRR, July, 2025

ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding.

CoRR, May, 2025

Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning.

CoRR, May, 2025

Cross-Modal Conditioned Reconstruction for Language-Guided Medical Image Segmentation.

IEEE Trans. Medical Imaging, April, 2025

BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese.

CoRR, April, 2025

IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs.

CoRR, April, 2025

A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation.

CoRR, April, 2025

Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models.

CoRR, March, 2025

TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba.

CoRR, February, 2025

ChineseSimpleVQA - "See the World, Discover Knowledge": A Chinese Factuality Evaluation for Large Vision Language Models.

CoRR, February, 2025

When Large Vision Language Models Meet Multimodal Sequential Recommendation: An Empirical Study.

Raymond Chi-Wing Wong

Sunghun Kim

Proceedings of the ACM on Web Conference 2025, 2025

PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, 2025

ClimateIQA: A New Dataset and Benchmark to Advance Vision-Language Models in Meteorology Anomalies Analysis.

Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, V.2, 2025

Evagaussians: Event Stream Assisted Gaussian Splatting from Blurry Images.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

$A_{0}$: An Affordance-Aware Hierarchical Model for General Robotic Manipulation.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

Token Preference Optimization with Self-Calibrated Visual-Anchored Rewards for Hallucination Mitigation.

Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2025, 2025

See the World, Discover Knowledge: A Chinese Factuality Evaluation for Large Vision Language Models.

Proceedings of the Findings of the Association for Computational Linguistics, 2025

MUSE: Mamba Is Efficient Multi-scale Learner for Text-video Retrieval.

Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence, 2025

2024

Improving Reference-Based Distinctive Image Captioning with Contrastive Rewards.

ACM Trans. Multim. Comput. Commun. Appl., December, 2024

Visual Grounding With Dual Knowledge Distillation.

IEEE Trans. Circuits Syst. Video Technol., October, 2024

EACO: Enhancing Alignment in Multimodal LLMs via Critical Observation.

CoRR, 2024

PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos.

CoRR, 2024

Continual LLaVA: Continual Instruction Tuning in Large Vision-Language Models.

CoRR, 2024

ING-VP: MLLMs cannot Play Easy Vision-based Games Yet.

CoRR, 2024

Vision-Language Models Meet Meteorology: Developing Models for Extreme Weather Events Detection with Heatmaps.

CoRR, 2024

How to Continually Adapt Text-to-Image Diffusion Models for Flexible Customization?

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter.

Proceedings of the Findings of the Association for Computational Linguistics, 2024

Exploiting Auxiliary Caption for Video Grounding.

Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

2023

Concept-Aware Video Captioning: Describing Videos With Effective Prior Information.

Bang Yang

Meng Cao

Yuexian Zou

IEEE Trans. Image Process., 2023

Exploring Recommendation Capabilities of GPT-4V(ision): A Preliminary Case Study.

CoRR, 2023

Video Referring Expression Comprehension via Transformer with Content-conditioned Query.

CoRR, 2023

Generating Templated Caption for Video Grounding.

CoRR, 2023

Video Referring Expression Comprehension via Transformer with Content-conditioned Query.

Proceedings of the 1st International Workshop on Deep Multimodal Learning for Information Retrieval, 2023

G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Iterative Proposal Refinement for Weakly-Supervised Video Grounding.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

2022

Deep Motion Prior for Weakly-Supervised Temporal Action Localization.

IEEE Trans. Image Process., 2022

RR-Net: Relation Reasoning for End-to-End Human-Object Interaction Detection.

IEEE Trans. Circuits Syst. Video Technol., 2022

All You Need Is a Second Look: Towards Arbitrary-Shaped Text Detection.

IEEE Trans. Circuits Syst. Video Technol., 2022

Video Referring Expression Comprehension via Transformer with Content-aware Query.

CoRR, 2022

Correspondence Matters for Video Referring Expression Comprehension.

Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022

Visual Relation-Aware Unsupervised Video Captioning.

Puzhao Ji

Meng Cao

Yuexian Zou

Proceedings of the Artificial Neural Networks and Machine Learning - ICANN 2022, 2022

LocVTP: Video-Text Pre-training for Temporal Localization.

Proceedings of the Computer Vision - ECCV 2022, 2022

Unsupervised Pre-training for Temporal Action Localization Tasks.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2021

UniFaceGAN: A Unified Framework for Temporally Consistent Facial Video Editing.

IEEE Trans. Image Process., 2021

Synergic learning for noise-insensitive webly-supervised temporal action localization.

Image Vis. Comput., 2021

RR-Net: Injecting Interactive Semantics in Human-Object Interaction Detection.

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, 2021

On Pursuit of Designing Multi-modal Transformer for Video Grounding.

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021

CoLA: Weakly-Supervised Temporal Action Localization With Snippet Contrastive Learning.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

2020

Task-agnostic Temporally Consistent Facial Video Editing.

CoRR, 2020

Weakly Labelled Audio Tagging Via Convolutional Networks with Spatial and Channel-Wise Attention.

Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

All You Need is a Second Look: Towards Tighter Arbitrary Shape Text Detection.

Meng Cao

Yuexian Zou

Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

2019

GISCA: Gradient-Inductive Segmentation Network With Contextual Attention for Scene Text Detection.

IEEE Access, 2019
