Meng Cao

Orcid: 0000-0002-8946-4228

Affiliations:
  • International Digital Economy Academy (IDEA), China
  • Peking University, School of Electronic and Computer Engineering, Shenzhen, China (PhD 2023)


According to our database, Meng Cao authored at least 41 papers between 2019 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2025
ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding.
CoRR, May, 2025

Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning.
CoRR, May, 2025

Cross-Modal Conditioned Reconstruction for Language-Guided Medical Image Segmentation.
IEEE Trans. Medical Imaging, April, 2025

BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese.
CoRR, April, 2025

IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs.
CoRR, April, 2025

Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models.
CoRR, March, 2025

When Large Vision Language Models Meet Multimodal Sequential Recommendation: An Empirical Study.
Proceedings of the ACM on Web Conference 2025, 2025

2024
Improving Reference-Based Distinctive Image Captioning with Contrastive Rewards.
ACM Trans. Multim. Comput. Commun. Appl., December, 2024

Visual Grounding With Dual Knowledge Distillation.
IEEE Trans. Circuits Syst. Video Technol., October, 2024

PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos.
CoRR, 2024

Continual LLaVA: Continual Instruction Tuning in Large Vision-Language Models.
CoRR, 2024

ING-VP: MLLMs cannot Play Easy Vision-based Games Yet.
CoRR, 2024

MUSE: Mamba is Efficient Multi-scale Learner for Text-video Retrieval.
CoRR, 2024

Vision-Language Models Meet Meteorology: Developing Models for Extreme Weather Events Detection with Heatmaps.
CoRR, 2024

How to Continually Adapt Text-to-Image Diffusion Models for Flexible Customization?
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter.
Proceedings of the Findings of the Association for Computational Linguistics, 2024

Exploiting Auxiliary Caption for Video Grounding.
Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

2023
Concept-Aware Video Captioning: Describing Videos With Effective Prior Information.
IEEE Trans. Image Process., 2023

Exploring Recommendation Capabilities of GPT-4V(ision): A Preliminary Case Study.
CoRR, 2023

Video Referring Expression Comprehension via Transformer with Content-conditioned Query.
CoRR, 2023

Generating Templated Caption for Video Grounding.
CoRR, 2023

Video Referring Expression Comprehension via Transformer with Content-conditioned Query.
Proceedings of the 1st International Workshop on Deep Multimodal Learning for Information Retrieval, 2023

G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Iterative Proposal Refinement for Weakly-Supervised Video Grounding.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

2022
Deep Motion Prior for Weakly-Supervised Temporal Action Localization.
IEEE Trans. Image Process., 2022

RR-Net: Relation Reasoning for End-to-End Human-Object Interaction Detection.
IEEE Trans. Circuits Syst. Video Technol., 2022

All You Need Is a Second Look: Towards Arbitrary-Shaped Text Detection.
IEEE Trans. Circuits Syst. Video Technol., 2022

Video Referring Expression Comprehension via Transformer with Content-aware Query.
CoRR, 2022

Correspondence Matters for Video Referring Expression Comprehension.
Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022

Visual Relation-Aware Unsupervised Video Captioning.
Proceedings of the Artificial Neural Networks and Machine Learning - ICANN 2022, 2022

LocVTP: Video-Text Pre-training for Temporal Localization.
Proceedings of the Computer Vision - ECCV 2022, 2022

Unsupervised Pre-training for Temporal Action Localization Tasks.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2021
UniFaceGAN: A Unified Framework for Temporally Consistent Facial Video Editing.
IEEE Trans. Image Process., 2021

Synergic learning for noise-insensitive webly-supervised temporal action localization.
Image Vis. Comput., 2021

RR-Net: Injecting Interactive Semantics in Human-Object Interaction Detection.
Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, 2021

On Pursuit of Designing Multi-modal Transformer for Video Grounding.
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021

CoLA: Weakly-Supervised Temporal Action Localization With Snippet Contrastive Learning.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

2020
Task-agnostic Temporally Consistent Facial Video Editing.
CoRR, 2020

Weakly Labelled Audio Tagging Via Convolutional Networks with Spatial and Channel-Wise Attention.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

All You Need is a Second Look: Towards Tighter Arbitrary Shape Text Detection.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

2019
GISCA: Gradient-Inductive Segmentation Network With Contextual Attention for Scene Text Detection.
IEEE Access, 2019

