Meng Cao

ORCID: 0000-0002-8946-4228

Affiliations:

International Digital Economy Academy (IDEA), China
Peking University, School of Electronic and Computer Engineering, Shenzhen, China (PhD 2023)

According to our database, Meng Cao authored at least 42 papers between 2019 and 2025.

Collaborative distances:

Dijkstra number of four.
Erdős number of four.


Bibliography

2025

ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding.

CoRR, May, 2025

Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning.

CoRR, May, 2025

Cross-Modal Conditioned Reconstruction for Language-Guided Medical Image Segmentation.

IEEE Trans. Medical Imaging, April, 2025

BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese.

CoRR, April, 2025

IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs.

CoRR, April, 2025

Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models.

CoRR, March, 2025

When Large Vision Language Models Meet Multimodal Sequential Recommendation: An Empirical Study.

Raymond Chi-Wing Wong

Sunghun Kim

Proceedings of the ACM on Web Conference 2025, 2025

ClimateIQA: A New Dataset and Benchmark to Advance Vision-Language Models in Meteorology Anomalies Analysis.

Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, V.2, 2025

2024

Improving Reference-Based Distinctive Image Captioning with Contrastive Rewards.

ACM Trans. Multim. Comput. Commun. Appl., December, 2024

Visual Grounding With Dual Knowledge Distillation.

IEEE Trans. Circuits Syst. Video Technol., October, 2024

PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos.

CoRR, 2024

Continual LLaVA: Continual Instruction Tuning in Large Vision-Language Models.

CoRR, 2024

ING-VP: MLLMs cannot Play Easy Vision-based Games Yet.

CoRR, 2024

MUSE: Mamba is Efficient Multi-scale Learner for Text-video Retrieval.

CoRR, 2024

Vision-Language Models Meet Meteorology: Developing Models for Extreme Weather Events Detection with Heatmaps.

CoRR, 2024

How to Continually Adapt Text-to-Image Diffusion Models for Flexible Customization?

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter.

Proceedings of the Findings of the Association for Computational Linguistics, 2024

Exploiting Auxiliary Caption for Video Grounding.

Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

2023

Concept-Aware Video Captioning: Describing Videos With Effective Prior Information.

Bang Yang

Meng Cao

Yuexian Zou

IEEE Trans. Image Process., 2023

Exploring Recommendation Capabilities of GPT-4V(ision): A Preliminary Case Study.

CoRR, 2023

Video Referring Expression Comprehension via Transformer with Content-conditioned Query.

CoRR, 2023

Generating Templated Caption for Video Grounding.

CoRR, 2023

Video Referring Expression Comprehension via Transformer with Content-conditioned Query.

Proceedings of the 1st International Workshop on Deep Multimodal Learning for Information Retrieval, 2023

G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Iterative Proposal Refinement for Weakly-Supervised Video Grounding.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

2022

Deep Motion Prior for Weakly-Supervised Temporal Action Localization.

IEEE Trans. Image Process., 2022

RR-Net: Relation Reasoning for End-to-End Human-Object Interaction Detection.

IEEE Trans. Circuits Syst. Video Technol., 2022

All You Need Is a Second Look: Towards Arbitrary-Shaped Text Detection.

IEEE Trans. Circuits Syst. Video Technol., 2022

Video Referring Expression Comprehension via Transformer with Content-aware Query.

CoRR, 2022

Correspondence Matters for Video Referring Expression Comprehension.

Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022

Visual Relation-Aware Unsupervised Video Captioning.

Puzhao Ji

Meng Cao

Yuexian Zou

Proceedings of the Artificial Neural Networks and Machine Learning - ICANN 2022, 2022

LocVTP: Video-Text Pre-training for Temporal Localization.

Proceedings of the Computer Vision - ECCV 2022, 2022

Unsupervised Pre-training for Temporal Action Localization Tasks.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2021

UniFaceGAN: A Unified Framework for Temporally Consistent Facial Video Editing.

IEEE Trans. Image Process., 2021

Synergic learning for noise-insensitive webly-supervised temporal action localization.

Image Vis. Comput., 2021

RR-Net: Injecting Interactive Semantics in Human-Object Interaction Detection.

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, 2021

On Pursuit of Designing Multi-modal Transformer for Video Grounding.

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021

CoLA: Weakly-Supervised Temporal Action Localization With Snippet Contrastive Learning.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

2020

Task-agnostic Temporally Consistent Facial Video Editing.

CoRR, 2020

Weakly Labelled Audio Tagging Via Convolutional Networks with Spatial and Channel-Wise Attention.

Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

All You Need is a Second Look: Towards Tighter Arbitrary Shape Text Detection.

Meng Cao

Yuexian Zou

Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

2019

GISCA: Gradient-Inductive Segmentation Network With Contextual Attention for Scene Text Detection.

IEEE Access, 2019
