Mu Cai

Orcid: 0009-0008-7967-9752

According to our database, Mu Cai authored at least 30 papers between 2020 and 2025.

Collaborative distances:

Dijkstra number of four.
Erdős number of four.

Bibliography

2025

Contamination Detection for VLMs using Multi-Modal Semantic Perturbation.

CoRR, November, 2025

RECODE: Reasoning Through Code Generation for Visual Question Answering.

CoRR, October, 2025

When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios.

CoRR, July, 2025

Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models.

CoRR, May, 2025

Magma: A Foundation Model for Multimodal AI Agents.

CoRR, February, 2025

An Investigation on LLMs' Visual Understanding Ability Using SVG for Image-Text Bridging.

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2025

Matryoshka Multimodal Models.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

LLaRA: Supercharging Robot Learning Data for Vision-Language Policy.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Magma: A Foundation Model for Multimodal AI Agents.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2024

TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models.

CoRR, 2024

Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos.

Jianrui Zhang

Mu Cai

Yong Jae Lee

CoRR, 2024

Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner.

CoRR, 2024

LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models.

CoRR, 2024

Yo'LLaVA: Your Personalized Language and Vision Assistant.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Cross-Modal Self-Supervised Learning with Effective Contrastive Units for LiDAR Point Clouds.

Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2024

VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation.

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

Removing Distributional Discrepancies in Captions Improves Image-Text Alignment.

Proceedings of the Computer Vision - ECCV 2024, 2024

ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts.

Mu Cai

Haotian Liu

Siva Karthik Mustikovela

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Investigating the Catastrophic Forgetting in Multimodal Large Language Model Fine-Tuning.

Proceedings of the Conference on Parsimony and Learning, 2024

CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples.

Proceedings of the Findings of the Association for Computational Linguistics, 2024

2023

Making Large Multimodal Models Understand Arbitrary Visual Prompts.

Mu Cai

Haotian Liu

Siva Karthik Mustikovela

CoRR, 2023

Investigating the Catastrophic Forgetting in Multimodal Large Language Models.

CoRR, 2023

Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding.

CoRR, 2023

Out-of-distribution Detection via Frequency-regularized Generative Models.

Mu Cai

Yixuan Li

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023

A Sentence Speaks a Thousand Images: Domain Generalization through Distilling CLIP with Language Guidance.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

2022

VOS: Learning What You Don't Know by Virtual Outlier Synthesis.

Proceedings of the Tenth International Conference on Learning Representations, 2022

Masked Discrimination for Self-supervised Learning on Point Clouds.

Haotian Liu

Mu Cai

Yong Jae Lee

Proceedings of the Computer Vision - ECCV 2022, 2022

2021

Frequency Domain Image Translation: More Photo-realistic, Better Identity-preserving.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

2020

Frequency Domain Image Translation: More Photo-realistic, Better Identity-preserving.

CoRR, 2020

A Game-Theoretic Strategy-Aware Interaction Algorithm with Validation on Real Traffic Data.

Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2020
