Mu Cai

ORCID: 0009-0008-7967-9752

According to our database, Mu Cai authored at least 29 papers between 2020 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2025
When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios.
CoRR, July, 2025

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities.
CoRR, July, 2025

Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models.
CoRR, May, 2025

Magma: A Foundation Model for Multimodal AI Agents.
CoRR, February, 2025

An Investigation on LLMs' Visual Understanding Ability Using SVG for Image-Text Bridging.
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2025

Matryoshka Multimodal Models.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

LLaRA: Supercharging Robot Learning Data for Vision-Language Policy.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Magma: A Foundation Model for Multimodal AI Agents.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2024
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models.
CoRR, 2024

Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos.
CoRR, 2024

Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner.
CoRR, 2024

LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models.
CoRR, 2024

Yo'LLaVA: Your Personalized Language and Vision Assistant.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Cross-Modal Self-Supervised Learning with Effective Contrastive Units for LiDAR Point Clouds.
Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2024

VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

Removing Distributional Discrepancies in Captions Improves Image-Text Alignment.
Proceedings of the Computer Vision - ECCV 2024, 2024

ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Investigating the Catastrophic Forgetting in Multimodal Large Language Model Fine-Tuning.
Proceedings of the Conference on Parsimony and Learning, 2024

CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples.
Proceedings of the Findings of the Association for Computational Linguistics, 2024

2023
Making Large Multimodal Models Understand Arbitrary Visual Prompts.
CoRR, 2023

Investigating the Catastrophic Forgetting in Multimodal Large Language Models.
CoRR, 2023

Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding.
CoRR, 2023

Out-of-distribution Detection via Frequency-regularized Generative Models.
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023

A Sentence Speaks a Thousand Images: Domain Generalization through Distilling CLIP with Language Guidance.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

2022
VOS: Learning What You Don't Know by Virtual Outlier Synthesis.
Proceedings of the Tenth International Conference on Learning Representations, 2022

Masked Discrimination for Self-supervised Learning on Point Clouds.
Proceedings of the Computer Vision - ECCV 2022, 2022

2021
Frequency Domain Image Translation: More Photo-realistic, Better Identity-preserving.
Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

2020
Frequency Domain Image Translation: More Photo-realistic, Better Identity-preserving.
CoRR, 2020

A Game-Theoretic Strategy-Aware Interaction Algorithm with Validation on Real Traffic Data.
Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2020

