Xinhao Li

Orcid: 0009-0003-0382-3985

Affiliations:

Nanjing University, State Key Laboratory for Novel Software Technology, Nanjing, China
Shanghai AI Laboratory, OpenGVLab, Shanghai, China

According to our database, Xinhao Li authored at least 17 papers between 2023 and 2025.

Collaborative distances:

Dijkstra number of four.
Erdős number of four.

Bibliography

2025

VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception.

CoRR, September, 2025

VideoCap-R1: Enhancing MLLMs for Video Captioning via Structured Thinking.

CoRR, June, 2025

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning.

CoRR, April, 2025

InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling.

CoRR, January, 2025

Online Video Understanding: A Comprehensive Benchmark and Memory-Augmented Method.

CoRR, January, 2025

VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling.

CoRR, January, 2025

Fine-grained Video-Text Retrieval: A New Benchmark and Method.

CoRR, January, 2025

TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Online Video Understanding: OVBench and VideoChat-Online.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2024

VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model.

CoRR, 2024

InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding.

CoRR, 2024

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

InternVideo2: Scaling Foundation Models for Multimodal Video Understanding.

Proceedings of the European Conference on Computer Vision (ECCV 2024), 2024

ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video.

Xinhao Li, Yuhan Zhu, Limin Wang

Proceedings of the European Conference on Computer Vision (ECCV 2024), 2024

VideoMamba: State Space Model for Efficient Video Understanding.

Proceedings of the European Conference on Computer Vision (ECCV 2024), 2024

2023

ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video.

Xinhao Li, Limin Wang

CoRR, 2023
