Yuxuan Wang

ORCID: 0000-0002-3889-8560

Affiliations:
  • Alibaba Inc., Qwen team, Beijing, China
  • Peking University, Institute of Computer Technology, Beijing, China
  • Peking University, Center for Data Science, Beijing, China
  • Beijing Institute for General Artificial Intelligence (BIGAI), China


According to our database, Yuxuan Wang authored at least 22 papers between 2022 and 2025.

Collaborative distances:
  • Dijkstra number of five.
  • Erdős number of four.

Bibliography

2025
Probing and Inducing Combinational Creativity in Vision-Language Models.
CoRR, April, 2025

From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens.
CoRR, February, 2025

LongViTU: Instruction Tuning for Long-Form Video Understanding.
CoRR, January, 2025

OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding.
Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), 2025

2024
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format.
CoRR, 2024

Understanding Multimodal Hallucination with Parameter-Free Representation Alignment.
CoRR, 2024

VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges.
CoRR, 2024

ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning.
CoRR, 2024

VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models.
CoRR, 2024

HawkEye: Training Video-Text LLMs for Grounding Text in Videos.
CoRR, 2024

LSTP: Language-guided Spatial-Temporal Prompt Learning for Long-form Video-Text Understanding.
CoRR, 2024

Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering.
Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

2023
Teaching Text-to-Image Models to Communicate.
CoRR, 2023

MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning.
CoRR, 2023

Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training.
CoRR, 2023

Overview of the NLPCC 2023 Shared Task 10: Learn to Watch TV: Multimodal Dialogue Understanding and Response Generation.
Natural Language Processing and Chinese Computing (NLPCC 2023), 2023

VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

Rethinking Dictionaries and Glyphs for Chinese Language Pre-training.
Findings of the Association for Computational Linguistics: ACL 2023, 2023

2022
Overview of the NLPCC 2022 Shared Task: Multi-modal Dialogue Understanding and Generation.
Natural Language Processing and Chinese Computing (NLPCC 2022), 2022

Collaborative Reasoning on Multi-Modal Semantic Graphs for Video-Grounded Dialogue Generation.
Findings of the Association for Computational Linguistics: EMNLP 2022, 2022
