Zehan Wang

Orcid: 0009-0007-6426-3749

Affiliations:
  • Zhejiang University, China


According to our database1, Zehan Wang authored at least 48 papers between 2023 and 2025.

Collaborative distances:
  • Dijkstra number2 of four.
  • Erdős number3 of four.

Timeline

Legend:

Book 
In proceedings 
Article 
PhD thesis 
Dataset
Other 

Links

Online presence:

On csauthors.net:

Bibliography

2025
DSI-Bench: A Benchmark for Dynamic Spatial Intelligence.
CoRR, October, 2025

APO: Enhancing Reasoning Ability of MLLMs via Asymmetric Policy Optimization.
CoRR, June, 2025

GenSpace: Benchmarking Spatially-Aware Image Generation.
CoRR, May, 2025

Depth Anything with Any Prior.
CoRR, May, 2025

Unleashing the Power of Natural Audio Featuring Multiple Sound Sources.
CoRR, April, 2025

OmniChat: Enhancing Spoken Dialogue Systems with Scalable Synthetic Data for Diverse Scenarios.
CoRR, January, 2025

EAGER-LLM: Enhancing Large Language Models as Recommenders through Exogenous Behavior-Semantic Integration.
Proceedings of the ACM on Web Conference 2025, 2025

Data-Efficiently Learn Large Language Model for Universal 3D Scene Perception.
Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2025, Albuquerque, New Mexico, USA, April 29, 2025

Multimodal Conditional Retrieval with High Controllability.
Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, V.2, 2025

Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models.
Proceedings of the Forty-second International Conference on Machine Learning, 2025

Diff-Prompt: Diffusion-Driven Prompt Generator with Mask Supervision.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Improving Long-Text Alignment for Text-to-Image Diffusion Models.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

VoxDialogue: Can Spoken Dialogue Systems Understand Information Beyond Words?
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

RoboGround: Robotic Manipulation with Grounded Vision-Language Priors.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

SpatialCLIP: Learning 3D-aware Image Representations from Spatially Discriminative Language.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

ControlSpeech: Towards Simultaneous and Independent Zero-shot Speaker Cloning and Zero-shot Language Style Control.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

T2A-Feedback: Improving Basic Capabilities of Text-to-Audio Generation via Fine-grained AI Feedback.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

2024
WavChat: A Survey of Spoken Dialogue Models.
CoRR, 2024

OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup.
CoRR, 2024

MimicTalk: Mimicking a personalized and expressive 3D talking face in minutes.
CoRR, 2024

OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces.
CoRR, 2024

Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT.
CoRR, 2024

Frieren: Efficient Video-to-Audio Generation with Rectified Flow Matching.
CoRR, 2024

Lumina-Next : Making Lumina-T2X Stronger and Faster with Next-DiT.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Extending Multi-modal Contrastive Representations.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

MimicTalk: Mimicking a personalized and expressive 3D talking face in minutes.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Action Imitation in Common Action Space for Customized Action Image Synthesis.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

VoiceTuner: Self-Supervised Pre-training and Efficient Fine-tuning For Voice Generation.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

InstructSpeech: Following Speech Editing Instructions via Large Language Models.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation.
Proceedings of the Findings of the Association for Computational Linguistics, 2024

2023
TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation.
CoRR, 2023

Multi-Modal Domain Adaptation Across Video Scenes for Temporal Video Grounding.
CoRR, 2023

Chat-3D v2: Bridging 3D Scene and Large Language Models with Object Identifiers.
CoRR, 2023

Extending Multi-modal Contrastive Representations.
CoRR, 2023

Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes.
CoRR, 2023

MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition.
CoRR, 2023

Connecting Multi-modal Contrastive Representations.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding.
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

Scene-robust Natural Language Video Localization via Learning Domain-invariant Representations.
Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023


  Loading...