Zhen Ye

Orcid: 0009-0003-6932-9859

Affiliations:
  • Hong Kong University of Science and Technology, Hong Kong, SAR, China


According to our database, Zhen Ye authored at least 22 papers between 2023 and 2026.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2026
Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling.
CoRR, April, 2026

Towards Comprehensive Stage-wise Benchmarking of Large Language Models in Fact-Checking.
CoRR, January, 2026

Inference-time Scaling for Diffusion-based Audio Super-resolution.
Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, 2026

2025
J1: Exploring Simple Test-Time Scaling for LLM-as-a-Judge.
CoRR, May, 2025

YuE: Scaling Open Foundation Models for Long-Form Music Generation.
CoRR, March, 2025

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens.
CoRR, March, 2025

Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis.
CoRR, February, 2025

ScratchEval: Are GPT-4o Smarter than My Child? Evaluating Large Multimodal Models with Visual Programming Challenges.
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, 2025

Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

UnifiedVisual: A Framework for Constructing Unified Vision-Language Datasets.
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

Llasa+: Free Lunch for Accelerated and Streaming Llama-Based Speech Synthesis.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2025

LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

AdamMeme: Adaptively Probe the Reasoning Capacity of Multimodal Large Language Models on Harmfulness.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

Boosting Policy and Process Reward Models with Monte Carlo Tree Search in Open-Domain QA.
Proceedings of the Findings of the Association for Computational Linguistics, 2025

Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model.
Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence, 2025

2024
MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models.
CoRR, 2024

FlashSpeech: Efficient Zero-Shot Speech Synthesis.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

COMOSVC: Consistency Model-Based Singing Voice Conversion.
Proceedings of the 14th IEEE International Symposium on Chinese Spoken Language Processing, 2024

FastSAG: Towards Fast Non-Autoregressive Singing Accompaniment Generation.
Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024

PyramidCodec: Hierarchical Codec for Long-form Music Generation in Audio Domain.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, 2024

2023
CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model.
Proceedings of the 31st ACM International Conference on Multimedia, 2023

NAS-FM: Neural Architecture Search for Tunable and Interpretable Sound Synthesis Based on Frequency Modulation.
Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023

