Rongjie Huang
ORCID: 0000-0002-1695-9000
Affiliations:
- Zhejiang University, College of Computer Science and Software, Hangzhou, China
According to our database, Rongjie Huang authored at least 90 papers between 2021 and 2025.
Bibliography
2025
CoRR, April, 2025
OmniChat: Enhancing Spoken Dialogue Systems with Scalable Synthetic Data for Diverse Scenarios.
CoRR, January, 2025
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
Lumina-T2X: Scalable Flow-based Large Diffusion Transformer for Flexible Resolution Generation.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
NAT3DSound: 3D Spatial Sound Field Synthesis with Multi-Modal Non-Autoregressive Transformer.
Proceedings of the 2025 IEEE International Conference on Acoustics, 2025
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025
TechSinger: Technique Controllable Multilingual Singing Voice Synthesis via Flow Matching.
Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025
2024
InstructTTS: Modelling Expressive TTS in Discrete Latent Space With Natural Language Style Prompt.
IEEE ACM Trans. Audio Speech Lang. Process., 2024
CoRR, 2024
CoRR, 2024
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling.
CoRR, 2024
SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models.
CoRR, 2024
Integrating Audio, Visual, and Semantic Information for Enhanced Multimodal Speaker Diarization.
CoRR, 2024
ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec.
CoRR, 2024
CoRR, 2024
Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers.
CoRR, 2024
Text-to-Song: Towards Controllable Music Generation Incorporating Vocals and Accompaniment.
CoRR, 2024
Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models.
CoRR, 2024
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
MoMu-Diffusion: On Learning Long-Term Motion-Music Synchronization and Correspondence.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
UniAudio 1.5: Large Language Model-Driven Audio Codec is A Few-Shot Audio Task Learner.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024
AudioLCM: Efficient and High-Quality Text-to-Audio Generation with Minimal Inference Steps.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024
VoiceTuner: Self-Supervised Pre-training and Efficient Fine-tuning For Voice Generation.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024
Proceedings of the Forty-first International Conference on Machine Learning, 2024
Proceedings of the Forty-first International Conference on Machine Learning, 2024
Proceedings of the Forty-first International Conference on Machine Learning, 2024
Proceedings of the Twelfth International Conference on Learning Representations, 2024
TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024
Proceedings of the Findings of the Association for Computational Linguistics, 2024
Proceedings of the Findings of the Association for Computational Linguistics, 2024
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024
Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024
Text-to-Song: Towards Controllable Music Generation Incorporating Vocal and Accompaniment.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024
Proceedings of the Findings of the Association for Computational Linguistics, 2024
Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024
Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024
2023
CoRR, 2023
CoRR, 2023
CoRR, 2023
CoRR, 2023
GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation.
CoRR, 2023
CoRR, 2023
MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition.
CoRR, 2023
InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt.
CoRR, 2023
UniSinger: Unified End-to-End Singing Voice Synthesis With Cross-Modality Information Matching.
Proceedings of the 31st ACM International Conference on Multimedia, 2023
Proceedings of the International Conference on Machine Learning, 2023
Proceedings of the Eleventh International Conference on Learning Representations, 2023
MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023
VarietySound: Timbre-Controllable Video to Sound Generation Via Unsupervised Information Disentanglement.
Proceedings of the IEEE International Conference on Acoustics, 2023
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023
CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-Training.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023
Contrastive Token-Wise Meta-Learning for Unseen Performer Visual Temporal-Aligned Translation.
Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023
Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023
FluentSpeech: Stutter-Oriented Automatic Speech Editing with Context-Aware Diffusion Models.
Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023
Prosody-TTS: Improving Prosody with Masked Autoencoder and Conditional Diffusion Model For Expressive Text-to-Speech.
Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023
FastDiff 2: Revisiting and Incorporating GANs and Diffusion Models in High-Fidelity Speech Synthesis.
Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023
Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023
2022
VarietySound: Timbre-Controllable Video to Sound Generation via Unsupervised Information Disentanglement.
CoRR, 2022
GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis.
CoRR, 2022
M4Singer: A Multi-Style, Multi-Singer and Musical Score Provided Mandarin Singing Corpus.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022
Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022
Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022
Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, 2022
2021
CoRR, 2021
Proceedings of the MM '21: ACM Multimedia Conference, Virtual Event, China, October 20, 2021
EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021