Rongjie Huang

Orcid: 0000-0002-1695-9000

Affiliations:
  • Zhejiang University, College of Computer Science and Software, Hangzhou, China


According to our database1, Rongjie Huang authored at least 90 papers between 2021 and 2025.

Collaborative distances:
  • Dijkstra number2 of four.
  • Erdős number3 of four.

Timeline

Legend:

Book 
In proceedings 
Article 
PhD thesis 
Dataset
Other 

Links

Online presence:

On csauthors.net:

Bibliography

2025
Versatile Framework for Song Generation with Prompt-based Control.
CoRR, April, 2025

Unleashing the Power of Natural Audio Featuring Multiple Sound Sources.
CoRR, April, 2025

OmniAudio: Generating Spatial Audio from 360-Degree Video.
CoRR, April, 2025

OmniChat: Enhancing Spoken Dialogue Systems with Scalable Synthetic Data for Diverse Scenarios.
CoRR, January, 2025

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Lumina-T2X: Scalable Flow-based Large Diffusion Transformer for Flexible Resolution Generation.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

VoxDialogue: Can Spoken Dialogue Systems Understand Information Beyond Words?
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

NAT3DSound: 3D Spatial Sound Field Synthesis with Multi-Modal Non-Autoregressive Transformer.
Proceedings of the 2025 IEEE International Conference on Acoustics, 2025

FlashAudio: Rectified Flow for Fast and High-Fidelity Text-to-Audio Generation.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

TechSinger: Technique Controllable Multilingual Singing Voice Synthesis via Flow Matching.
Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025

2024
InstructTTS: Modelling Expressive TTS in Discrete Latent Space With Natural Language Style Prompt.
IEEE ACM Trans. Audio Speech Lang. Process., 2024

OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup.
CoRR, 2024

FlashAudio: Rectified Flows for Fast and High-Fidelity Text-to-Audio Generation.
CoRR, 2024

MimicTalk: Mimicking a personalized and expressive 3D talking face in minutes.
CoRR, 2024

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling.
CoRR, 2024

SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models.
CoRR, 2024

Integrating Audio, Visual, and Semantic Information for Enhanced Multimodal Speaker Diarization.
CoRR, 2024

MEDIC: Zero-shot Music Editing with Disentangled Inversion Control.
CoRR, 2024

OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces.
CoRR, 2024

Accompanied Singing Voice Synthesis with Fully Text-controlled Melody.
CoRR, 2024

Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT.
CoRR, 2024

ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec.
CoRR, 2024

AudioLCM: Text-to-Audio Generation with Latent Consistency Models.
CoRR, 2024

Frieren: Efficient Video-to-Audio Generation with Rectified Flow Matching.
CoRR, 2024

Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers.
CoRR, 2024

Text-to-Song: Towards Controllable Music Generation Incorporating Vocals and Accompaniment.
CoRR, 2024

Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models.
CoRR, 2024

Lumina-Next : Making Lumina-T2X Stronger and Faster with Next-DiT.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Extending Multi-modal Contrastive Representations.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

MoMu-Diffusion: On Learning Long-Term Motion-Music Synchronization and Correspondence.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

MimicTalk: Mimicking a personalized and expressive 3D talking face in minutes.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

UniAudio 1.5: Large Language Model-Driven Audio Codec is A Few-Shot Audio Task Learner.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt.
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024

AudioLCM: Efficient and High-Quality Text-to-Audio Generation with Minimal Inference Steps.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

VoiceTuner: Self-Supervised Pre-training and Efficient Fine-tuning For Voice Generation.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

UniAudio: Towards Universal Audio Generation with Large Language Models.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

InstructSpeech: Following Speech Editing Instructions via Large Language Models.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024

Wav2SQL: Direct Generalizable Speech-To-SQL Parsing.
Proceedings of the Findings of the Association for Computational Linguistics, 2024

Self-Supervised Singing Voice Pre-Training towards Speech-to-Singing Conversion.
Proceedings of the Findings of the Association for Computational Linguistics, 2024

Robust Singing Voice Transcription Serves Synthesis.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

Text-to-Song: Towards Controllable Music Generation Incorporating Vocal and Accompaniment.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation.
Proceedings of the Findings of the Association for Computational Linguistics, 2024

StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis.
Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head.
Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

2023
TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation.
CoRR, 2023

Chat-3D v2: Bridging 3D Scene and Large Language Models with Object Identifiers.
CoRR, 2023

UniAudio: An Audio Foundation Model Toward Universal Audio Generation.
CoRR, 2023

Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias.
CoRR, 2023

Detector Guidance for Multi-Object Text-to-Image Generation.
CoRR, 2023

Make-A-Voice: Unified Voice Synthesis With Discrete Representation.
CoRR, 2023

Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation.
CoRR, 2023

HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec.
CoRR, 2023

GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation.
CoRR, 2023

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head.
CoRR, 2023

MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition.
CoRR, 2023

InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt.
CoRR, 2023

UniSinger: Unified End-to-End Singing Voice Synthesis With Cross-Modality Information Matching.
Proceedings of the 31st ACM International Conference on Multimedia, 2023

Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models.
Proceedings of the International Conference on Machine Learning, 2023

TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation.
Proceedings of the Eleventh International Conference on Learning Representations, 2023

MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

VarietySound: Timbre-Controllable Video to Sound Generation Via Unsupervised Information Disentanglement.
Proceedings of the IEEE International Conference on Acoustics, 2023

ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer.
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-Training.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

Contrastive Token-Wise Meta-Learning for Unseen Performer Visual Temporal-Aligned Translation.
Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023

AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment.
Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023

FluentSpeech: Stutter-Oriented Automatic Speech Editing with Context-Aware Diffusion Models.
Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023

Prosody-TTS: Improving Prosody with Masked Autoencoder and Conditional Diffusion Model For Expressive Text-to-Speech.
Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023

AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

FastDiff 2: Revisiting and Incorporating GANs and Diffusion Models in High-Fidelity Speech Synthesis.
Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023

RMSSinger: Realistic-Music-Score based Singing Voice Synthesis.
Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023

2022
VarietySound: Timbre-Controllable Video to Sound Generation via Unsupervised Information Disentanglement.
CoRR, 2022

GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis.
CoRR, 2022

M4Singer: A Multi-Style, Multi-Singer and Musical Score Provided Mandarin Singing Corpus.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

ProDiff: Progressive Fast Diffusion Model for High-Quality Text-to-Speech.
Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022

SingGAN: Generative Adversarial Network For High-Fidelity Singing Voice Generation.
Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022

FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis.
Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, 2022

2021
SingGAN: Generative Adversarial Network For High-Fidelity Singing Voice Generation.
CoRR, 2021

Bilateral Denoising Diffusion Models.
CoRR, 2021

Multi-Singer: Fast Multi-Singer Singing Voice Vocoder With A Large-Scale Corpus.
Proceedings of the MM '21: ACM Multimedia Conference, Virtual Event, China, October 20, 2021

EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021


  Loading...