Sheng Zhao

ORCID: 0000-0002-9624-5381

Affiliations:
  • Microsoft Corporation, USA
  • Microsoft Azure Speech, Microsoft Cloud+AI, Beijing, China
  • Microsoft STC Asia, China


According to our database, Sheng Zhao authored at least 86 papers between 2012 and 2025.

Bibliography

2025
Fine-Tuning Large Multimodal Models for Automatic Pronunciation Assessment.
CoRR, September, 2025

Next Tokens Denoising for Speech Synthesis.
CoRR, July, 2025

CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching.
CoRR, June, 2025

Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling.
CoRR, May, 2025

Exploring the Potential of Large Multimodal Models as Effective Alternatives for Pronunciation Assessment.
CoRR, March, 2025

ZSVC: Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training.
Proceedings of the 2025 IEEE International Conference on Acoustics, 2025

LIMMITS'25: Multilingual Streaming TTS With Neural Codecs for Indian Languages.
Proceedings of the 2025 IEEE International Conference on Acoustics, 2025

Autoregressive Speech Synthesis without Vector Quantization.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

2024
Memories are One-to-Many Mapping Alleviators in Talking Face Generation.
IEEE Trans. Pattern Anal. Mach. Intell., December, 2024

NaturalSpeech: End-to-End Text-to-Speech Synthesis With Human-Level Quality.
IEEE Trans. Pattern Anal. Mach. Intell., June, 2024

Continuous Speech Tokens Makes LLMs Robust Multi-Modality Learners.
CoRR, 2024

VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment.
CoRR, 2024

VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers.
CoRR, 2024

RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis.
CoRR, 2024

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models.
CoRR, 2024

Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like.
CoRR, 2024

Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-To-Speech.
Proceedings of the IEEE Spoken Language Technology Workshop, 2024

Investigating Neural Audio Codecs For Speech Language Model-Based Speech Generation.
Proceedings of the IEEE Spoken Language Technology Workshop, 2024

E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS.
Proceedings of the IEEE Spoken Language Technology Workshop, 2024

CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

UniStyle: Unified Style Modeling for Speaking Style Captioning and Stylistic Speech Synthesis.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Contrastive Context-Speech Pretraining for Expressive Text-to-Speech Synthesis.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Total-Duration-Aware Duration Modeling for Text-to-Speech Systems.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

UniAudio: Towards Universal Audio Generation with Large Language Models.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

PromptTTS 2: Describing and Generating Voices with Text Prompt.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

GAIA: Zero-shot Talking Avatar Generation.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

2023
StableFace: Analyzing and Improving Motion Stability for Talking Face Generation.
IEEE J. Sel. Top. Signal Process., November, 2023

UniAudio: An Audio Foundation Model Toward Universal Audio Generation.
CoRR, 2023

PromptTTS 2: Describing and Generating Voices with Text Prompt.
CoRR, 2023

NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers.
CoRR, 2023

Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling.
CoRR, 2023

FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model.
CoRR, 2023

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers.
CoRR, 2023

AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder.
Proceedings of the 31st ACM International Conference on Multimedia, 2023

ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Large-Scale Automatic Audiobook Creation.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

VAST: Vivify Your Talking Avatar via Zero-Shot Expressive Facial Style Transfer.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

HiFace: High-Fidelity 3D Face Reconstruction by Learning Static and Dynamic Details.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

LeanSpeech: The Microsoft Lightweight Speech Synthesis System for Limmits Challenge 2023.
Proceedings of the IEEE International Conference on Acoustics, 2023

Improving Contextual Spelling Correction by External Acoustics Attention and Semantic Aware Data Augmentation.
Proceedings of the IEEE International Conference on Acoustics, 2023

PromptTTS: Controllable Text-To-Speech With Text Descriptions.
Proceedings of the IEEE International Conference on Acoustics, 2023

MuLanTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2023.
Proceedings of the 18th Blizzard Challenge Workshop, Grenoble, France, August 29, 2023, 2023

VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing.
Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

2022
Towards Contextual Spelling Correction for Customization of End-to-End Speech Recognition Systems.
IEEE ACM Trans. Audio Speech Lang. Process., 2022

ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech.
CoRR, 2022

BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

MeloForm: Generating Melody with Musical Form based on Expert Systems and Neural Networks.
Proceedings of the 23rd International Society for Music Information Retrieval Conference, 2022

Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

RetrieverTTS: Modeling Decomposed Factors for Text-Based Speech Insertion.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

A Study on the Efficacy of Model Pre-Training In Developing Neural Text-to-Speech System.
Proceedings of the IEEE International Conference on Acoustics, 2022

Transformer-S2A: Robust and Efficient Speech-to-Animation.
Proceedings of the IEEE International Conference on Acoustics, 2022

Infergrad: Improving Diffusion Models for Vocoder by Considering Inference in Training.
Proceedings of the IEEE International Conference on Acoustics, 2022

2021
AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style.
CoRR, 2021

Adaptive Text to Speech for Spontaneous Style.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

A Light-Weight Contextual Spelling Correction Model for Customizing Transducer-Based Speech Recognition Systems.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

AdaSpeech: Adaptive Text to Speech for Custom Voice.
Proceedings of the 9th International Conference on Learning Representations, 2021

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.
Proceedings of the 9th International Conference on Learning Representations, 2021

DenoiSpeech: Denoising Text to Speech with Frame-Level Noise Modeling.
Proceedings of the IEEE International Conference on Acoustics, 2021

AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data.
Proceedings of the IEEE International Conference on Acoustics, 2021

LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search.
Proceedings of the IEEE International Conference on Acoustics, 2021

MBNET: MOS Prediction for Synthesized Speech with Mean-Bias Network.
Proceedings of the IEEE International Conference on Acoustics, 2021

DelightfulTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2021.
Proceedings of the Blizzard Challenge 2021, virtual, October 23, 2021, 2021

2020
LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition.
Proceedings of the KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2020

Enhancing Monotonicity for Robust Autoregressive Transformer TTS.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

MoBoAligner: A Neural Alignment Model for Non-Autoregressive TTS with Monotonic Boundary Search.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

MultiSpeech: Multi-Speaker Text to Speech with Transformer.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Semantic Mask for Transformer Based End-to-End Speech Recognition.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

A Study of Non-autoregressive Model for Sequence Generation.
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020

RobuTrans: A Robust Transformer-Based Text-to-Speech Model.
Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020

2019
FastSpeech: Fast, Robust and Controllable Text to Speech.
Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, 2019

Token-Level Ensemble Distillation for Grapheme-to-Phoneme Conversion.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Towards Discriminative Representation Learning for Speech Emotion Recognition.
Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019

Almost Unsupervised Text to Speech and Automatic Speech Recognition.
Proceedings of the 36th International Conference on Machine Learning, 2019

Dilated Residual Network with Multi-head Self-attention for Speech Emotion Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2019

Knowledge Distillation from Bert in Pre-Training and Fine-Tuning for Polyphone Disambiguation.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

Neural Speech Synthesis with Transformer Network.
Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019

2018
Close to Human Quality TTS with Transformer.
CoRR, 2018

2012
Turning a Monolingual Speaker into Multilingual for a Mixed-language TTS.
Proceedings of the 13th Annual Conference of the International Speech Communication Association, 2012

