Sheng Zhao

Orcid: 0000-0002-9624-5381

Affiliations:

Microsoft Corporation, USA
Microsoft Azure Speech, Microsoft Cloud+AI, Beijing, China
Microsoft STC Asia, China

According to our database, Sheng Zhao authored at least 87 papers between 2012 and 2025.

Collaborative distances:

Dijkstra number of four.
Erdős number of three.

Bibliography

2025

FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates.

CoRR, October, 2025

Fine-Tuning Large Multimodal Models for Automatic Pronunciation Assessment.

CoRR, September, 2025

Next Tokens Denoising for Speech Synthesis.

CoRR, July, 2025

CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching.

CoRR, June, 2025

Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling.

CoRR, May, 2025

Exploring the Potential of Large Multimodal Models as Effective Alternatives for Pronunciation Assessment.

CoRR, March, 2025

ZSVC: Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training.

Proceedings of the 2025 IEEE International Conference on Acoustics, 2025

LIMMITS'25: Multilingual Streaming TTS With Neural Codecs for Indian Languages.

Mark Hasegawa-Johnson

Proceedings of the 2025 IEEE International Conference on Acoustics, 2025

Autoregressive Speech Synthesis without Vector Quantization.

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

2024

Memories are One-to-Many Mapping Alleviators in Talking Face Generation.

IEEE Trans. Pattern Anal. Mach. Intell., December, 2024

NaturalSpeech: End-to-End Text-to-Speech Synthesis With Human-Level Quality.

IEEE Trans. Pattern Anal. Mach. Intell., June, 2024

Continuous Speech Tokens Makes LLMs Robust Multi-Modality Learners.

CoRR, 2024

VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment.

CoRR, 2024

VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers.

CoRR, 2024

RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis.

CoRR, 2024

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models.

CoRR, 2024

Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like.

CoRR, 2024

Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-To-Speech.

Proceedings of the IEEE Spoken Language Technology Workshop, 2024

Investigating Neural Audio Codecs For Speech Language Model-Based Speech Generation.

Proceedings of the IEEE Spoken Language Technology Workshop, 2024

E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS.

Proceedings of the IEEE Spoken Language Technology Workshop, 2024

CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

UniStyle: Unified Style Modeling for Speaking Style Captioning and Stylistic Speech Synthesis.

Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Contrastive Context-Speech Pretraining for Expressive Text-to-Speech Synthesis.

Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Total-Duration-Aware Duration Modeling for Text-to-Speech Systems.

Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS.

Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

UniAudio: Towards Universal Audio Generation with Large Language Models.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

PromptTTS 2: Describing and Generating Voices with Text Prompt.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

GAIA: Zero-shot Talking Avatar Generation.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

2023

StableFace: Analyzing and Improving Motion Stability for Talking Face Generation.

IEEE J. Sel. Top. Signal Process., November, 2023

UniAudio: An Audio Foundation Model Toward Universal Audio Generation.

CoRR, 2023

PromptTTS 2: Describing and Generating Voices with Text Prompt.

CoRR, 2023

NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers.

CoRR, 2023

Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling.

CoRR, 2023

FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model.

CoRR, 2023

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers.

CoRR, 2023

AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder.

Proceedings of the 31st ACM International Conference on Multimedia, 2023

ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Large-Scale Automatic Audiobook Creation.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

VAST: Vivify Your Talking Avatar via Zero-Shot Expressive Facial Style Transfer.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

HiFace: High-Fidelity 3D Face Reconstruction by Learning Static and Dynamic Details.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

LeanSpeech: The Microsoft Lightweight Speech Synthesis System for Limmits Challenge 2023.

Proceedings of the IEEE International Conference on Acoustics, 2023

Improving Contextual Spelling Correction by External Acoustics Attention and Semantic Aware Data Augmentation.

Proceedings of the IEEE International Conference on Acoustics, 2023

Prompttts: Controllable Text-To-Speech With Text Descriptions.

Proceedings of the IEEE International Conference on Acoustics, 2023

MuLanTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2023.

Proceedings of the 18th Blizzard Challenge Workshop, Grenoble, France, August 29, 2023, 2023

VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing.

Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

2022

Towards Contextual Spelling Correction for Customization of End-to-End Speech Recognition Systems.

IEEE ACM Trans. Audio Speech Lang. Process., 2022

ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech.

CoRR, 2022

BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis.

Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

MeloForm: Generating Melody with Musical Form based on Expert Systems and Neural Networks.

Proceedings of the 23rd International Society for Music Information Retrieval Conference, 2022

Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech.

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

RetrieverTTS: Modeling Decomposed Factors for Text-Based Speech Insertion.

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios.

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders.

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

A Study on the Efficacy of Model Pre-Training In Developing Neural Text-to-Speech System.

Proceedings of the IEEE International Conference on Acoustics, 2022

Transformer-S2A: Robust and Efficient Speech-to-Animation.

Proceedings of the IEEE International Conference on Acoustics, 2022

Infergrad: Improving Diffusion Models for Vocoder by Considering Inference in Training.

Proceedings of the IEEE International Conference on Acoustics, 2022

2021

AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style.

CoRR, 2021

Adaptive Text to Speech for Spontaneous Style.

Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

A Light-Weight Contextual Spelling Correction Model for Customizing Transducer-Based Speech Recognition Systems.

Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

AdaSpeech: Adaptive Text to Speech for Custom Voice.

Proceedings of the 9th International Conference on Learning Representations, 2021

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.

Proceedings of the 9th International Conference on Learning Representations, 2021

Denoispeech: Denoising Text to Speech with Frame-Level Noise Modeling.

Proceedings of the IEEE International Conference on Acoustics, 2021

Adaspeech 2: Adaptive Text to Speech with Untranscribed Data.

Proceedings of the IEEE International Conference on Acoustics, 2021

Lightspeech: Lightweight and Fast Text to Speech with Neural Architecture Search.

Proceedings of the IEEE International Conference on Acoustics, 2021

MBNET: MOS Prediction for Synthesized Speech with Mean-Bias Network.

Proceedings of the IEEE International Conference on Acoustics, 2021

DelightfulTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2021.

Proceedings of the Blizzard Challenge 2021, virtual, October 23, 2021, 2021

2020

LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition.

Proceedings of the KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2020

Enhancing Monotonicity for Robust Autoregressive Transformer TTS.

Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability.

Sarangarajan Parthasarathy

Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

MoBoAligner: A Neural Alignment Model for Non-Autoregressive TTS with Monotonic Boundary Search.

Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

MultiSpeech: Multi-Speaker Text to Speech with Transformer.

Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Semantic Mask for Transformer Based End-to-End Speech Recognition.

Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

A Study of Non-autoregressive Model for Sequence Generation.

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020

RobuTrans: A Robust Transformer-Based Text-to-Speech Model.

Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020

2019

FastSpeech: Fast, Robust and Controllable Text to Speech.

Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, 2019

Token-Level Ensemble Distillation for Grapheme-to-Phoneme Conversion.

Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Towards Discriminative Representation Learning for Speech Emotion Recognition.

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019

Almost Unsupervised Text to Speech and Automatic Speech Recognition.

Proceedings of the 36th International Conference on Machine Learning, 2019

Dilated Residual Network with Multi-head Self-attention for Speech Emotion Recognition.

Proceedings of the IEEE International Conference on Acoustics, 2019

Knowledge Distillation from Bert in Pre-Training and Fine-Tuning for Polyphone Disambiguation.

Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

Neural Speech Synthesis with Transformer Network.

Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019

2018

Close to Human Quality TTS with Transformer.

CoRR, 2018

2012

Turning a Monolingual Speaker into Multilingual for a Mixed-language TTS.

Proceedings of the 13th Annual Conference of the International Speech Communication Association, 2012
