Shan Yang

ORCID: 0000-0003-4464-146X

Affiliations:
  • Tencent AI Lab, Beijing, China
  • Northwestern Polytechnical University, School of Computer Science, Xi'an, China (PhD)


According to our database, Shan Yang authored at least 50 papers between 2016 and 2026.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2026
Covo-Audio Technical Report.
CoRR, February, 2026

UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation.
Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, 2026

2025
PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation.
CoRR, December, 2025

EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering.
CoRR, August, 2025

UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation.
CoRR, June, 2025

FleSpeech: Flexibly Controllable Speech Generation with Various Prompts.
CoRR, January, 2025

AudioGenie: A Training-Free Multi-Agent Framework for Diverse Multimodality-to-Multiaudio Generation.
Proceedings of the 33rd ACM International Conference on Multimedia, 2025

Cued-Agent: A Collaborative Multi-Agent System for Automatic Cued Speech Recognition.
Proceedings of the 33rd ACM International Conference on Multimedia, 2025

Mitigating Audiovisual Mismatch in Visual-Guide Audio Captioning.
Proceedings of the 26th Annual Conference of the International Speech Communication Association, 2025

Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model.
Proceedings of the 26th Annual Conference of the International Speech Communication Association, 2025

UniSep: Universal Target Audio Separation with Language Models at Scale.
Proceedings of the IEEE International Conference on Multimedia and Expo, 2025

DrawSpeech: Expressive Speech Synthesis Using Prosodic Sketches as Control Conditions.
Proceedings of the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing, 2025

Sinba: Singing-To-Accompaniment Generation With Pitch Guidance Via Mamba-Based Language Model.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2025

2023
A High Fidelity and Low Complexity Neural Audio Coding.
CoRR, 2023

Multi-mode Neural Speech Coding Based on Deep Generative Networks.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis.
Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

2022
MsEmoTTS: Multi-Scale Emotion Transfer, Prediction, and Control for Emotional Speech Synthesis.
IEEE ACM Trans. Audio Speech Lang. Process., 2022

Cross-Speaker Emotion Transfer Through Information Perturbation in Emotional Speech Synthesis.
IEEE Signal Process. Lett., 2022

End-to-End Voice Conversion with Information Perturbation.
Proceedings of the 13th International Symposium on Chinese Spoken Language Processing, 2022

Learning Noise-independent Speech Representation for High-quality Voice Conversion for Noisy Target Speakers.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

VCVTS: Multi-Speaker Video-to-Speech Synthesis Via Cross-Modal Knowledge Transfer from Voice Conversion.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2022

Referee: Towards Reference-Free Cross-Speaker Style Transfer with Low-Quality Data for Expressive Speech Synthesis.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2022

2021
Effective and direct control of neural TTS prosody by removing interactions between different attributes.
Neural Networks, 2021

Multi-Band MelGAN: Faster Waveform Generation For High-Quality Text-To-Speech.
Proceedings of the IEEE Spoken Language Technology Workshop, 2021

Learn2Sing: Target Speaker Singing Voice Synthesis by Learning from a Singing Teacher.
Proceedings of the IEEE Spoken Language Technology Workshop, 2021

Fine-Grained Emotion Strength Transfer, Control and Prediction for Emotional Speech Synthesis.
Proceedings of the IEEE Spoken Language Technology Workshop, 2021

Accent and Speaker Disentanglement in Many-to-many Voice Conversion.
Proceedings of the 12th International Symposium on Chinese Spoken Language Processing, 2021

Controllable Emotion Transfer For End-to-End Speech Synthesis.
Proceedings of the 12th International Symposium on Chinese Spoken Language Processing, 2021

Glow-WaveGAN: Learning Speech Representations from GAN-Based Variational Auto-Encoder for High Fidelity Flow-Based Speech Synthesis.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Controllable Context-Aware Conversational Speech Synthesis.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

TeNC: Low Bit-Rate Speech Coding with VQ-VAE and GAN.
Proceedings of the ICMI '21 Companion: Companion Publication of the 2021 International Conference on Multimodal Interaction, Montreal, QC, Canada, October 18, 2021

2020
Adversarial Feature Learning and Unsupervised Clustering Based Speech Synthesis for Found Data With Acoustic and Textual Noise.
IEEE Signal Process. Lett., 2020

On the localness modeling for the self-attention based end-to-end speech synthesis.
Neural Networks, 2020

Phonetic Posteriorgrams based Many-to-Many Singing Voice Conversion via Adversarial Training.
CoRR, 2020

Exploiting Deep Sentential Context for Expressive End-to-End Speech Synthesis.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Data Efficient Voice Cloning from Noisy Samples with Domain Adversarial Training.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

The NUS & NWPU system for Voice Conversion Challenge 2020.
Proceedings of the Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 2020

2019
Pre-Alignment Guided Attention for Improving Training Efficiency and Model Stability in End-to-End Speech Synthesis.
IEEE Access, 2019

Enhancing Hybrid Self-attention Structure with Relative-position-aware Bias for Speech Synthesis.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2019

SZ-NPU Team's Entry to Blizzard Challenge 2019.
Proceedings of the Blizzard Challenge 2019, Vienna, Austria, September 23, 2019, 2019

Controlling Emotion Strength with Relative Attribute for End-to-End Speech Synthesis.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

Improving Mandarin End-to-End Speech Synthesis by Self-Attention and Learnable Gaussian Bias.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

Learning Hierarchical Representations for Expressive Speaking Style in End-to-End Speech Synthesis.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

2018
The I2R-NWPU-NUS Text-to-Speech System for Blizzard Challenge 2018.
Proceedings of the Blizzard Challenge 2018, Hyderabad, India, September 8, 2018, 2018

2017
The I2R-NWPU Text-to-Speech System for Blizzard Challenge 2017.
Proceedings of the Blizzard Challenge 2017, Stockholm, Sweden, August 25, 2017, 2017

Statistical parametric speech synthesis using generative adversarial networks under a multi-task learning framework.
Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop, 2017

2016
A deep bidirectional LSTM approach for video-realistic talking head.
Multim. Tools Appl., 2016

The I2R-NWPU-NTU Text-to-Speech System at Blizzard Challenge 2016.
Proceedings of the Blizzard Challenge 2016, Cupertino, CA, USA, September 16, 2016, 2016

On the training of DNN-based average voice model for speech synthesis.
Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2016

