Yuxuan Wang

Affiliations:

ByteDance AI Lab, Mountain View, CA, USA
Google, Mountain View, CA, USA
Ohio State University, Columbus, OH, USA (former, PhD)

According to our database, Yuxuan Wang authored at least 108 papers between 2012 and 2025.

Collaborative distances:

Dijkstra number of four.
Erdős number of four.

Bibliography

2025

DiSTAR: Diffusion over a Scalable Token Autoregressive Representation for Speech Generation.

CoRR, October, 2025

Heptapod: Language Modeling on Visual Signals.

CoRR, October, 2025

DreamAudio: Customized Text-to-Audio Generation with Diffusion Models.

Co-authors include Mark D. Plumbley.

CoRR, September, 2025

Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice.

CoRR, July, 2025

MagiCodec: Simple Masked Gaussian-Injected Codec for High-Fidelity Reconstruction and Generation.

CoRR, June, 2025

SALMONN-omni: A Standalone Speech LLM without Codec Injection for Full-duplex Conversation.

CoRR, May, 2025

MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix.

Co-authors include Emmanouil Benetos.

CoRR, May, 2025

Solla: Towards a Speech-Oriented LLM That Hears Acoustic Context.

CoRR, March, 2025

Sounding that Object: Interactive Object-Aware Image to Audio Generation.

Co-authors include Gopala Anumanchipalli.

Proceedings of the Forty-second International Conference on Machine Learning, 2025

DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation.

Proceedings of the Forty-second International Conference on Machine Learning, 2025

Sound-VECaps: Improving Audio Generation with Visually Enhanced Captions.

Co-authors include Mark D. Plumbley.

Proceedings of the 2025 IEEE International Conference on Acoustics, 2025

QualiSpeech: A Speech Quality Assessment Dataset with Natural Language Reasoning and Descriptions.

Co-authors include Junichi Yamagishi.

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

Towards Reliable Large Audio Language Model.

Proceedings of the Findings of the Association for Computational Linguistics, 2025

Language Model Can Listen While Speaking.

Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence, 2025

2024

AudioLDM 2: Learning Holistic Audio Generation With Self-Supervised Pretraining.

Co-authors include Mark D. Plumbley.

IEEE ACM Trans. Audio Speech Lang. Process., 2024

Joint Multiscale Cross-Lingual Speaking Style Transfer With Bidirectional Attention Mechanism for Automatic Dubbing.

IEEE ACM Trans. Audio Speech Lang. Process., 2024

SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation.

CoRR, 2024

Seed-Music: A Unified Framework for High Quality and Controlled Music Generation.

Co-authors include Lamtharn Hantrakul and Janne Spijkervet.

CoRR, 2024

NEST-RQ: Next Token Prediction for Speech Self-Supervised Pre-Training.

CoRR, 2024

Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition.

CoRR, 2024

Improving Audio Generation with Visual Enhanced Caption.

Co-authors include Mark D. Plumbley.

CoRR, 2024

A Comprehensive Solution to Connect Speech Encoder and Large Language Model for ASR.

CoRR, 2024

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models.

Co-authors include Philip Anastassiou.

CoRR, 2024

VoiceShop: A Unified Speech-to-Speech Framework for Identity-Preserving Zero-Shot Voice Editing.

Co-authors include Philip Anastassiou.

CoRR, 2024

SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

Can Large Language Models Understand Spatial Audio?

Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

InstructME: An Instruction Guided Music Edit Framework with Latent Diffusion Models.

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024

video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

PolyVoice: Language Models for Speech to Speech Translation.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

A Unified Front-End Framework for English Text-to-Speech Synthesis.

Proceedings of the IEEE International Conference on Acoustics, 2024

Audio Prompt Tuning for Universal Sound Separation.

Proceedings of the IEEE International Conference on Acoustics, 2024

2023

InstructME: An Instruction Guided Music Edit And Remix Framework with Latent Diffusion Models.

CoRR, 2023

Separate Anything You Describe.

Co-authors include Mark D. Plumbley.

CoRR, 2023

PolyVoice: Language Models for Speech to Speech Translation.

CoRR, 2023

Language-Universal Phonetic Representation in Multilingual Speech Pretraining for Low-Resource Speech Recognition.

Co-authors include Chuanzeng Huang.

CoRR, 2023

A Unified Front-End Framework for English Text-to-Speech Synthesis.

CoRR, 2023

Joint Multi-scale Cross-lingual Speaking Style Transfer with Bidirectional Attention Mechanism for Automatic Dubbing.

CoRR, 2023

Efficient Neural Music Generation.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Zero-Shot Accent Conversion using Pseudo Siamese Disentanglement Network.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Language-universal Phonetic Encoder for Low-resource Speech Recognition.

Co-authors include Chuanzeng Huang.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Memory Augmented Lookup Dictionary Based Language Modeling for Automatic Speech Recognition.

Co-authors include Chuanzeng Huang.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Streaming Voice Conversion via Intermediate Bottleneck Features and Non-Streaming Teacher Guidance.

Proceedings of the IEEE International Conference on Acoustics, 2023

2022

GiantMIDI-Piano: A Large-Scale MIDI Dataset for Classical Piano Music.

Trans. Int. Soc. Music. Inf. Retr., 2022

Neural Sound Field Decomposition with Super-resolution of Sound Direction.

CoRR, 2022

Controllable and Lossless Non-Autoregressive End-to-End Text-to-Speech.

CoRR, 2022

Inferring Speaking Styles from Multi-modal Conversational Context by Multi-scale Relational Graph Convolutional Networks.

Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022

Non-intrusive Speech Quality Assessment with a Multi-Task Learning based Subband Adaptive Attention Temporal Convolutional Neural Network.

Co-authors include Chengshuai Zhao and Chuanzeng Huang.

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

VoiceFixer: A Unified Framework for High-Fidelity Speech Restoration.

Co-authors include Chuanzeng Huang.

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Neufa: Neural Network Based End-to-End Forced Alignment with Bidirectional Attention Mechanism.

Proceedings of the IEEE International Conference on Acoustics, 2022

Cloning One's Voice Using Very Limited Data in the Wild.

Proceedings of the IEEE International Conference on Acoustics, 2022

2021

High-Resolution Piano Transcription With Pedals by Regressing Onset and Offset Times.

IEEE ACM Trans. Audio Speech Lang. Process., 2021

Neural Dubber: Dubbing for Silent Videos According to Scripts.

CoRR, 2021

VoiceFixer: Toward General Speech Restoration With Neural Vocoder.

Co-authors include Chuanzeng Huang.

CoRR, 2021

Joint Echo Cancellation and Noise Suppression based on Cascaded Magnitude and Complex Mask Estimation.

Co-authors include Chuanzeng Huang.

CoRR, 2021

CatNet: music source separation system with mix-audio augmentation.

CoRR, 2021

Neural Dubber: Dubbing for Videos According to Scripts.

Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

Decoupling Magnitude and Phase Estimation with Deep ResUNet for Music Source Separation.

Proceedings of the 22nd International Society for Music Information Retrieval Conference, 2021

Listen, Read, and Identify: Multimodal Singing Language Identification of Music.

Proceedings of the 22nd International Society for Music Information Retrieval Conference, 2021

ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders.

Proceedings of the 12th International Symposium on Chinese Spoken Language Processing, 2021

Speech Enhancement with Weakly Labelled Data from AudioSet.

Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Supervised Chorus Detection for Popular Music Using Convolutional Neural Network and Multi-Task Learning.

Co-authors include Jordan B. L. Smith.

Proceedings of the IEEE International Conference on Acoustics, 2021

Modeling the Compatibility of Stem Tracks to Generate Music Mashups.

Co-authors include Jordan B. L. Smith.

Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021

2020

PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition.

Co-authors include Mark D. Plumbley.

IEEE ACM Trans. Audio Speech Lang. Process., 2020

Adversarial Feature Learning and Unsupervised Clustering Based Speech Synthesis for Found Data With Acoustic and Textual Noise.

IEEE Signal Process. Lett., 2020

Large-Scale MIDI-based Composer Classification.

CoRR, 2020

High-resolution Piano Transcription with Pedals by Regressing Onsets and Offsets Times.

CoRR, 2020

Noise Robust TTS for Low Resource Speakers using Pre-trained Model and Speech Enhancement.

CoRR, 2020

Improving Accent Conversion with Reference Encoder and End-To-End Text-To-Speech.

CoRR, 2020

A Hybrid Text Normalization System Using Multi-Head Self-Attention For Mandarin.

Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

A Unified Sequence-to-Sequence Front-End Model for Mandarin Text-to-Speech Synthesis.

Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Source Separation with Weakly Labelled Data: an Approach to Computational Auditory Scene Analysis.

Co-authors include Mark D. Plumbley.

Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Self-Supervised Audio-Visual Representation Learning for in-the-wild Videos.

Co-authors include Ashok K. Krishnamurthy.

Proceedings of the 2020 IEEE International Conference on Big Data (IEEE BigData 2020), 2020

Xiaomingbot: A Multilingual Robot News Reporter.

Co-authors include Songcheng Jiang.

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2020

2019

Hierarchical Generative Modeling for Controllable Speech Synthesis.

Proceedings of the 7th International Conference on Learning Representations, 2019

Disentangling Correlated Speaker and Noise for Speech Synthesis via Data Augmentation and Adversarial Factorization.

Proceedings of the IEEE International Conference on Acoustics, 2019

Semi-supervised Training for Improving Data Efficiency in End-to-end Speech Synthesis.

Co-authors include R. J. Skerry-Ryan.

Proceedings of the IEEE International Conference on Acoustics, 2019

Learning Hierarchical Representations for Expressive Speaking Style in End-to-End Speech Synthesis.

Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

2018

Predicting Expressive Speaking Style from Text in End-To-End Speech Synthesis.

Co-authors include R. J. Skerry-Ryan.

Proceedings of the 2018 IEEE Spoken Language Technology Workshop, 2018

Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis.

Co-authors include R. J. Skerry-Ryan and Eric Battenberg.

Proceedings of the 35th International Conference on Machine Learning, 2018

Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron.

Co-authors include R. J. Skerry-Ryan and Eric Battenberg.

Proceedings of the 35th International Conference on Machine Learning, 2018

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.

Co-authors include R. J. Skerry-Ryan and Yannis Agiomyrgiannakis.

Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

2017

Uncovering Latent Style Factors for Expressive Speech Synthesis.

Co-authors include R. J. Skerry-Ryan and Eric Battenberg.

CoRR, 2017

Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model.

Co-authors include R. J. Skerry-Ryan and Yannis Agiomyrgiannakis.

CoRR, 2017

Tacotron: Towards End-to-End Speech Synthesis.

Co-authors include R. J. Skerry-Ryan and Yannis Agiomyrgiannakis.

Proceedings of the 18th Annual Conference of the International Speech Communication Association, 2017

Trainable frontend for robust and far-field keyword spotting.

Co-authors include Pascal Getreuer and Richard F. Lyon.

Proceedings of the 2017 IEEE International Conference on Acoustics, 2017

2016

Complex Ratio Masking for Monaural Speech Separation.

Co-authors include Donald S. Williamson.

IEEE ACM Trans. Audio Speech Lang. Process., 2016

Noise perturbation for supervised speech separation.

Speech Commun., 2016

Complex ratio masking for joint enhancement of magnitude and phase.

Co-authors include Donald S. Williamson.

Proceedings of the 2016 IEEE International Conference on Acoustics, 2016

2015

Cochannel Speaker Identification in Anechoic and Reverberant Conditions.

IEEE ACM Trans. Audio Speech Lang. Process., 2015

Learning Spectral Mapping for Speech Dereverberation and Denoising.

Co-authors include William S. Woods.

IEEE ACM Trans. Audio Speech Lang. Process., 2015

Deep neural networks for cochannel speaker identification.

Proceedings of the 2015 IEEE International Conference on Acoustics, 2015

Deep neural networks for estimating speech model activations.

Co-authors include Donald S. Williamson.

Proceedings of the 2015 IEEE International Conference on Acoustics, 2015

A deep neural network for time-domain signal reconstruction.

Proceedings of the 2015 IEEE International Conference on Acoustics, 2015

Noise Perturbation Improves Supervised Speech Separation.

Proceedings of the Latent Variable Analysis and Signal Separation, 2015

2014

On training targets for supervised speech separation.

IEEE ACM Trans. Audio Speech Lang. Process., 2014

A feature study for classification-based speech separation at low signal-to-noise ratios.

IEEE ACM Trans. Audio Speech Lang. Process., 2014

Robust speaker identification in noisy and reverberant conditions.

Proceedings of the IEEE International Conference on Acoustics, 2014

A two-stage approach for improving the perceptual quality of separated speech.

Co-authors include Donald S. Williamson.

Proceedings of the IEEE International Conference on Acoustics, 2014

A structure-preserving training target for supervised speech separation.

Proceedings of the IEEE International Conference on Acoustics, 2014

Learning spectral mapping for speech dereverberation.

Proceedings of the IEEE International Conference on Acoustics, 2014

A feature study for classification-based speech separation at very low signal-to-noise ratio.

Proceedings of the IEEE International Conference on Acoustics, 2014

2013

Towards Scaling Up Classification-Based Speech Separation.

IEEE Trans. Speech Audio Process., 2013

Exploring Monaural Features for Classification-Based Speech Segregation.

IEEE Trans. Speech Audio Process., 2013

A sparse representation approach for perceptual quality improvement of separated speech.

Co-authors include Donald S. Williamson.

Proceedings of the IEEE International Conference on Acoustics, 2013

Feature denoising for speech separation in unknown noisy environments.

Proceedings of the IEEE International Conference on Acoustics, 2013

2012

Cocktail Party Processing via Structured Prediction.

Proceedings of the Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012

Boosting Classification Based Speech Separation Using Temporal Dynamics.

Proceedings of the 13th Annual Conference of the International Speech Communication Association, 2012

Acoustic Features for Classification Based Speech Separation.

Proceedings of the 13th Annual Conference of the International Speech Communication Association, 2012
