Yong Xu

Orcid: 0000-0003-4944-6890

Affiliations:

Tencent America LLC, Seattle, USA
University of Surrey, Centre for Vision, Speech and Signal Processing, Guildford, UK (former)
University of Science and Technology of China, Hefei, China (PhD 2015)

According to our database¹, Yong Xu authored at least 96 papers between 2012 and 2025.

Collaborative distances:

Dijkstra number² of four.
Erdős number³ of four.

Bibliography

2025

Region-Specific Audio Tagging for Spatial Sound.

CoRR, September, 2025

VietASR: Achieving Industry-level Vietnamese ASR with 50-hour labeled data and Large-Scale Speech Pretraining.

Proceedings of the 26th Annual Conference of the International Speech Communication Association, 2025

EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer.

Proceedings of the 26th Annual Conference of the International Speech Communication Association, 2025

HDMoLE: Mixture of LoRA Experts with Hierarchical Routing and Dynamic Thresholds for Fine-Tuning LLM-based ASR Models.

Proceedings of the 2025 IEEE International Conference on Acoustics, 2025

2024

Labelled Non-Zero Diffusion Particle Flow SMC-PHD Filtering for Multi-Speaker Tracking.

IEEE Trans. Multim., 2024

LibriheavyMix: A 20,000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation, ASR and Speaker Diarization.

CoRR, 2024

Advancing Multi-Talker ASR Performance With Large Language Models.

Proceedings of the IEEE Spoken Language Technology Workshop, 2024

SpatialEmb: Extract and Encode Spatial Information for 1-Stage Multi-Channel Multi-Speaker ASR on Arbitrary Microphone Arrays.

Proceedings of the IEEE Spoken Language Technology Workshop, 2024

Multi-Channel Multi-Speaker ASR Using Target Speaker's Solo Segment.

Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

LibriheavyMix: A 20,000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation, ASR and Speaker Diarization.

Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

uSee: Unified Speech Enhancement And Editing with Conditional Diffusion Models.

Proceedings of the IEEE International Conference on Acoustics, 2024

SpatialCodec: Neural Spatial Speech Coding.

Proceedings of the IEEE International Conference on Acoustics, 2024

Text-Queried Target Sound Event Localization.

Proceedings of the 32nd European Signal Processing Conference, 2024

2023

Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions.

CoRR, 2023

Audio Visual Speaker Localization from EgoCentric Views.

CoRR, 2023

Zoneformer: On-device Neural Beamformer For In-car Multi-zone Speech Separation, Enhancement and Echo Cancellation.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Deep Neural Mel-Subband Beamformer for in-Car Speech Separation.

Proceedings of the IEEE International Conference on Acoustics, 2023

NeuralEcho: Hybrid of Full-Band and Sub-Band Recurrent Neural Network For Acoustic Echo Cancellation and Speech Enhancement.

Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

2022

Neural Sound Field Decomposition with Super-resolution of Sound Direction.

CoRR, 2022

NeuralEcho: A Self-Attentive Recurrent Neural Network For Unified Acoustic Echo Suppression And Speech Enhancement.

CoRR, 2022

EEND-SS: Joint End-to-End Neural Speaker Diarization and Speech Separation for Flexible Number of Speakers.

Proceedings of the IEEE Spoken Language Technology Workshop, 2022

Audio Visual Multi-Speaker Tracking with Improved GCF and PMBM Filter.

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Joint Neural AEC and Beamforming with Double-Talk Detection.

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Audio-Visual Tracking of Multiple Speakers Via a PMBM Filter.

Proceedings of the IEEE International Conference on Acoustics, 2022

Visually Assisted Self-supervised Audio Speaker Localization and Tracking.

Proceedings of the 30th European Signal Processing Conference, 2022

2021

Multi-Channel Multi-Frame ADL-MVDR for Target Speech Separation.

IEEE ACM Trans. Audio Speech Lang. Process., 2021

An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation.

IEEE ACM Trans. Audio Speech Lang. Process., 2021

Joint AEC and Beamforming with Double-Talk Detection using RNN-Transformer.

CoRR, 2021

Generalized RNN beamformer for target speech separation.

CoRR, 2021

WPD++: An Improved Neural Beamformer for Simultaneous Speech Separation and Dereverberation.

Proceedings of the IEEE Spoken Language Technology Workshop, 2021

Neural Mask based Multi-channel Convolutional Beamforming for Joint Dereverberation, Echo Cancellation and Denoising.

Proceedings of the IEEE Spoken Language Technology Workshop, 2021

MetricNet: Towards Improved Modeling For Non-Intrusive Speech Quality Assessment.

Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Generalized Spatio-Temporal RNN Beamformer for Target Speech Separation.

Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

TeCANet: Temporal-Contextual Attention Network for Environment-Aware Speech Dereverberation.

Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

MIMO Self-Attentive RNN Beamformer for Multi-Speaker Speech Separation.

Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

ADL-MVDR: All Deep Learning MVDR Beamformer for Target Speech Separation.

Proceedings of the IEEE International Conference on Acoustics, 2021

Directional ASR: A New Paradigm for E2E Multi-Speaker Speech Recognition with Source Localization.

Aswin Shanmugam Subramanian

Proceedings of the IEEE International Conference on Acoustics, 2021

2020

Sound Event Detection of Weakly Labelled Data With CNN-Transformer and Automatic Threshold Optimization.

IEEE ACM Trans. Audio Speech Lang. Process., 2020

Audio-Visual Speech Separation and Dereverberation With a Two-Stage Multimodal Network.

IEEE J. Sel. Top. Signal Process., 2020

Multi-Modal Multi-Channel Target Speech Separation.

IEEE J. Sel. Top. Signal Process., 2020

Audio-Visual Multi-Channel Recognition of Overlapped Speech.

Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Neural Spatio-Temporal Beamformer for Target Speech Separation.

Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Far-Field Location Guided Target Speech Extraction Using End-to-End Speech Recognition Objectives.

Aswin Shanmugam Subramanian

Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Enhancing End-to-End Multi-Channel Speech Separation Via Spatial Feature Learning.

Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Self-Supervised Learning for Audio-Visual Speaker Diarization.

Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

2019

Weakly Labelled AudioSet Tagging With Attention Neural Networks.

IEEE ACM Trans. Audio Speech Lang. Process., 2019

Sound Event Detection and Time-Frequency Segmentation from Weakly Labelled Data.

IEEE ACM Trans. Audio Speech Lang. Process., 2019

A Unified Framework for Speech Separation.

Fahimeh Bahmaninezhad

CoRR, 2019

End-to-End Multi-Channel Speech Separation.

CoRR, 2019

Cross-task learning for audio tagging, sound event detection and spatial localization: DCASE 2019 baseline systems.

CoRR, 2019

Weakly labelled AudioSet Classification with Attention Neural Networks.

CoRR, 2019

Improved Speaker-Dependent Separation for CHiME-5 Challenge.

Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Neural Spatial Filter: Target Speaker Speech Separation Assisted with Directional Information.

Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

A Comprehensive Study of Speech Separation: Spectrogram vs Waveform Separation.

Fahimeh Bahmaninezhad

Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Single-Channel Signal Separation and Deconvolution with Generative Adversarial Networks.

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019

Joint Training of Complex Ratio Mask Based Beamformer and Acoustic Model for Noise Robust ASR.

Proceedings of the IEEE International Conference on Acoustics, 2019

Acoustic Scene Generation with Conditional SampleRNN.

Proceedings of the IEEE International Conference on Acoustics, 2019

An Attention-based Neural Network Approach for Single Channel Speech Enhancement.

Proceedings of the IEEE International Conference on Acoustics, 2019

Time Domain Audio Visual Speech Separation.

Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

2018

Auxiliary Features from Laser-Doppler Vibrometer Sensor for Deep Neural Network Based Robust Speech Recognition.

J. Signal Process. Syst., 2018

DCASE 2018 Challenge baseline with convolutional neural networks.

CoRR, 2018

Iterative Deep Neural Networks for Speaker-Independent Binaural Blind Speech Separation.

Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

A Joint Separation-Classification Model for Sound Event Detection of Weakly Labelled Data.

Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

Audio Set Classification with Attention Model: A Probabilistic Perspective.

Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

Large-Scale Weakly Supervised Audio Classification Using Gated Convolutional Neural Network.

Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

Improving Reverberant Speech Separation with Binaural Cues Using Temporal Context and Convolutional Neural Networks.

Proceedings of the Latent Variable Analysis and Signal Separation, 2018

Supporting Audiography: Design of a System for Sentimental Sound Recording, Classification and Playback.

Proceedings of the HCI International 2018, 2018

Capsule Routing for Sound Event Detection.

Proceedings of the 26th European Signal Processing Conference, 2018

DCASE 2018 Challenge Surrey cross-task convolutional neural network baseline.

Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events, 2018

2017

Unsupervised Feature Learning Based on Deep Models for Environmental Audio Tagging.

IEEE ACM Trans. Audio Speech Lang. Process., 2017

Hierarchical deep neural network for multivariate regression.

Jun Du

Yong Xu

Pattern Recognit., 2017

Surrey-cvssp system for DCASE2017 challenge task4.

CoRR, 2017

Binaural and log-power spectra features with deep neural networks for speech-noise separation.

Proceedings of the 19th IEEE International Workshop on Multimedia Signal Processing, 2017

Attention and Localization Based on a Deep Convolutional Recurrent Model for Weakly Supervised Audio Tagging.

Proceedings of the 18th Annual Conference of the International Speech Communication Association, 2017

Convolutional gated recurrent neural network incorporating spatial features for audio tagging.

Proceedings of the 2017 International Joint Conference on Neural Networks, 2017

A joint detection-classification model for audio tagging of weakly labelled data.

Proceedings of the 2017 IEEE International Conference on Acoustics, 2017

Fast tagging of natural sounds using marginal co-regularization.

Proceedings of the 2017 IEEE International Conference on Acoustics, 2017

Joint detection and classification convolutional neural network on weakly labelled bird audio detection.

Qiuqiang Kong

Yong Xu

Mark D. Plumbley

Proceedings of the 25th European Signal Processing Conference, 2017

2016

Joint training of DNNs by incorporating an explicit dereverberation structure for distant speech recognition.

EURASIP J. Adv. Signal Process., 2016

Hierarchical learning for DNN-based acoustic scene classification.

CoRR, 2016

Fully Deep Neural Networks Incorporating Unsupervised Feature Learning for Audio Tagging.

CoRR, 2016

Deep neural network for robust speech recognition with auxiliary features from laser-Doppler vibrometer sensor.

Proceedings of the 10th International Symposium on Chinese Spoken Language Processing, 2016

Hierarchical Learning for DNN-Based Acoustic Scene Classification.

Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events, 2016

Fully DNN-Based Multi-Label Regression for Audio Tagging.

Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events, 2016

2015

A Regression Approach to Speech Enhancement Based on Deep Neural Networks.

IEEE ACM Trans. Audio Speech Lang. Process., 2015

Multi-objective learning and mask-based post-processing for deep neural network based speech enhancement.

Proceedings of the 16th Annual Conference of the International Speech Communication Association, 2015

DNN-based speech bandwidth expansion and its application to adding high-frequency missing features for automatic speech recognition of narrowband speech.

Proceedings of the 16th Annual Conference of the International Speech Communication Association, 2015

Improving Deep Neural Network Based Speech Enhancement in Low SNR Environments.

Tian Gao

Jun Du