Shuai Wang

ORCID: 0000-0003-1523-9631

Affiliations:
  • Chinese University of Hong Kong-Shenzhen (CUHK-SZ), Shenzhen Research Institute of Big Data, Shenzhen, China
  • Shanghai Jiao Tong University, Department of Computer Science and Engineering, China (PhD 2020)


According to our database, Shuai Wang authored at least 108 papers between 2012 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2025
Towards Hallucination-Free Music: A Reinforcement Learning Preference Optimization Framework for Reliable Song Generation.
CoRR, August, 2025

Accent Normalization Using Self-Supervised Discrete Tokens with Non-Parallel Data.
CoRR, July, 2025

MeMo: Attentional Momentum for Real-time Audio-visual Speaker Extraction under Impaired Visual Conditions.
CoRR, July, 2025

DiffRhythm+: Controllable and Flexible Full-Length Song Generation with Preference Optimization.
CoRR, July, 2025

Multi-Step Prediction and Control of Hierarchical Emotion Distribution in Text-to-Speech Synthesis.
CoRR, July, 2025

Investigation of Zero-shot Text-to-Speech Models for Enhancing Short-Utterance Speaker Verification.
CoRR, June, 2025

SpeechRefiner: Towards Perceptual Quality Refinement for Front-End Algorithms.
CoRR, June, 2025

Incorporating Linguistic Constraints from External Knowledge Source for Audio-Visual Target Speech Extraction.
CoRR, June, 2025

SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement.
CoRR, June, 2025

LeVo: High-Quality Song Generation with Multi-Preference Alignment.
CoRR, June, 2025

$C^{2}$AV-TSE: Context and Confidence-Aware Audio Visual Target Speaker Extraction.
IEEE J. Sel. Top. Signal Process., May, 2025

PersonaTAB: Predicting Personality Traits using Textual, Acoustic, and Behavioral Cues in Fully-Duplex Speech Dialogs.
CoRR, May, 2025

Causal Self-supervised Pretrained Frontend with Predictive Code for Speech Separation.
CoRR, April, 2025

$C^{2}$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction.
CoRR, April, 2025

Context-Aware Two-Step Training Scheme for Domain Invariant Speech Separation.
CoRR, March, 2025

ExPO: Explainable Phonetic Trait-Oriented Network for Speaker Verification.
IEEE Signal Process. Lett., 2025

Multi-Level Speaker Representation for Target Speaker Extraction.
Proceedings of the 2025 IEEE International Conference on Acoustics, 2025

E1 TTS: Simple and Fast Non-Autoregressive TTS.
Proceedings of the 2025 IEEE International Conference on Acoustics, 2025

MacST: Multi-Accent Speech Synthesis via Text Transliteration for Accent Conversion.
Proceedings of the 2025 IEEE International Conference on Acoustics, 2025

VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech.
Proceedings of the 2025 IEEE International Conference on Acoustics, 2025

Flow-TSVAD: Target-Speaker Voice Activity Detection via Latent Flow Matching for Speaker Diarization.
Proceedings of the 2025 IEEE International Conference on Acoustics, 2025

SongEditor: Adapting Zero-Shot Song Generation Language Model as a Multi-Task Editor.
Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025

Drop the Beat! Freestyler for Accompaniment Conditioned Rapping Voice Generation.
Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025

2024
Speech Separation With Pretrained Frontend to Minimize Domain Mismatch.
IEEE ACM Trans. Audio Speech Lang. Process., 2024

Overview of Speaker Modeling and Its Applications: From the Lens of Deep Speaker Representation Learning.
IEEE ACM Trans. Audio Speech Lang. Process., 2024

Attention-Based Encoder-Decoder End-to-End Neural Diarization With Embedding Enhancer.
IEEE ACM Trans. Audio Speech Lang. Process., 2024

Advancing speaker embedding learning: Wespeaker toolkit for research and production.
Speech Commun., 2024

Hierarchical Control of Emotion Rendering in Speech Synthesis.
CoRR, 2024

MoMuSE: Momentum Multi-modal Target Speaker Extraction for Real-time Scenarios with Impaired Visual Cues.
CoRR, 2024

The ISCSLP 2024 Conversational Voice Clone (CoVoC) Challenge: Tasks, Results and Findings.
CoRR, 2024

Flow-TSVAD: Target-Speaker Voice Activity Detection via Latent Flow Matching.
CoRR, 2024

vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders.
CoRR, 2024

Autoregressive Diffusion Transformer for Text-to-Speech Synthesis.
CoRR, 2024

Hierarchical Emotion Prediction and Control in Text-to-Speech Synthesis.
CoRR, 2024

The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge.
CoRR, 2024

Fine-Grained Quantitative Emotion Editing for Speech Generation.
CoRR, 2024

M-Vec: Matryoshka Speaker Embeddings with Flexible Dimensions.
Proceedings of the Social Robotics - 16th International Conference, 2024

Enhancing Speaker Extraction Through Rectifying Target Confusion.
Proceedings of the IEEE Spoken Language Technology Workshop, 2024

Attention-Constrained Inference For Robust Decoder-Only Text-to-Speech.
Proceedings of the IEEE Spoken Language Technology Workshop, 2024

On the Effectiveness of Enrollment Speech Augmentation For Target Speaker Extraction.
Proceedings of the IEEE Spoken Language Technology Workshop, 2024

Hierarchical Multi-Path and Multi-Model Selection For Fake Speech Detection.
Proceedings of the IEEE Spoken Language Technology Workshop, 2024

Disentangling The Prosody And Semantic Information With Pre-Trained Model For In-Context Learning Based Zero-Shot Voice Conversion.
Proceedings of the IEEE Spoken Language Technology Workshop, 2024

Prototype and Instance Contrastive Learning for Unsupervised Domain Adaptation in Speaker Verification.
Proceedings of the 14th IEEE International Symposium on Chinese Spoken Language Processing, 2024

The X-Lance Technical Report for Interspeech 2024 Speech Processing using Discrete Speech Unit Challenge.
Proceedings of the 14th IEEE International Symposium on Chinese Spoken Language Processing, 2024

Combining Self-Supervised Learning and Adversarial Training Based Domain Adaptation for Speaker Verification.
Proceedings of the 14th IEEE International Symposium on Chinese Spoken Language Processing, 2024

Diffusion-Based Method with TTS Guidance for Foreign Accent Conversion.
Proceedings of the 14th IEEE International Symposium on Chinese Spoken Language Processing, 2024

Whisper-PMFA: Partial Multi-Scale Feature Aggregation for Speaker Verification using Whisper Models.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

WeSep: A Scalable and Flexible Toolkit Towards Generalizable Target Speaker Extraction.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

DualVC 3: Leveraging Language Model Generated Pseudo Context for End-to-end Low Latency Streaming Voice Conversion.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

On the Effectiveness of Acoustic BPE in Decoder-Only TTS.
Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

AutoPrep: An Automatic Preprocessing Framework for In-The-Wild Speech Data.
Proceedings of the IEEE International Conference on Acoustics, 2024

Leveraging in-the-wild Data for Effective Self-supervised Pretraining in Speaker Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2024

DualVC 2: Dynamic Masked Convolution for Unified Streaming and Non-Streaming Voice Conversion.
Proceedings of the IEEE International Conference on Acoustics, 2024

Audio-Visual Active Speaker Extraction for Sparsely Overlapped Multi-Talker Speech.
Proceedings of the IEEE International Conference on Acoustics, 2024

Hierarchical Emotion Prediction and Control in Text-to-Speech Synthesis.
Proceedings of the IEEE International Conference on Acoustics, 2024

Robust Cross-Domain Speaker Verification with Multi-Level Domain Adapters.
Proceedings of the IEEE International Conference on Acoustics, 2024

Fine-Grained Quantitative Emotion Editing for Speech Generation.
Proceedings of the Asia Pacific Signal and Information Processing Association Annual Summit and Conference, 2024

UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding.
Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

2023
The NUS-HLT System for ICASSP2024 ICMC-ASR Grand Challenge.
CoRR, 2023

USED: Universal Speaker Extraction and Diarization.
CoRR, 2023

Wespeaker baselines for VoxSRC2023.
CoRR, 2023

DualVC: Dual-mode Voice Conversion using Intra-model Knowledge Distillation and Hybrid Predictive Coding.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Attention-based Encoder-Decoder Network for End-to-End Neural Speaker Diarization with Target Speaker Attractor.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Adversarial Speaker Disentanglement Using Unannotated External Data for Self-supervised Representation-based Voice Conversion.
Proceedings of the IEEE International Conference on Multimedia and Expo, 2023

Wespeaker: A Research and Production Oriented Speaker Embedding Learning Toolkit.
Proceedings of the IEEE International Conference on Acoustics, 2023

2022
DF-ResNet: Boosting Speaker Verification Performance with Depth-First Design.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Context-aware Multimodal Fusion for Emotion Recognition.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Self-Knowledge Distillation via Feature Enhancement for Speaker Verification.
Proceedings of the IEEE International Conference on Acoustics, 2022

On the Importance of Different Frequency Bins for Speaker Verification.
Proceedings of the IEEE International Conference on Acoustics, 2022

2021
Audio-Visual Deep Neural Network for Robust Person Verification.
IEEE ACM Trans. Audio Speech Lang. Process., 2021

Voice Activity Detection in the Wild: A Data-Driven Approach Using Teacher-Student Training.
IEEE ACM Trans. Audio Speech Lang. Process., 2021

Revisiting the Statistics Pooling Layer in Deep Speaker Embedding Learning.
Proceedings of the 12th International Symposium on Chinese Spoken Language Processing, 2021

Speaker Embedding Augmentation with Noise Distribution Matching.
Proceedings of the 12th International Symposium on Chinese Spoken Language Processing, 2021

Non-Parallel Any-to-Many Voice Conversion by Replacing Speaker Statistics.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Unit Selection Synthesis Based Data Augmentation for Fixed Phrase Speaker Verification.
Proceedings of the IEEE International Conference on Acoustics, 2021

SynAug: Synthesis-Based Data Augmentation for Text-Dependent Speaker Verification.
Proceedings of the IEEE International Conference on Acoustics, 2021

Self-Supervised Learning Based Domain Adaptation for Robust Speaker Verification.
Proceedings of the IEEE International Conference on Acoustics, 2021

2020
Data Augmentation Using Deep Generative Models for Embedding Based Speaker Recognition.
IEEE ACM Trans. Audio Speech Lang. Process., 2020

End-to-End Speaker-Dependent Voice Activity Detection.
CoRR, 2020


Dual-Adversarial Domain Adaptation for Generalized Replay Attack Detection.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Adversarial Domain Adaptation for Speaker Verification Using Partially Shared Network.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Multi-Modality Matters: A Performance Leap on VoxCeleb.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Text Adaptation for Speaker Verification with Speaker-Text Factorized Embeddings.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

BUT System for the Second DIHARD Speech Diarization Challenge.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Optimizing Bayesian HMM Based X-Vector Clustering for the Second DIHARD Speech Diarization Challenge.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Channel Invariant Speaker Embedding Learning with Joint Multi-Task and Adversarial Training.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Investigation of SpecAugment for Deep Speaker Embedding Learning.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

2019
Discriminative Neural Embedding Learning for Short-Duration Text-Independent Speaker Verification.
IEEE ACM Trans. Audio Speech Lang. Process., 2019

Erratum to: Past review, current progress, and challenges ahead on the cocktail party problem.
Frontiers Inf. Technol. Electron. Eng., 2019

BUT System Description to VoxCeleb Speaker Recognition Challenge 2019.
CoRR, 2019

The SJTU Robust Anti-Spoofing System for the ASVspoof 2019 Challenge.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Data Augmentation Using Variational Autoencoder for Embedding Based Speaker Verification.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

On the Usage of Phonetic Information for Text-Independent Speaker Embedding Extraction.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Cross-Domain Replay Spoofing Attack Detection Using Domain Adversarial Training.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Bayesian HMM Based x-Vector Clustering for Speaker Diarization.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Knowledge Distillation for Small Foot-print Deep Speaker Embedding.
Proceedings of the IEEE International Conference on Acoustics, 2019

Margin Matters: Towards More Discriminative Deep Neural Network Embeddings for Speaker Recognition.
Proceedings of the 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2019

2018
Past review, current progress, and challenges ahead on the cocktail party problem.
Frontiers Inf. Technol. Electron. Eng., 2018

Generative Adversarial Networks based X-vector Augmentation for Robust Probabilistic Linear Discriminant Analysis in Speaker Verification.
Proceedings of the 11th International Symposium on Chinese Spoken Language Processing, 2018

Deep Discriminant Analysis for i-vector Based Robust Speaker Recognition.
Proceedings of the 11th International Symposium on Chinese Spoken Language Processing, 2018

Covariance Based Deep Feature for Text-Dependent Speaker Verification.
Proceedings of the Intelligence Science and Big Data Engineering, 2018

Angular Softmax for Short-Duration Text-independent Speaker Verification.
Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

Focal KL-Divergence Based Dilated Convolutional Neural Networks for Co-Channel Speaker Identification.
Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

Joint I-Vector with End-to-End System for Short Duration Text-Independent Speaker Verification.
Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

2017
What Does the Speaker Embedding Encode?
Proceedings of the 18th Annual Conference of the International Speech Communication Association, 2017

Integrating online i-vector into GMM-UBM for text-dependent speaker verification.
Proceedings of the 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2017

2012
A Deformable Surface Model for Real-Time Water Drop Animation.
IEEE Trans. Vis. Comput. Graph., 2012
