Shuai Wang

Haizhou Li

Proceedings of the 26th Annual Conference of the International Speech Communication Association, 2025

MoMuSE: Momentum Multi-modal Target Speaker Extraction for Real-time Scenarios with Impaired Visual Cues.

Proceedings of the IEEE International Conference on Multimedia and Expo, 2025

Multi-Level Speaker Representation for Target Speaker Extraction.

Proceedings of the 2025 IEEE International Conference on Acoustics, 2025

E1 TTS: Simple and Fast Non-Autoregressive TTS.

Proceedings of the 2025 IEEE International Conference on Acoustics, 2025

MacST: Multi-Accent Speech Synthesis via Text Transliteration for Accent Conversion.

Proceedings of the 2025 IEEE International Conference on Acoustics, 2025

VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech.

Proceedings of the 2025 IEEE International Conference on Acoustics, 2025

Flow-TSVAD: Target-Speaker Voice Activity Detection via Latent Flow Matching for Speaker Diarization.

Proceedings of the 2025 IEEE International Conference on Acoustics, 2025

REF-VC: Robust, Expressive and Fast Zero-Shot Voice Conversion with Diffusion Transformers.

Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2025

DiffRhythm+: Controllable and Flexible Full-Length Song Generation with Preference Optimization.

Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2025

SongEditor: Adapting Zero-Shot Song Generation Language Model as a Multi-Task Editor.

Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence, 2025

Drop the Beat! Freestyler for Accompaniment Conditioned Rapping Voice Generation.

Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence, 2025

2024

Speech Separation With Pretrained Frontend to Minimize Domain Mismatch.

IEEE ACM Trans. Audio Speech Lang. Process., 2024

Overview of Speaker Modeling and Its Applications: From the Lens of Deep Speaker Representation Learning.

IEEE ACM Trans. Audio Speech Lang. Process., 2024

Attention-Based Encoder-Decoder End-to-End Neural Diarization With Embedding Enhancer.

IEEE ACM Trans. Audio Speech Lang. Process., 2024

Advancing speaker embedding learning: Wespeaker toolkit for research and production.

Speech Commun., 2024

MoMuSE: Momentum Multi-modal Target Speaker Extraction for Real-time Scenarios with Impaired Visual Cues.

CoRR, 2024

Flow-TSVAD: Target-Speaker Voice Activity Detection via Latent Flow Matching.

CoRR, 2024

vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders.

CoRR, 2024

Autoregressive Diffusion Transformer for Text-to-Speech Synthesis.

CoRR, 2024

Hierarchical Emotion Prediction and Control in Text-to-Speech Synthesis.

CoRR, 2024

The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge.

CoRR, 2024

Fine-Grained Quantitative Emotion Editing for Speech Generation.

CoRR, 2024

M-Vec: Matryoshka Speaker Embeddings with Flexible Dimensions.

Pengcheng Zhu

Haizhou Li

Proceedings of the Social Robotics - 16th International Conference, 2024

Enhancing Speaker Extraction Through Rectifying Target Confusion.

Proceedings of the IEEE Spoken Language Technology Workshop, 2024

Attention-Constrained Inference For Robust Decoder-Only Text-to-Speech.

Proceedings of the IEEE Spoken Language Technology Workshop, 2024

On the Effectiveness of Enrollment Speech Augmentation For Target Speaker Extraction.

Proceedings of the IEEE Spoken Language Technology Workshop, 2024

Hierarchical Multi-Path and Multi-Model Selection For Fake Speech Detection.

Proceedings of the IEEE Spoken Language Technology Workshop, 2024

Disentangling The Prosody And Semantic Information With Pre-Trained Model For In-Context Learning Based Zero-Shot Voice Conversion.

Proceedings of the IEEE Spoken Language Technology Workshop, 2024

The ISCSLP 2024 Conversational Voice Clone (CoVoC) Challenge: Tasks, Results and Findings.

Proceedings of the 14th IEEE International Symposium on Chinese Spoken Language Processing, 2024

Prototype and Instance Contrastive Learning for Unsupervised Domain Adaptation in Speaker Verification.

Proceedings of the 14th IEEE International Symposium on Chinese Spoken Language Processing, 2024

The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge.

Proceedings of the 14th IEEE International Symposium on Chinese Spoken Language Processing, 2024

Combining Self-Supervised Learning and Adversarial Training Based Domain Adaptation for Speaker Verification.

Proceedings of the 14th IEEE International Symposium on Chinese Spoken Language Processing, 2024

Diffusion-Based Method with TTS Guidance for Foreign Accent Conversion.

Proceedings of the 14th IEEE International Symposium on Chinese Spoken Language Processing, 2024

Whisper-PMFA: Partial Multi-Scale Feature Aggregation for Speaker Verification using Whisper Models.

Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

WeSep: A Scalable and Flexible Toolkit Towards Generalizable Target Speaker Extraction.

Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

DualVC 3: Leveraging Language Model Generated Pseudo Context for End-to-end Low Latency Streaming Voice Conversion.

Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark.

Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

On the Effectiveness of Acoustic BPE in Decoder-Only TTS.

Proceedings of the 25th Annual Conference of the International Speech Communication Association, 2024

AutoPrep: An Automatic Preprocessing Framework for In-The-Wild Speech Data.

Proceedings of the IEEE International Conference on Acoustics, 2024

Leveraging in-the-wild Data for Effective Self-supervised Pretraining in Speaker Recognition.

Proceedings of the IEEE International Conference on Acoustics, 2024

DualVC 2: Dynamic Masked Convolution for Unified Streaming and Non-Streaming Voice Conversion.

Proceedings of the IEEE International Conference on Acoustics, 2024

Audio-Visual Active Speaker Extraction for Sparsely Overlapped Multi-Talker Speech.

Proceedings of the IEEE International Conference on Acoustics, 2024

Hierarchical Emotion Prediction and Control in Text-to-Speech Synthesis.

Proceedings of the IEEE International Conference on Acoustics, 2024

Robust Cross-Domain Speaker Verification with Multi-Level Domain Adapters.

Proceedings of the IEEE International Conference on Acoustics, 2024

Fine-Grained Quantitative Emotion Editing for Speech Generation.

Proceedings of the Asia Pacific Signal and Information Processing Association Annual Summit and Conference, 2024

UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding.

Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

2023

The NUS-HLT System for ICASSP2024 ICMC-ASR Grand Challenge.

Mehmet Sinan Yildirim

Haizhou Li

Mengling Feng

CoRR, 2023

USED: Universal Speaker Extraction and Diarization.

Junyi Ao

Mehmet Sinan Yildirim

CoRR, 2023

Wespeaker baselines for VoxSRC2023.

CoRR, 2023

DualVC: Dual-mode Voice Conversion using Intra-model Knowledge Distillation and Hybrid Predictive Coding.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Attention-based Encoder-Decoder Network for End-to-End Neural Speaker Diarization with Target Speaker Attractor.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Adversarial Speaker Disentanglement Using Unannotated External Data for Self-supervised Representation-based Voice Conversion.

Proceedings of the IEEE International Conference on Multimedia and Expo, 2023

Wespeaker: A Research and Production Oriented Speaker Embedding Learning Toolkit.

Proceedings of the IEEE International Conference on Acoustics, 2023

2022

DF-ResNet: Boosting Speaker Verification Performance with Depth-First Design.

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Context-aware Multimodal Fusion for Emotion Recognition.

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Self-Knowledge Distillation via Feature Enhancement for Speaker Verification.

Proceedings of the IEEE International Conference on Acoustics, 2022

On the Importance of Different Frequency Bins for Speaker Verification.

Proceedings of the IEEE International Conference on Acoustics, 2022

2021

Audio-Visual Deep Neural Network for Robust Person Verification.

IEEE ACM Trans. Audio Speech Lang. Process., 2021

Voice Activity Detection in the Wild: A Data-Driven Approach Using Teacher-Student Training.

IEEE ACM Trans. Audio Speech Lang. Process., 2021

Revisiting the Statistics Pooling Layer in Deep Speaker Embedding Learning.

Proceedings of the 12th International Symposium on Chinese Spoken Language Processing, 2021

Speaker Embedding Augmentation with Noise Distribution Matching.

Proceedings of the 12th International Symposium on Chinese Spoken Language Processing, 2021

Non-Parallel Any-to-Many Voice Conversion by Replacing Speaker Statistics.

Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Unit Selection Synthesis Based Data Augmentation for Fixed Phrase Speaker Verification.

Proceedings of the IEEE International Conference on Acoustics, 2021

SynAug: Synthesis-Based Data Augmentation for Text-Dependent Speaker Verification.

Proceedings of the IEEE International Conference on Acoustics, 2021

Self-Supervised Learning Based Domain Adaptation for Robust Speaker Verification.

Proceedings of the IEEE International Conference on Acoustics, 2021

2020

Data Augmentation Using Deep Generative Models for Embedding Based Speaker Recognition.

IEEE ACM Trans. Audio Speech Lang. Process., 2020

End-to-End Speaker-Dependent Voice Activity Detection.

CoRR, 2020

Analysis of ABC Submission to NIST SRE 2019 CMN and VAST Challenge.

Proceedings of the Odyssey 2020: The Speaker and Language Recognition Workshop, 2020

Dual-Adversarial Domain Adaptation for Generalized Replay Attack Detection.

Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Adversarial Domain Adaptation for Speaker Verification Using Partially Shared Network.

Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Multi-Modality Matters: A Performance Leap on VoxCeleb.

Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Text Adaptation for Speaker Verification with Speaker-Text Factorized Embeddings.

Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

BUT System for the Second DIHARD Speech Diarization Challenge.

Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Optimizing Bayesian HMM Based x-Vector Clustering for the Second DIHARD Speech Diarization Challenge.

Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Channel Invariant Speaker Embedding Learning with Joint Multi-Task and Adversarial Training.

Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Investigation of SpecAugment for Deep Speaker Embedding Learning.

Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

2019

Discriminative Neural Embedding Learning for Short-Duration Text-Independent Speaker Verification.

IEEE ACM Trans. Audio Speech Lang. Process., 2019

BUT System Description to VoxCeleb Speaker Recognition Challenge 2019.

CoRR, 2019

The SJTU Robust Anti-Spoofing System for the ASVspoof 2019 Challenge.

Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Data Augmentation Using Variational Autoencoder for Embedding Based Speaker Verification.

Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

On the Usage of Phonetic Information for Text-Independent Speaker Embedding Extraction.

Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Cross-Domain Replay Spoofing Attack Detection Using Domain Adversarial Training.

Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Bayesian HMM Based x-Vector Clustering for Speaker Diarization.

Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Knowledge Distillation for Small Foot-print Deep Speaker Embedding.

Proceedings of the IEEE International Conference on Acoustics, 2019

Margin Matters: Towards More Discriminative Deep Neural Network Embeddings for Speaker Recognition.

Proceedings of the 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2019

2018

Erratum to: Past review, current progress, and challenges ahead on the cocktail party problem.

Frontiers Inf. Technol. Electron. Eng., 2018

Past review, current progress, and challenges ahead on the cocktail party problem.

Frontiers Inf. Technol. Electron. Eng., 2018

Generative Adversarial Networks based X-vector Augmentation for Robust Probabilistic Linear Discriminant Analysis in Speaker Verification.

Proceedings of the 11th International Symposium on Chinese Spoken Language Processing, 2018

Deep Discriminant Analysis for i-vector Based Robust Speaker Recognition.

Proceedings of the 11th International Symposium on Chinese Spoken Language Processing, 2018

Covariance Based Deep Feature for Text-Dependent Speaker Verification.

Proceedings of the Intelligence Science and Big Data Engineering, 2018

Angular Softmax for Short-Duration Text-independent Speaker Verification.

Zili Huang

Kai Yu

Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

Focal KL-Divergence Based Dilated Convolutional Neural Networks for Co-Channel Speaker Identification.

Kai Yu

Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

Joint I-Vector with End-to-End System for Short Duration Text-Independent Speaker Verification.

Zili Huang

Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

2017

What Does the Speaker Embedding Encode?
